
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 8, NOVEMBER 2010

Exploiting Prosody Hierarchy and Dynamic Features for Pitch Modeling and Generation in HMM-Based Speech Synthesis

Chi-Chun Hsia, Chung-Hsien Wu, Senior Member, IEEE, and Jung-Yun Wu

Abstract—This paper proposes a method for modeling and generating pitch in hidden Markov model (HMM)-based Mandarin speech synthesis by exploiting prosody hierarchy and dynamic pitch features. The prosodic structure of a sentence is represented by a prosody hierarchy, which is constructed from the predicted prosodic breaks using a supervised classification and regression tree (S-CART). The S-CART is trained by maximizing the proportional reduction of entropy to minimize the errors in the prediction of the prosodic breaks. The pitch contour of a speech sentence is estimated using the STRAIGHT algorithm and decomposed into the prosodic features (static features) at the prosodic word, syllable, and frame layers, based on the predicted prosodic structure. Dynamic features at each layer are estimated to preserve the temporal correlation between adjacent units. A hierarchical prosody model is constructed using an unsupervised CART (U-CART) for generating the pitch contour, with minimum description length (MDL) adopted in U-CART training. Objective and subjective evaluations with statistical hypothesis testing were conducted, and the results were compared with those of HMM-based pitch modeling. The comparison confirms the improved performance of the proposed method.

Index Terms—Dynamic features, hidden Markov model (HMM)-based speech synthesis, pitch modeling and generation, prosody hierarchy.

I. INTRODUCTION

Speech technology will be essential to the next generation of human–machine interaction. In the last decade, much effort has been devoted to developing high-quality speech synthesis using corpus-based and statistical-model-based approaches. Generally, a text-to-speech (TTS) system is logically composed of three main elements: text/linguistic analysis, prosodic information or model parameter generation, and speech synthesis. Text analysis is first used to transcribe letters into phonemes and to extract the linguistic features for speech synthesis. The linguistic features are then used to generate the prosodic information or model parameters, namely the pitch contour, energy contour, and duration. Finally, speech is synthesized by concatenating the selected synthesis units with modified prosody or by generating the speech output from the model parameters using statistical models.

The two major types of corpus-based speech synthesis are unit-selection-based and statistical-model-based methods. In a unit-selection-based method, the unit selection criterion and the cost function critically determine the quality of the synthetic speech. Tejedor et al. compared the use of grapheme- and phoneme-based speech units [1]. Wu et al. adopted variable-length units for synthesizing high-quality speech [2]. Various combinations of acoustic, prosodic, and phonological costs have been investigated for designing cost functions [3], [4]. However, unit-selection-based TTS requires a large amount of speech data in the synthesis phase to cover the various speech characteristics. During the last decade, methods of statistical speech synthesis based on hidden Markov models (HMMs) [5]–[7], in which only pre-trained models are required to synthesize smooth and intelligible speech, have been developed. Although the naturalness of speech synthesized by HMM-based synthesis remains unsatisfactory, the method can flexibly capture speaking styles and can be used in small, portable systems. In statistical methods, model parameters can be adjusted, enabling flexible adaptation to speakers and emotions [7]. Recently, the HMM-based method has come to dominate speech synthesis.

Prosody is evidently one of the most important features to be exploited in improving the naturalness of synthesized speech, since prosody is an inherent supra-segmental feature of human speech and carries paralinguistic information such as intent, emotion, formality, contrastive emphasis, and speaking style.

[Footnote: Manuscript received November 19, 2008; revised October 15, 2009. Date of publication April 05, 2010; date of current version September 01, 2010. This work was supported by the National Science Council of the Republic of China, Taiwan, under Contract NSC96-2221-E-006-155-MY3. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Gaël Richard. C.-C. Hsia is with the ICT-Enabled Healthcare Program, Industrial Technology Research Institute-South, Tainan 709, Taiwan (e-mail: [email protected]). C.-H. Wu and J.-Y. Wu are with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2010.2040791]
During the last few decades, a wide range of approaches to pitch prediction have been developed, including rule-based methods [8], [9] and statistical methods such as HMMs [10]. Pan et al. [11] utilized regression analysis to estimate functions for predicting the syllable duration, volume, and intonation of a sentence. Multiple linear regression functions were learned from information on words, phrases, breath groups, and sentences in a hierarchical manner. Chen et al. [12] introduced a recurrent neural network (RNN)-based prosody synthesis method for predicting pitch contour, energy, and sub-syllable/pause duration. Kim et al. [13] adopted vector quantization (VQ) to establish a Word Tone Dictionary for a Korean TTS system. Classification and regression tree (CART)


HSIA et al.: EXPLOITING PROSODY HIERARCHY AND DYNAMIC FEATURES FOR PITCH MODELING AND GENERATION

and Markov chain methods have been utilized to predict pitch target labels, which are then further converted into an intonation contour [14]. Wu et al. [15] introduced a template-based prosody modeling method, in which pitch contours were quantized as polynomial coefficients and recorded in a template tree constructed by considering the tone combination, word length, and POS of a word. Tao [16] also proposed a template-based model that was based on a prosody cost function, as well as a statistical training method to assign and adapt weights in template selection. Rule-based methods require considerable work by experts and are difficult to adapt to various speakers, emotions, or speaking styles. In statistical methods, pitch prediction models are estimated from the training corpus, and are adaptable.

Prosody can be regarded as a top-down hierarchical structure, including discourses, sentences, breath groups, prosodic phrases, prosodic words, and syllables/sub-syllables. Many approaches take prosody hierarchy into account in predicting pitch [11], [17]–[21]. In such approaches, the inclusion of prosody hierarchy has evidently improved prosody modeling. For example, in Tseng's study [17], the success of predicting syllable duration at any individual layer was only around that expected by chance, but cumulatively over 90% of the duration output was accounted for. Pan et al. also developed a hierarchical approach [11] by modeling pitch, duration, and intensity based on information from prosodic words, phrases, breath groups, and sentences. The prosody of fluent speech comprises intonational phrases, prosodic phrases, breath groups, prosodic words, and syllables [17]. It is not a linear system in which individual prosodic units are concatenated, but a top-down hierarchical structure for controlling intonation. Pin et al. [22] adopted phrase groups as prosodic units and used the Fujisaki model [23] with prosody hierarchy information for phrase command modeling.

HMM-based approaches also utilize prosody hierarchy information, including breath groups and accentual phrases, in training the state-tying decision tree [24]. ToBI is a widely used method of representing pitch contours; it provides a symbolic representation at the phonological layer [25]. The limitation of ToBI is that no automated analysis algorithm currently exists, so an expert must manually annotate the corpus. Models other than ToBI for representing the surface of pitch contours have also been proposed, including the Fujisaki [23] and Tilt [26] models. These models provide succinct, deterministic algorithms for generating pitch contours, but the associated analysis processes all depend on event detection. Since all of these methods depend on the appearance of events at unrestricted points, individual parameters are not comparable and no trivial distance measure in the parameter space exists [27]. Teutenberg et al. [27] introduced a pitch contour modeling and synthesis method based on the discrete cosine transform (DCT). Latorre et al. [28] modeled the dynamic features of DCT coefficients, including the delta of the zeroth DCT coefficient and the gradient of the syllable average pitch at the joints between adjacent syllables. In this paper, since the syllable-based pitch contour in Mandarin speech is smooth and simple, the discrete Legendre polynomials are sufficient to characterize the pitch contour effectively; they are therefore adopted to represent pitch [29]. In the discrete Legendre polynomials of order four, the first three coefficients represent


the mean, the gradient, and the curvature of the pitch contour. Consequently, the distortion between two pitch contours can easily be estimated from the Euclidean distance between their Legendre coefficients.

This study presents a pitch prediction method for HMM-based Mandarin speech synthesis. A hierarchical prosodic structure is utilized as the basis of the pitch prediction model, treating the prosodic units as supra-segmental units in the hierarchical structure. Two basic prosodic units are considered in the design of the pitch prediction model: the prosodic word and the syllable. The dynamic pitch features of units in the prosodic structure are estimated to preserve the temporal correlation between adjacent units and to obtain a more natural pitch transition in the conjunction segment. The dynamic pitch features are estimated at the prosodic word, syllable, and frame layers. Since the Legendre transform is linear, the Euclidean distance between two sets of Legendre coefficients can be used to measure the distance between two contours. The differences between the gradients and curvatures of adjacent units (prosodic words or syllables), which are captured by the higher-order Legendre coefficients, tentatively maintain their temporal correlation in a speech sentence. For example, if the dynamic range of the pitch contour is large in a happy utterance, then the large gradients and curvatures are likely to exhibit temporal correlation between two adjacent units (prosodic words or syllables) in the sentence, and the differences between the gradients and curvatures will probably be small. In a tonal language, modeling the gradient and curvature differences can help to model the changes of consecutive tone patterns.

In this paper, the Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum (STRAIGHT) algorithm, proposed by Kawahara et al. [30], [31], is adopted to estimate the spectrum and pitch contours and thus synthesize speech with high quality. The STRAIGHT algorithm is a high-quality analysis and synthesis algorithm that combines pitch-adaptive spectral analysis with a surface reconstruction method in the time–frequency domain to eliminate signal periodicity. It extracts the fundamental frequency (F0) using a Time-domain Excitation extractor with a Minimum Perturbation Operator (TEMPO), which is also used to design an excitation source based on phase manipulation.

As shown in Fig. 1, a database of speech and corresponding text sentences is used as the training corpus. The text sentences are labeled with prosodic breaks. A supervised classification and regression tree (S-CART) is trained to predict prosodic breaks. The pitch contour of a sentence is analyzed using the STRAIGHT algorithm and decomposed based on its labeled hierarchical prosodic structure. The dynamic features are estimated at the prosodic word, syllable, and frame layers. A hierarchical pitch model, comprising two U-CARTs (one at the prosodic word layer and one at the syllable layer) and an HMM at the frame layer, is trained to predict pitch in the dynamic feature space.

Fig. 1. Training phase of the proposed pitch prediction method.

Fig. 2 depicts the synthesis procedure, which has two main parts: generation of the cepstrum and duration, and prediction of the pitch. The hierarchical prosodic structure of the input text sentence is predicted using the pre-trained prosodic break detection model. Mel-cepstrum and duration are generated using HMMs, and the pitch parameters are predicted using the hierarchical pitch model in the dynamic feature space. The output speech is synthesized using a mel log spectrum approximation (MLSA) filter [32].

Fig. 2. Synthesis phase of the proposed method.

The rest of this paper is organized as follows. Section II introduces the top-down pitch contour decomposition process and dynamic feature extraction based on the constructed prosodic structure. Section III elucidates the hierarchical modeling of pitch and the generation of pitch contours. Section IV presents the experimental results and compares the proposed method to HMM-based pitch modeling. Section V draws conclusions.

II. PITCH DECOMPOSITION AND FEATURE EXTRACTION

A. Pitch Contour Decomposition

The pitch contour of a sentence with $N$ prosodic words is denoted $\mathbf{F} = \{\mathbf{f}_1, \ldots, \mathbf{f}_N\}$. The pitch contour in the $n$th prosodic word with $T_n$ frames is represented as $\mathbf{f}_n = [f_n(1), \ldots, f_n(T_n)]^\top$. The pitch feature vector $\mathbf{c}_n$ for the $n$th prosodic word is calculated by encoding $\mathbf{f}_n$ using the discrete Legendre polynomials [29] of order two

$$\mathbf{c}_n = \mathbf{P}\,\mathbf{f}_n \tag{1}$$

The pitch residuals obtained by polynomial fitting to a prosodic word are further used to calculate the pitch feature of each syllable. Let $\mathbf{r}_j$ represent the pitch residuals of the $j$th syllable with $T_j$ frames. The corresponding pitch feature vector is calculated using the Legendre polynomial of order four as

$$\mathbf{c}_j = \mathbf{P}\,\mathbf{r}_j \tag{2}$$

where $\mathbf{P}$ represents the Legendre polynomial basis

$$\mathbf{P} = [\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_{K-1}]^\top \tag{3}$$

in which $\mathbf{p}_k$ is the $k$th-order discrete Legendre polynomial sampled at the frame instants. Additionally, the pitch residuals from the syllable layer are used as the features in pitch modeling at the frame layer.

Fig. 3 presents an example of pitch contour decomposition from the prosodic word layer down to the frame layer. If the zero pitch values (unvoiced frames) in prosodic words (PWs) were used in the contour parameterization, the fitted contours would tend toward zero. To reduce the parameterization error caused by zero pitch values, the pitch contour is linearly interpolated prior to parameterization. As presented in the figure, the residual from the PW-layer fitting is applied as the target in the syllable-layer fitting, and the residual of the syllable-layer fitting is further modeled by the proposed frame-layer model. In pitch decomposition, the prosodic word and syllable boundaries were manually labeled and fixed at each layer. The fitted contour at the prosodic word layer represents the intonation tendency of the prosodic word. At the syllable layer, the fitted contour is combined with the prosodic word layer contour to capture the variation of the pitch in the representation of the tone of the syllable.

B. Dynamic Feature Estimation

To preserve the temporal correlation between adjacent units, dynamic features are calculated at the prosodic word, syllable, and frame layers.
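The layer-wise Legendre fitting described above can be sketched as follows. This is a minimal illustration, not the authors' code: the basis is built here by Gram-Schmidt orthonormalization of the monomials (one common way to obtain discrete Legendre polynomials), and the linear interpolation over unvoiced (zero-pitch) frames follows the description in the text.

```python
import numpy as np

def discrete_legendre_basis(T, order):
    """Orthonormal discrete Legendre-style basis over T frames,
    built by Gram-Schmidt on the monomials 1, t, t^2, ..."""
    t = np.linspace(-1.0, 1.0, T)
    basis = []
    for k in range(order):
        v = t ** k
        for b in basis:
            v = v - np.dot(v, b) * b
        basis.append(v / np.linalg.norm(v))
    return np.stack(basis)  # shape (order, T)

def fit_pitch(contour, order=4):
    """Project a pitch contour onto the basis; return the static
    coefficient vector and the residual passed to the next layer."""
    contour = np.asarray(contour, dtype=float)
    # Linear interpolation over unvoiced (zero-pitch) frames.
    voiced = contour > 0
    if voiced.any() and not voiced.all():
        idx = np.arange(len(contour))
        contour = np.interp(idx, idx[voiced], contour[voiced])
    P = discrete_legendre_basis(len(contour), order)
    coeffs = P @ contour            # static pitch feature vector
    residual = contour - P.T @ coeffs
    return coeffs, residual
```

Applied top-down, the prosodic-word residual is re-fitted per syllable, and the syllable residual feeds the frame-layer model.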


In this paper, the prosodic feature vector $\mathbf{o}_n$ for the $n$th prosodic word at the prosodic word layer comprises the static pitch feature vector $\mathbf{c}_n$ (the Legendre polynomial coefficients) and the dynamic pitch feature vectors $\Delta\mathbf{c}_n$ and $\Delta^2\mathbf{c}_n$ (the delta and delta-delta Legendre polynomial coefficients, respectively); therefore, $\mathbf{o}_n = [\mathbf{c}_n^\top, \Delta\mathbf{c}_n^\top, \Delta^2\mathbf{c}_n^\top]^\top$. The dynamic feature vectors are estimated as

$$\Delta\mathbf{c}_n = w\,(\mathbf{c}_{n+1} - \mathbf{c}_{n-1}) \tag{4}$$

$$\Delta^2\mathbf{c}_n = (\mathbf{c}_{n-1} - \mathbf{c}_n) + (\mathbf{c}_{n+1} - \mathbf{c}_n) \tag{5}$$

where the pairs $(L_1^-, L_1^+)$ and $(L_2^-, L_2^+)$ represent the ranges over which the delta and delta-delta dynamic feature vectors, respectively, are estimated. In this paper, $(L_1^-, L_1^+)$ is set to (1, 1), so that $\Delta\mathbf{c}_n$ represents the weighted difference between the succeeding and preceding static feature vectors. Similarly, $(L_2^-, L_2^+)$ is set to (1, 1), and $\Delta^2\mathbf{c}_n$ denotes the sum of the differences between the static feature vector $\mathbf{c}_n$ and the preceding one and between $\mathbf{c}_n$ and the next one. In matrix form, the dynamic features are estimated as

$$\mathbf{O} = \mathbf{W}\mathbf{C} \tag{6}$$

where

$$\mathbf{O} = [\mathbf{o}_1^\top, \ldots, \mathbf{o}_N^\top]^\top, \quad \mathbf{C} = [\mathbf{c}_1^\top, \ldots, \mathbf{c}_N^\top]^\top \tag{7}$$

and $\mathbf{W}$ is the window matrix whose block rows compute the static, delta, and delta-delta components for each unit from its neighbors,

$$[\mathbf{0}, \mathbf{I}, \mathbf{0}], \quad [-w\mathbf{I}, \mathbf{0}, w\mathbf{I}], \quad [\mathbf{I}, -2\mathbf{I}, \mathbf{I}] \tag{8}$$

with

$$\mathbf{0}, \mathbf{I} \in \mathbb{R}^{K \times K} \tag{9}$$

where $K$ is the dimensionality of $\mathbf{c}_n$, and $\mathbf{0}$ and $\mathbf{I}$ represent the zero matrix and identity matrix, respectively. The dynamic features at the syllable and frame layers are estimated similarly.

Fig. 3. Curve fitting example for pitch contour decomposition.
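The delta and delta-delta estimation with ranges (1, 1) can be sketched as below. The weight w = 0.5 and the replicate-padding at the sequence edges are assumptions; the paper does not specify how boundary units are handled.

```python
import numpy as np

def dynamic_features(C, w=0.5):
    """Delta and delta-delta features over a sequence of static
    vectors C (N x K). Edge units are handled by replicating the
    first and last vectors (an assumption)."""
    C = np.asarray(C, dtype=float)
    Cp = np.vstack([C[:1], C, C[-1:]])       # replicate-pad both ends
    delta = w * (Cp[2:] - Cp[:-2])           # w * (next - previous)
    delta2 = Cp[:-2] - 2 * Cp[1:-1] + Cp[2:] # (prev - c) + (next - c)
    return delta, delta2
```

For a linearly increasing sequence, the interior deltas are constant and the delta-deltas vanish, matching the intended "rate of change" interpretation.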

III. HIERARCHICAL PROSODIC MODELING AND PITCH GENERATION

A hierarchical prosodic model for modeling and generating pitch from the top down is elucidated. Pitch models at the prosodic word and syllable layers are separately constructed based on two unsupervised CARTs (U-CARTs), and an HMM is used for frame-layer pitch modeling.

A. Prosodic Structure Construction

Fig. 4 depicts an example of the prosodic (upper part) and lexical (lower part) structures of a Mandarin sentence. The prosodic structure captures more information about the intonation than the lexical structure does, and better represents the breath breaks and the intonation of the utterance. In Mandarin, prosodic word and phrase breaks generally occur at lexical word boundaries, although not all lexical word boundaries are prosodic word/phrase breaks. The prosodic structure is obtained by first predicting the prosodic phrase breaks and the prosodic word breaks of the input sentence. The algorithm for predicting prosodic breaks proposed by Chu et al. [21] is adopted in this work. In the training corpus, each boundary between consecutive syllables is labeled as either a break or a non-break between prosodic phrases/words. A two-layered supervised CART, comprising a prosodic word (PW) S-CART and a prosodic phrase (PP) S-CART, is constructed, as presented in Fig. 5. First, all of the syllable boundaries are used for predicting PW breaks, and all of the predicted PW breaks are further used to predict PP breaks. Finally, each syllable boundary is categorized as a PP break, a PW break, or a non-break. The contextual features of each boundary between two consecutive syllables are extracted using a window with a size of five syllables (three before and two after the boundary), and are presented in Table I. The contextual features are extracted by a text analyzer, which incorporates the word segmentation and phonological analysis modules proposed by Wu et al. [15]. The Stanford parser [33] is also adopted for POS tagging.

To maximize the purity of the data samples in the child nodes at each split, the S-CART is trained by maximizing the proportional reduction of entropy (PRE), which is calculated as

$$\mathrm{PRE} = \frac{H - \sum_{k} \frac{N_k}{N} H_k}{H} \tag{10}$$

where $N$ and $N_k$ are the numbers of training data in the parent node and the $k$th child node, respectively, and $H$ and $H_k$ are the corresponding entropies.
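The PRE split criterion can be sketched as follows; this is an illustrative reading of eq. (10), with the entropies computed from empirical class frequencies as in eq. (11).

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical entropy of the break labels in a node."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def proportional_reduction_of_entropy(parent, children):
    """PRE of a candidate split: relative drop from the parent
    entropy to the size-weighted average of the child entropies."""
    n = len(parent)
    h_parent = entropy(parent)
    if h_parent == 0.0:
        return 0.0  # already pure; splitting cannot help
    h_children = sum(len(c) / n * entropy(c) for c in children)
    return (h_parent - h_children) / h_parent
```

A split that separates the classes perfectly drives the child entropies to zero and yields a PRE of 1.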


Fig. 4. Example showing the difference between the prosodic word (PW) breaks and the lexical word (LW) breaks. The prosodic structure is shown above the speech signal, and the syntactic structure is shown below the speech signal.

In each node, the entropy is estimated as

$$H = -\sum_{c=1}^{C} P(c) \log P(c) \tag{11}$$

where $P(c)$ is the observation probability of class $c$ in the node and $C$ represents the total number of classes.

B. Prosodic Word and Syllable Layer Pitch Modeling

The prosodic word and syllable layer pitch models are constructed in the dynamic feature space using U-CARTs. For each leaf node, a Gaussian model is constructed in the dynamic feature vector space, where $\boldsymbol{\mu}$ denotes the mean vector and $\boldsymbol{\Sigma}$ is the covariance matrix. Fig. 6 shows an example of the U-CART for pitch modeling. Each data sample comprises a dynamic feature vector $\mathbf{o}$ and a contextual feature vector. Syntactic, linguistic, and phonetic information is utilized as the attributes in the contextual feature vector and used as the splitting questions that split the data samples in the parent node into the left and right child nodes. The best question is selected for each node to minimize the randomness or variation among the data samples in the leaf nodes in the dynamic feature space. The minimum description length (MDL) [34], under the assumption that the data samples are normally distributed, is adopted as the criterion for best question selection

$$\Delta DL = \frac{1}{2}\left(N_l \log|\boldsymbol{\Sigma}_l| + N_r \log|\boldsymbol{\Sigma}_r| - N \log|\boldsymbol{\Sigma}|\right) + \alpha D \log N \tag{12}$$

where $\Delta DL$ is the difference between the description lengths of the child and parent nodes. $N$, $N_l$, and $N_r$ are the numbers of dynamic feature vectors in the parent, left child, and right child nodes, respectively, and $\boldsymbol{\Sigma}$, $\boldsymbol{\Sigma}_l$, and $\boldsymbol{\Sigma}_r$ denote the corresponding covariance matrices. $D$ represents the dimensionality of the dynamic feature vector, and $\alpha$ is a weighting factor that controls the penalty term according to the amount of training data. The difference between the description lengths of the child and parent nodes represents the reduction in randomness when the data samples are split from the parent node into the left and right child nodes. The contextual feature with the lowest $\Delta DL$ is selected as the splitting question in the parent node. A text analyzer is adopted to extract the contextual features, including the features presented in Table I that are used in the S-CART and the prosodic break information predicted by the S-CART.

C. Frame Layer Modeling

Dynamic features $\mathbf{O}$ at the frame layer are modeled using hidden Markov models, as

$$P(\mathbf{O} \mid \lambda) = \sum_{Q} P(\mathbf{O} \mid Q, \lambda)\, P(Q \mid \lambda) \tag{13}$$

where $\lambda$ is the set of HMM parameters and $Q$ is the state sequence. $P(\mathbf{O} \mid Q, \lambda)$ and $P(Q \mid \lambda)$ represent the observation probability of the dynamic features given the state sequence $Q$ and the prior probability of the state sequence, respectively. The expectation–maximization (EM) algorithm with the maximum-likelihood criterion [35] is adopted to estimate the parameters in HMM training.

Fig. 5. Block diagram for prosodic structure construction.
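The MDL splitting criterion of eq. (12) can be sketched as below, under the paper's Gaussian assumption. The covariance regularization term and the exact form of the penalty weighting are assumptions made here for a runnable illustration.

```python
import numpy as np

def delta_description_length(parent, left, right, alpha=1.0):
    """MDL change for splitting a node; each argument is an (N x D)
    array of dynamic feature vectors. Negative values favor the
    split. The small diagonal loading (1e-6) is an assumption to
    keep the log-determinant finite for near-degenerate nodes."""
    def logdet_cov(X):
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        _, logdet = np.linalg.slogdet(cov)
        return logdet

    N, D = parent.shape
    gain = 0.5 * (len(left) * logdet_cov(left)
                  + len(right) * logdet_cov(right)
                  - N * logdet_cov(parent))
    penalty = alpha * D * np.log(N)   # grows with the data, curbing overgrowth
    return gain + penalty
```

Splitting two well-separated clusters shrinks the per-node covariances sharply, so the gain term dominates the penalty and the split is accepted.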


TABLE I CONTEXTUAL FEATURES FOR S-CART MODELING

Fig. 7. Pitch contour generation from dynamic features.

D. Generation of Pitch Contours

In the synthesis phase, a contextual feature vector sequence that contains syntactic, linguistic, and phonetic information is extracted at the prosodic word, syllable, and frame layers. The model sequence $\hat{Q}$ is selected from the pre-trained pitch models, as shown in Fig. 7. The best static feature vector sequence $\mathbf{C}$ is calculated by maximizing the observation probability of $\mathbf{O} = \mathbf{W}\mathbf{C}$ given the model sequence $\hat{Q}$ using

$$\frac{\partial \log P(\mathbf{W}\mathbf{C} \mid \hat{Q}, \lambda)}{\partial \mathbf{C}} = \mathbf{0} \tag{14}$$

where $\mathbf{0}$ is a zero vector. Solving the above equation yields the estimated static feature sequence

$$\hat{\mathbf{C}} = \left(\mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{W}\right)^{-1} \mathbf{W}^\top \boldsymbol{\Sigma}^{-1} \mathbf{M} \tag{15}$$

where

$$\mathbf{M} = \left[\boldsymbol{\mu}_{q_1}^\top, \ldots, \boldsymbol{\mu}_{q_T}^\top\right]^\top \tag{16}$$

$$\boldsymbol{\Sigma} = \mathrm{diag}\left[\boldsymbol{\Sigma}_{q_1}, \ldots, \boldsymbol{\Sigma}_{q_T}\right] \tag{17}$$

and $\boldsymbol{\mu}_q$ and $\boldsymbol{\Sigma}_q$ are the mean vector and covariance matrix of the model $q$, respectively. To synchronize the two layers, the boundaries of prosodic words and syllables are aligned to the syllable boundaries generated by the speech synthesizer. The generated pitch contours from each layer are added to generate the final predicted pitch contour of the sentence:

$$f(t) = f^{pw}(t) + f^{syl}(t) + f^{fr}(t) \tag{18}$$

where $f^{pw}(t)$, $f^{syl}(t)$, and $f^{fr}(t)$ denote the generated pitch values at the prosodic word, syllable, and frame layers at time $t$, respectively.

Fig. 6. Pitch modeling for prosodic word and syllable layers.

IV. EXPERIMENTS AND RESULTS

Several experiments were conducted to evaluate the performance of the proposed method for predicting pitch. Five thousand sentences were randomly chosen from the TsingHua Corpus of Speech Synthesis (TH-CoSS) [36], of which 3000 were used to train the model (Set 1), 1000 were used to tune the weighting factor (Set 2), and 1000 were used for testing (Set 3). The sub-corpus provided by a female speaker was used in the following experiments. The prosodic boundaries of the corpus were labeled manually, and the labels of the various subjects were carefully examined for consistency. The labeled prosodic structure in the corpus has five layers (utterance, sentence, prosodic phrase, prosodic word, and syllable [36]); only the prosodic word and syllable boundaries were applied herein. Table II presents the basic statistics for the three sets of the speech corpus.

The HMM-based speech synthesis system (HTS) [5] was downloaded from the HTS website (http://hts.sp.nitech.ac.jp/) to construct the Mandarin HMM-based speech synthesizer, and the smooth spectrum extracted by the STRAIGHT algorithm was used to calculate the mel-cepstral coefficients of order 25. The tonal phone set that has been utilized in the Mandarin speech recognizer [37], [38] was adopted in the HMM-based Mandarin speech synthesizer. Table III shows the 107 segmental tonal phone models adopted in this work. The numbers of states and mixtures of each HMM were five and one, respectively.
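The closed-form generation of eq. (15) can be sketched as follows. For clarity the sketch assumes a one-dimensional static feature per frame (K = 1), diagonal covariances, and the same delta window as before (0.5 times next minus previous, replicated at the edges); these simplifications are assumptions, not the authors' exact setup.

```python
import numpy as np

def delta_window_matrix(T):
    """Stack identity (static) and delta rows so that W @ c yields
    the static values followed by the delta values for T frames."""
    I = np.eye(T)
    D = np.zeros((T, T))
    for t in range(T):
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        D[t, hi] += 0.5
        D[t, lo] -= 0.5
    return np.vstack([I, D])

def generate_static_sequence(M, sigma_diag, W):
    """ML static sequence c* = (W^T S^-1 W)^-1 W^T S^-1 M,
    with a diagonal covariance given by sigma_diag."""
    s_inv = 1.0 / sigma_diag
    A = W.T @ (s_inv[:, None] * W)   # W^T S^-1 W
    b = W.T @ (s_inv * M)            # W^T S^-1 M
    return np.linalg.solve(A, b)
```

When the mean vector M is itself consistent with some static trajectory (M = W c), the solution recovers that trajectory exactly; with conflicting static and delta means, it returns the variance-weighted compromise, which is what smooths the generated pitch.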


TABLE II BASIC STATISTICS OF SPEECH DATA SETS: SET 1 FOR TRAINING, SET 2 FOR TUNING, AND SET 3 FOR TESTING

TABLE III SEGMENTAL TONAL PHONE MODELS IN HMM-BASED MANDARIN SPEECH SYNTHESIZER

TABLE IV EVALUATION OF PROSODIC STRUCTURE CONSTRUCTION

TABLE V PITCH MODELING AND GENERATION METHODS

The states of all HMMs were tied using decision trees to enable state sharing. Information on 30 types of contextual features, covering phones, syllables, prosodic words/phrases, and sentences, was distributed among 1200 questions in decision tree training. The total number of leaf nodes was 34 681.

A. Objective Evaluation of Prosodic Structure Construction

The speech corpus includes 65 934 lexical word boundaries and 42 167 PW breaks. Of the PW breaks, 42 134 occur at lexical word boundaries; a lexical word boundary does not necessarily imply a PW break. To evaluate the performance of the S-CART in prosodic structure construction, the precision rate, recall rate, and accuracy were calculated as follows:

$$\text{Precision rate} = \frac{\mathrm{Count}(\text{correctly predicted breaks at layer } l)}{\mathrm{Count}(\text{predicted breaks at layer } l)} \tag{19}$$

$$\text{Recall rate} = \frac{\mathrm{Count}(\text{correctly predicted breaks at layer } l)}{\mathrm{Count}(\text{true breaks at layer } l)} \tag{20}$$

$$\text{Accuracy} = \frac{\sum_{l} \mathrm{Count}(\text{correctly predicted breaks at layer } l)}{\sum_{l} \mathrm{Count}(\text{true breaks at layer } l)} \tag{21}$$

where the predicted and the correctly predicted breaks at layer $l$ are counted against the true breaks at layer $l$ in the test data set. In this paper, breaks include PP breaks, PW breaks, and non-breaks (syllable boundaries). Table IV presents the results of the evaluation of break prediction. The overall accuracy of break prediction is 84.37%.

B. Assessment of the Weighting Factor

Table V lists the following four methods for modeling and generating pitch, which are assessed objectively and subjectively.
1) HMM: HMM-based pitch model.
2) U-CART(syl,dyn): single-layer (syllable layer) pitch model with dynamic features.
3) U-CART(hier): hierarchical pitch model without dynamic features.
4) U-CART(hier,dyn): hierarchical pitch model with dynamic features.

To determine the weighting factors $\alpha$ in U-CART(syl,dyn), U-CART(hier), and U-CART(hier,dyn), the mean distance between the target and the predicted dynamic feature vectors is used:

$$\bar{D} = \frac{1}{N}\sum_{n=1}^{N} \left\| \mathbf{o}_n - \hat{\mathbf{o}}_n \right\| \tag{22}$$

where $\mathbf{o}_n$ and $\hat{\mathbf{o}}_n$ are the target and the predicted feature vectors for the $n$th prosodic unit. Database Sets 1, 2, and 3 were used for training, tuning, and testing, respectively. Table VI presents the contextual features used in U-CART modeling.


TABLE VI CONTEXTUAL FEATURES IN U-CART MODELING

TABLE VII WEIGHTING FACTOR VALUES FOR U-CARTS

Fig. 8. Mean distance as a function of the weighting factor at the PW layer for U-CART(hier,dyn). "Inside": the distance in the training data set; "Outside": the distance in the test data set.

The degree of connection denotes whether the syllables belong to a single prosodic word and prosodic phrase. The contextual features used in cepstrum HMM training were adopted in the HMM-based pitch model; they include the linguistic and predicted prosodic features and are presented in Table VI.

Figs. 8 and 9 depict the mean distance and the number of leaf nodes as functions of the weighting factor $\alpha$ at the PW layer for U-CART(hier,dyn). "Inside" refers to the distance in the training data set, and "outside" refers to the distance in the test data set. As shown in the figures, the mean distance decreases and the number of leaf nodes increases as $\alpha$ decreases. Table VII presents the values of the weighting factors used in the following evaluations at the PW and SYL layers.

Fig. 9. Number of leaf nodes as a function of the weighting factor at the PW layer for U-CART(hier,dyn).

C. Objective Evaluation

The root mean squared error (RMSE) is used to assess the proposed method objectively

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(f(t) - \hat{f}(t)\right)^2} \tag{23}$$

where $f(t)$ and $\hat{f}(t)$ denote the target and predicted pitch contours, respectively. Table VIII presents the results of the RMSE evaluation. The proposed methods yielded a lower RMSE than the HMM-based method for the tuning set (Set 2) and the test set (Set 3). However, the hierarchical pitch model with dynamic features, U-CART(hier,dyn), cannot further reduce the prediction error in the RMSE evaluation. Fig. 10 plots the predicted pitch contours of U-CART(hier,dyn) and the HMM for the sentence /jhong guo2 yi jiou3 ci er4 nian2 zeng4 song4 gei3 mei3 guo2 de5 da4 syong2 mao sing sing cyu4 jhan3 chu/. The contour of U-CART(hier,dyn) was closer to the target than that of the HMM.

D. Subjective Evaluation

A double-blind subjective evaluation [39] was conducted. Six subjects were asked to choose the more natural sounding of a pair of utterances. Each of the 15 utterances was

TABLE VIII RMSE ASSESSMENT FOR OBJECTIVE EVALUATION
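The RMSE of eq. (23) can be computed as below. Restricting the average to frames where the target is voiced (nonzero pitch) is an assumption made here, since unvoiced frames carry no meaningful pitch target.

```python
import numpy as np

def pitch_rmse(target, predicted):
    """RMSE between target and predicted pitch contours, evaluated
    over frames where the target is voiced (target > 0)."""
    target = np.asarray(target, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    voiced = target > 0
    diff = target[voiced] - predicted[voiced]
    return float(np.sqrt(np.mean(diff ** 2)))
```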

Fig. 10. Pitch contour comparison between U-CART(hier,dyn) and HMM.

synthesized using four methods. The order of presentation of


Fig. 11. Results of the preference test.

the paired synthesized speech outputs to be compared was randomly determined. The subjects had no information concerning the method used to synthesize the speech.

Fig. 11 shows the results. The U-CART(syl,dyn) and U-CART(hier) methods did not yield more natural speech than the HMM-based method. The z-score, the statistic used in the test of a binomial population distribution, was employed to assess the significance of the preference; a higher z-score denotes greater significance. The z-scores for HMM versus U-CART(syl,dyn) and HMM versus U-CART(hier) were 28.8 and 64.8, respectively. However, by combining the dynamic features and the hierarchical pitch model, U-CART(hier,dyn) outperformed the HMM-based pitch model. The naturalness of the synthesized speech was improved by incorporating these two kinds of information in pitch modeling. U-CART(syl,dyn) utilized the dynamic features only at the syllable layer, whereas U-CART(hier,dyn) utilized dynamic features from the prosodic word layer down to the frame layer. The improvement probably arose from the smoother pitch generation obtained when dynamic features are utilized at each layer. Comparing U-CART(syl,dyn) with U-CART(hier) reveals that a single layer with dynamic features slightly outperforms the hierarchical model without dynamic features in naturalness; the z-score was 21.6.

The mean opinion scores (MOS) of the four methods were obtained using the same synthesized speech sentences as were used in the preference test. The six subjects who had participated in the preference test also participated in the MOS evaluation, which was conducted eight months after the preference test. A t-test on the MOS results of the HMM-based, U-CART(syl,dyn), U-CART(hier), and U-CART(hier,dyn) methods (with 178 degrees of freedom) indicated significant differences between any two pitch modeling methods.

V. CONCLUSION AND FUTURE WORK

This study proposes a hierarchical model for predicting pitch with dynamic features for use in speech synthesis.
The prosodic structure is constructed as the basis of pitch contour decomposition for modeling pitch features. In dynamic feature space, pitch prediction models are trained at prosodic word, syllable, and frame layers. The experimental results reveal that the prediction

of the pitch contour was significantly improved. The incorporation of prosody hierarchy and dynamic features improves the modeling of pitch contours in the synthesis of Mandarin speech. According to this work, for a tonal language the tone context contains much information about the pitch, and the deltas of the Legendre polynomial coefficients indicate how much the pitch shape of one syllable should be modified depending on the surrounding syllables. For a non-tonal language, however, the basic pitch shape of the syllable is unknown; future work should investigate the effect of the polynomial coefficient deltas of syllables in a non-tonal language. The goal of the generation process is to minimize the error between the predicted and the target values. The minimum generation error approach presented by Wu et al. [40] can be adopted to eliminate the inconsistency between the training and synthesis phases of the HMM. A higher-layer prosodic structure can also be considered in the design of the pitch prediction model. However, higher-layer prosody modeling depends on a large speech corpus with a manually tagged prosodic structure, and U-CART training of the proposed method would require higher-layer contextual information. The proposed method outperformed the baseline system using the same contextual information, although the former uses additional model parameters for the PW- and syllable-layer U-CARTs. Model complexity should be considered further in model training.

ACKNOWLEDGMENT

The authors would like to thank Dr. Kawahara for help with the STRAIGHT analysis/synthesis program, as well as Dr. Tokuda for providing the HTS speech synthesis program.

REFERENCES

[1] J. Tejedor, D. Wang, J. Frankel, S. King, and J. Colás, "A comparison of grapheme and phoneme-based units for Spanish spoken term detection," Speech Commun., vol. 50, no. 11–12, pp. 980–991, Nov.–Dec. 2008.
[2] C.-H. Wu, C.-C. Hsia, J.-F. Chen, and J.-F. Wang, "Variable-length unit selection in TTS using structural syntactic cost," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1227–1235, May 2007.
[3] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in Proc. ICASSP'96, 1996, pp. 373–376.
[4] A. W. Black and N. Campbell, "Optimizing selection of units from speech database for concatenative synthesis," in Proc. Eurospeech'95, 1995, pp. 581–584.
[5] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. W. Black, and K. Tokuda, "The HMM-based speech synthesis system version 2.0," in Proc. ISCA SSW6, Bonn, Germany, Aug. 2007.
[6] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc. ICASSP, Jun. 2000, pp. 1315–1318.
[7] J. Yamagishi, T. Nose, H. Zen, Z. Ling, T. Toda, K. Tokuda, S. King, and S. Renals, "Robust speaker-adaptive HMM-based text-to-speech synthesis," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 6, pp. 1208–1230, Aug. 2009.
[8] L. S. Lee, C. Y. Tseng, and M. Ouh-young, "The synthesis rules in a Chinese text-to-speech system," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 9, pp. 1309–1319, Sep. 1989.
[9] B. Ao, C. Shih, and R. Sproat, "A corpus-based Mandarin text-to-speech synthesizer," in Proc. ICSLP'94, Yokohama, Japan, Sep. 1994, pp. 1771–1774.
[10] A. Ljolje and F. Fallside, "Synthesis of natural sounding pitch contours in isolated utterances using hidden Markov models," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-34, no. 5, pp. 1074–1080, Oct. 1986.


[11] N. H. Pan, W. T. Jen, S. S. Yu, M. S. Yu, S. Y. Huang, and M. J. Wu, "Prosody model in a Mandarin text-to-speech system based on a hierarchical approach," in Proc. IEEE Int. Conf. Multimedia Expo, New York, NY, Jul. 2000, vol. 1, pp. 448–451.
[12] S. H. Chen, S. H. Hwang, and Y. R. Wang, "An RNN-based prosodic information synthesizer for Mandarin text-to-speech," IEEE Trans. Speech Audio Process., vol. 6, no. 3, pp. 226–239, May 1998.
[13] S. H. Kim and J. Y. Kim, "Efficient model of establishing words tone dictionary for Korean TTS system," in Proc. Eurospeech, Rhodes, Greece, Sep. 1997, pp. 243–246.
[14] M. Dong and K. T. Lua, "Pitch contour model for Chinese text-to-speech using CART and statistical model," in Proc. ICSLP'02, Denver, CO, Sep. 2002, pp. 2405–2408.
[15] C. H. Wu and J. H. Chen, "Automatic generation of synthesis units and prosodic information for Chinese concatenative synthesis," Speech Commun., vol. 35, pp. 219–237, 2001.
[16] J. Tao, "F0 prediction model of speech synthesis based on template and statistical method," in Lecture Notes in Artificial Intelligence. New York: Springer, 2004.
[17] C. Y. Tseng, S. H. Pin, Y. Lee, H. M. Wang, and Y. C. Chen, "Fluent speech prosody: Framework and modeling," Speech Commun., vol. 46, no. 3–4, pp. 284–309, 2005.
[18] X. Sun, "The determination, analysis and synthesis of fundamental frequency," Ph.D. dissertation, Northwestern Univ., Evanston, IL, 2002.
[19] C. Y. Tseng and Y. L. Lee, "Speech rate and prosody units: Evidence of interaction from Mandarin Chinese," in Proc. Int. Conf. Speech Prosody, Nara, Japan, Mar. 2004, pp. 251–254.
[20] S. H. Chen, W. H. Lai, and Y. R. Wang, "A statistics-based pitch contour model for Mandarin speech," J. Acoust. Soc. Amer., vol. 117, no. 2, pp. 908–925, 2005.
[21] M. Chu and Y. Qian, "Locating boundaries for prosodic constituents in unrestricted Mandarin texts," Comput. Linguist. Chinese Lang. Process., vol. 6, no. 1, pp. 61–82, 2001.
[22] S. H. Pin, Y. L. Lee, Y. C. Chen, H. M. Wang, and C. Y. Tseng, "A Mandarin TTS system with an integrated prosodic model," in Proc. ISCSLP'04, Hong Kong, Dec. 2004, pp. 169–172.
[23] H. Fujisaki and K. Hirose, "Analysis of voice fundamental frequency contours for declarative sentences of Japanese," J. Acoust. Soc. Japan (E), vol. 5, no. 4, pp. 233–241, 1984.
[24] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in Proc. Eurospeech'99, Budapest, Hungary, Sep. 1999, pp. 2347–2350.
[25] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg, "ToBI: A standard for labeling English prosody," in Proc. ICSLP'92, Banff, AB, Canada, Oct. 1992, pp. 867–870.
[26] P. Taylor, "The tilt intonation model," in Proc. ICSLP'98, Sydney, Australia, Nov. 1998, pp. 1383–1386.
[27] J. Teutenberg, C. Watson, and P. Riddle, "Modelling and synthesising F0 contours with the discrete cosine transform," in Proc. ICASSP'08, Las Vegas, NV, Mar. 2008, pp. 3973–3976.
[28] J. Latorre and M. Akamine, "Multilevel parametric-based F0 model for speech synthesis," in Proc. Interspeech'08, Brisbane, Australia, Sep. 2008, pp. 2274–2277.
[29] S. H. Chen and Y. R. Wang, "Vector quantization of pitch information in Mandarin speech," IEEE Trans. Commun., vol. 38, no. 9, pp. 1317–1320, Sep. 1990.
[30] H. Kawahara, "Speech representation and transformation using adaptive interpolation of weighted spectrum: Vocoder revisited," in Proc. ICASSP'97, Munich, Germany, Apr. 1997, vol. 2, pp. 1303–1306.
[31] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., vol. 27, no. 3–4, pp. 187–207, 1999.
[32] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," in Proc. ICASSP'92, San Francisco, CA, Mar. 1992, vol. 1, pp. 137–140.
[33] D. Klein and C. D. Manning, "Fast exact inference with a factored model for natural language parsing," in Advances in Neural Information Processing Systems 15 (NIPS 2002). Cambridge, MA: MIT Press, 2003, pp. 3–10.
[34] K. Shinoda and T. Watanabe, "MDL-based context-dependent subword modeling for speech recognition," J. Acoust. Soc. Japan (E), vol. 21, pp. 79–86, Mar. 2000.


[35] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, pp. 1–38, 1977.
[36] L. H. Cai, D. D. Cui, and R. Cai, "TH-CoSS, a Mandarin speech corpus for TTS," J. Chinese Inf. Process., vol. 21, no. 2, pp. 94–99, Mar. 2007.
[37] C. Huang, Y. Shi, J. Zhou, M. Chu, T. Wang, and E. Chang, "Segmental tonal modeling for phone set design in Mandarin LVCSR," in Proc. ICASSP'04, Montreal, QC, Canada, May 2004, pp. 901–904.
[38] T. Lin and L. J. Wang, Phonetic Tutorials. Beijing, China: Beijing Univ. Press, 1992, pp. 103–121.
[39] S. Shott, Statistics for Health Professionals. W. B. Saunders, 1990.
[40] Y.-J. Wu and R.-H. Wang, "Minimum generation error training for HMM-based speech synthesis," in Proc. ICASSP'06, Toulouse, France, May 2006, pp. 89–92.

Chi-Chun Hsia received the B.S. and Ph.D. degrees in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2001 and 2008, respectively. Since September 2008, he has been with the Industrial Technology Research Institute. His research interests include digital signal processing, text-to-speech synthesis, natural language processing, speech recognition, and multimodal sensor data fusion.

Chung-Hsien Wu (M’88–SM’03) received the B.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1981, and the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 1987 and 1991, respectively. Since August 1991, he has been with the Department of Computer Science and Information Engineering, National Cheng Kung University. He became a Professor and Distinguished Professor in August 1997 and August 2004, respectively. From 1999 to 2002, he served as the Chairman of the Department. Currently, he is the Deputy Dean of the College of Electrical Engineering and Computer Science, National Cheng Kung University. He also worked at the Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, Cambridge, in summer 2003 as a Visiting Scientist. He was the Editor-in-Chief of the International Journal of Computational Linguistics and Chinese Language Processing from 2004 to 2008. Dr. Wu serves as a Guest Editor of the ACM Transactions on Asian Language Information Processing, the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, and the EURASIP Journal on Audio, Speech, and Music Processing in 2008–2009. He is currently the Subject Editor on Information Engineering of the Journal of the Chinese Institute of Engineers (JCIE) and on the Editorial Advisory Board of The Open Artificial Intelligence Journal. His research interests include speech recognition, text-to-speech, and spoken language processing. He is a member of the International Speech Communication Association (ISCA). He has been the President of the Association for Computational Linguistics and Chinese Language Processing (ACLCLP) since September 2009.

Jung-Yun Wu received the B.S. degree in computer science from National Cheng Kung University, Tainan, Taiwan, in 2006, and the M.S. degree in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2008. Her research interests include digital signal processing, natural language processing, and text-to-speech synthesis.