IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 9, SEPTEMBER 2007
Conversion Function Clustering and Selection Using Linguistic and Spectral Information for Emotional Voice Conversion

Chi-Chun Hsia, Student Member, IEEE, Chung-Hsien Wu, Senior Member, IEEE, and Jian-Qi Wu

Abstract—In emotional speech synthesis, a large speech database is required for high-quality speech output. Voice conversion needs only a compact-sized speech database for each emotion. This study designs and accumulates a set of phonetically balanced, small-sized emotional parallel speech databases to construct conversion functions. The Gaussian mixture bigram model (GMBM) is adopted as the conversion function to characterize the temporal and spectral evolution of the speech signal. The conversion function is initially constructed for each instance of parallel subsyllable pairs in the collected speech database. To reduce the total number of conversion functions and select an appropriate conversion function, this study presents a framework that incorporates linguistic and spectral information for conversion function clustering and selection. Subjective and objective evaluations with statistical hypothesis testing are conducted to evaluate the quality of the converted speech. The proposed method compares favorably with previous methods in conversion-based emotional speech synthesis.

Index Terms—Emotional text-to-speech synthesis, emotional voice conversion, linguistic feature, function clustering and selection, Gaussian mixture bigram model.
1 INTRODUCTION

The ability to express emotions is of priority concern in developing computer-aided communication tools. Rule-based emotional speech synthesis has been investigated with the analysis of acoustic features of emotional speech [1]. For high-quality speech output, concatenative text-to-speech (TTS) systems have been developed with large-sized emotional speech databases of target emotions [2], [3]. To overcome the application and development obstacles resulting from the requirement for large-sized speech databases, voice conversion methods have been adopted as a postprocessing block for TTS systems. Kawanami et al. [4] utilized the voice conversion method to convert an utterance from neutral to emotional speech. Cummings and Clements [5] asserted that the emotion of a speech signal is partly determined by the shape of the underlying voice source (or glottal excitation waveform). Although prosodic features carry the main expression in speech, spectral features are also indispensable in emotional speech expression [4]. In the recent decade, prosodic features have received considerable attention; however, inadequate spectral conversion, even with well-converted prosody, generally produces deficient emotional speech output. This study therefore focuses on the spectral features to improve the quality of the converted emotional speech.
The authors are with the Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan, Taiwan, ROC. E-mail: {shiacj, chwu, glinewu}@csie.ncku.edu.tw.
Manuscript received 11 Sept. 2006; revised 5 Feb. 2007; accepted 26 Apr. 2007; published online 11 May 2007. Recommended for acceptance by R.C. Guido, L. Deng, and S. Makino.
Abe et al. [6] applied a codebook mapping method from source to target spectral feature vectors for speaker conversion. However, stochastic approaches have dominated the development of voice conversion systems in the recent decade [7], [8], [9], [10], [11], [12]. Two mainstream methods of Gaussian mixture model (GMM)-based voice conversion are the mean square error method [7] and the joint density method [8]; these two methods have nearly equivalent performance [8]. GMM-based voice conversion is performed by a frame-by-frame procedure with a time-independence assumption and disregards the spectral envelope evolution. Toda et al. [9] introduced a GMM-based framework that considers global variance. Hidden Markov model (HMM)-based methods have recently been proposed [10], [11], [12]. The state-transition property in HMM-based methods provides a good approximation of the spectral envelope evolution along the time axis. However, the HMM-based method requires a large amount of training data for robust parameter estimation. Previous stochastic voice conversion systems work with spectral features not only to estimate the conversion function on the spectral feature space but also to transform new source spectral vectors into target vectors.

Duxans et al. [10] considered the phonetic information available in a TTS system for each frame, including phone, vowel/consonant flag, point of articulation, manner, and voicing, by adopting a classification and regression tree (CART). The conversion function is trained on each leaf node, and the mean square error between the predicted and target features decreases as the number of conversion functions increases [10]. However, CART is a sequence of hard decision processes and does not output either a distance or a similarity score. The CART framework is designed as a frame-based approach, considering only the frame information. Every split in a node regards only the speech data in that node. All linguistic features are treated as equal, and the acoustic similarities between conversion functions are not considered.
Fig. 1. Flowchart of the proposed approach.
This study presents a Gaussian mixture bigram model (GMBM)-based [13] voice conversion function to characterize the temporal and spectral evolution in the conversion process. The conversion function is initially constructed for each subsyllable paired instance in the collected speech database, so the number of initialized conversion functions equals the number of subsyllable paired instances in the collected speech database. This achieves the minimum mean square error, but the total number of conversion functions is proportional to the size of the collected speech database, and the search time for conversion functions increases as the collected speech database grows. A K-Means-based framework is therefore introduced that incorporates linguistic and spectral information to reduce the total number of conversion functions and to select the appropriate conversion function. Linguistic similarities accounting for the context of different functions are estimated on linguistic feature vectors using the cosine measure. Since the conversion functions are derived from joint distributions on the spectral feature space, the Kullback-Leibler (KL) divergence [14] and the sigmoid function are used to calculate the spectral similarities between conversion functions.

For conversion function training, a small-sized speech database was designed and collected for each emotion to cover all 150 subsyllables in Mandarin, consisting of 112 context-dependent initial parts and 38 final parts [15]. A large speech database with neutral style was collected as the unit inventory of the TTS system, and the synthetic neutral utterances were applied as the input speech of the emotional voice conversion model for emotional speech synthesis. Because a large range of speech modification is required, an analysis-by-synthesis scheme is adopted, not only to enlarge the dynamic range of speech modification but also to ensure the voice quality. This study adopts the Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum (STRAIGHT) algorithm, proposed by Kawahara et al. [16], [17], to estimate the spectrum and pitch contours of the neutral utterances synthesized by the TTS system. The STRAIGHT algorithm is a high-quality analysis and synthesis scheme that uses pitch-adaptive spectral analysis combined with a surface reconstruction method in the time-frequency region to remove signal periodicity. The algorithm extracts the fundamental frequency (F0) by the Time-domain Excitation extractor using Minimum Perturbation Operator (TEMPO) and designs an excitation source using phase manipulation.

Fig. 1 shows the flowchart of the proposed method. In the training phase, the STRAIGHT algorithm is adopted to estimate the spectra of the source and target speech. After alignment using the dynamic time warping (DTW) algorithm, the initial models (initial conversion functions) for each pair of source and target speech segments of the same subsyllable are trained using the Expectation Maximization (EM) algorithm [18]. The linguistic feature vector corresponding to each initial model is extracted and calculated from the text. The initial models are clustered by the K-Means algorithm using the spectral and linguistic similarities between conversion functions. The GMBM-based conversion function is then estimated for each cluster. In the synthesis phase, the spectral and prosodic features of the synthetic neutral-styled speech are extracted by the STRAIGHT algorithm. The conversion functions are selected according to the spectral features and the extracted linguistic features. Finally, the output speech is synthesized by the STRAIGHT algorithm using the converted spectral and prosodic features.

The remainder of this paper is organized as follows: Section 2 introduces the GMBM for constructing conversion functions. Section 3 describes the framework of the proposed conversion function clustering and selection method and also introduces the estimation of linguistic and spectral similarities. The experiments and results are described in Section 4. Conclusions are finally drawn in Section 5.
2 THE GAUSSIAN MIXTURE BIGRAM MODEL

The joint density method has been applied to voice conversion by modeling the source and target paired spectral feature vectors in a joint GMM distribution; the conversion function is obtained from the regression [8]. This study adopts the GMBM to characterize temporal and spectral evolution in the conversion function.

2.1 The Gaussian Mixture Model

The source and target feature vector sequences are aligned using the DTW algorithm to obtain the paired feature vector sequence Z = {z_1, z_2, ..., z_T}, where z_t = [x_t', y_t']'. x_t and y_t denote the source and target spectral feature vectors, respectively, with dimensionality equal to d. The distribution of Z is modeled as

$$p(z_t) = \sum_{m=1}^{M} w_m\, p(x_t, y_t \mid m) = \sum_{m=1}^{M} w_m\, N(z_t; \mu_m, \Sigma_m), \qquad (1)$$

where w_m is the prior probability of z_t given the component m and satisfies Σ_{m=1}^{M} w_m = 1. M represents the total number of components. N(z_t; μ_m, Σ_m) denotes the 2d-dimensional Gaussian distribution with the mean vector μ_m = [(μ_m^X)', (μ_m^Y)']' and the covariance matrix

$$\Sigma_m = \begin{bmatrix} \Sigma_m^{XX} & \Sigma_m^{XY} \\ \Sigma_m^{YX} & \Sigma_m^{YY} \end{bmatrix}.$$

The parameters (w_m, μ_m, Σ_m) for each component can be estimated using the EM algorithm [18]. The conversion function is then given by

$$\tilde{y}_t = F(x_t) = E[y_t \mid x_t] = \sum_{m=1}^{M} p(m \mid x_t)\left[\mu_m^{Y} + \Sigma_m^{YX}\left(\Sigma_m^{XX}\right)^{-1}\left(x_t - \mu_m^{X}\right)\right], \qquad (2)$$

where p(m | x_t) represents the posterior probability of x_t belonging to component m and is represented as

$$p(m \mid x_t) = \frac{w_m\, N\left(x_t; \mu_m^{X}, \Sigma_m^{XX}\right)}{\sum_{k=1}^{M} w_k\, N\left(x_t; \mu_k^{X}, \Sigma_k^{XX}\right)}. \qquad (3)$$

2.2 The Gaussian Mixture Bigram Model

This study adopts the GMBM [13] to characterize the temporal information in the modeling of spectral conversion. The probability density function of the joint random variable z_t = (y_t, y_{t-1}, x_t, x_{t-1}) is modeled using a mixture model as

$$p(z_t) = \sum_{m=1}^{M} w_m\, p(y_t, y_{t-1}, x_t, x_{t-1} \mid m) = \sum_{m=1}^{M} w_m\, N(z_t; \mu_m, \Sigma_m), \qquad (4)$$

where μ_m = E[y_t, y_{t-1}, x_t, x_{t-1}]' and

$$\Sigma_m = \begin{bmatrix}
\Sigma_{y_t, y_t} & \Sigma_{y_t, y_{t-1}} & \Sigma_{y_t, x_t} & \Sigma_{y_t, x_{t-1}} \\
\Sigma_{y_{t-1}, y_t} & \Sigma_{y_{t-1}, y_{t-1}} & \Sigma_{y_{t-1}, x_t} & \Sigma_{y_{t-1}, x_{t-1}} \\
\Sigma_{x_t, y_t} & \Sigma_{x_t, y_{t-1}} & \Sigma_{x_t, x_t} & \Sigma_{x_t, x_{t-1}} \\
\Sigma_{x_{t-1}, y_t} & \Sigma_{x_{t-1}, y_{t-1}} & \Sigma_{x_{t-1}, x_t} & \Sigma_{x_{t-1}, x_{t-1}}
\end{bmatrix}.$$

The conversion function is then given by (5), shown in Fig. 2. Assuming that y_t is independent of x_{t-1} and that x_t is independent of y_{t-1}, the conversion function is further simplified as (6), shown in Fig. 2.
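To make the regression in (2) and (3) concrete, the following Python sketch (not the authors' code) converts a single source frame given an already trained joint GMM; `weights`, `means`, and `covs` are assumed placeholder names for the EM-estimated w_m, μ_m, and Σ_m, with the source block stored first in each joint vector. The GMBM conversion functions (5) and (6) apply the same kind of regression to the augmented vector (y_t, y_{t-1}, x_t, x_{t-1}).

```python
import numpy as np
from scipy.stats import multivariate_normal

def convert_frame(x, weights, means, covs):
    """Convert one source frame x (length d) to E[y_t | x_t] as in (2)-(3).

    weights[m], means[m] (length 2d), covs[m] (2d x 2d) are the trained joint-GMM
    parameters w_m, mu_m, Sigma_m, with the source dimensions stored first.
    """
    d = len(x)
    M = len(weights)
    post = np.empty(M)
    for m in range(M):
        mu_x = means[m][:d]
        S_xx = covs[m][:d, :d]
        post[m] = weights[m] * multivariate_normal.pdf(x, mean=mu_x, cov=S_xx)
    post /= post.sum()                                   # posterior p(m | x_t), Eq. (3)

    y = np.zeros(d)
    for m in range(M):
        mu_x, mu_y = means[m][:d], means[m][d:]
        S_xx, S_yx = covs[m][:d, :d], covs[m][d:, :d]
        # component-wise regression mu_m^Y + Sigma_m^YX (Sigma_m^XX)^-1 (x - mu_m^X), Eq. (2)
        y += post[m] * (mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
    return y
```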
3 CONVERSION FUNCTION CLUSTERING AND SELECTION

This study models the conversion functions according to the characteristics of the linguistic and spectral information. Conversion functions are clustered by the K-Means algorithm using the proposed linguistic and spectral similarity measures. The clustering process is applied to segment-based conversion functions rather than directly to the speech data represented in frames, since voice conversion is a segment-to-segment conversion process rather than a frame-to-frame one. Each conversion function corresponds to a source and target paired speech segment in the collected parallel speech database. The linguistic similarity is estimated by the cosine measure on the linguistic feature vectors, in which the elements are weighted to characterize the importance of each linguistic feature. Since the conversion function is derived from the joint distribution of spectral feature vectors, the spectral similarity is quantified by the KL divergence [14] between distributions and normalized using the sigmoid function. A weighting factor is used to combine the linguistic and spectral similarities between conversion functions linearly.

3.1 Function Clustering

For each subsyllable, one conversion model with multiple conversion functions is trained using the K-Means algorithm in the following steps. Table 1 shows the K-Means algorithm for conversion function clustering.

1. For each source and target paired speech segment with the same subsyllable label in the parallel speech database, a conversion function is trained using the EM algorithm [18] and denoted as f_i, 1 ≤ i ≤ I, where I is the total number of paired speech segments of the subsyllable and depends on the size of the collected parallel speech database. The corresponding joint distribution is represented by g_i. The corresponding linguistic feature vector for conversion function f_i is quantified according to the context and is represented as l_i.

2. The initial conversion function for each cluster is randomly chosen from f_i, 1 ≤ i ≤ I, and is denoted as F_j, 1 ≤ j ≤ J, where J is the total number of clusters. The corresponding joint distribution is given by G_j and the linguistic feature vector by L_j.
3. The similarity of each f_i to F_j is calculated by

$$\mathrm{Sim}(f_i, F_j) = \alpha\, S_{\mathrm{spectral}}(g_i, G_j) + (1 - \alpha)\, S_{\mathrm{linguistic}}(l_i, L_j), \qquad (9)$$

where S_spectral(g_i, G_j) and S_linguistic(l_i, L_j) denote the spectral and linguistic similarities, respectively, and α is a weighting factor between these two similarity measures. The value of α is determined from the relative error between the predicted and target spectral features, as detailed in Section 4.2. For each f_i, the most similar conversion function, F_j, is selected and b(i) = j is set.

4. The conversion function F_j is reestimated for each cluster by the EM algorithm using the speech data of the conversion functions f_i with b(i) = j, and the corresponding joint distribution G_j and linguistic feature vector L_j are recalculated.

5. Repeat Steps 3 and 4 until no change is observed in the assignments of b(i) in two successive iterations.

Fig. 2. Equations (5), (6), (7), and (8).

TABLE 1 K-Means Algorithm for Conversion Function Clustering
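The following Python sketch (a simplified illustration, not the authors' implementation) mirrors the five steps of Table 1. The callables `train_fn` (EM training of a conversion function from a list of aligned source-target segment pairs), `ling_fn` (building the corresponding linguistic feature vector), `spectral_sim`, and `linguistic_sim` (the similarities of Sections 3.3 and 3.2), as well as the weighting factor `alpha` of (9), are assumed to be supplied by the caller.

```python
import random

def cluster_conversion_functions(segments, J, train_fn, ling_fn,
                                 spectral_sim, linguistic_sim, alpha, max_iter=100):
    """K-Means-style clustering of conversion functions, following Table 1."""
    I = len(segments)
    f = [train_fn([s]) for s in segments]              # Step 1: one function per paired segment
    lv = [ling_fn([s]) for s in segments]              #         and its linguistic vector l_i
    seeds = random.sample(range(I), J)                 # Step 2: random initial F_j, L_j
    F = [f[j] for j in seeds]
    Lv = [lv[j] for j in seeds]
    assign = [None] * I
    for _ in range(max_iter):
        new = [max(range(J),                           # Step 3: most similar F_j under Eq. (9)
                   key=lambda j: alpha * spectral_sim(f[i], F[j])
                                 + (1 - alpha) * linguistic_sim(lv[i], Lv[j]))
               for i in range(I)]
        if new == assign:                              # Step 5: stop when b(i) no longer changes
            break
        assign = new
        for j in range(J):                             # Step 4: re-estimate F_j, G_j, L_j by EM
            members = [segments[i] for i in range(I) if assign[i] == j]
            if members:
                F[j], Lv[j] = train_fn(members), ling_fn(members)
    return F, Lv, assign
```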
3.2 Linguistic Similarity

The linguistic similarity between two conversion functions f_i and F_j is estimated by the cosine measure on the corresponding linguistic feature vectors l_i and L_j as

$$S_{\mathrm{linguistic}}(l_i, L_j) = \cos(l_i, L_j). \qquad (10)$$

The term l_i = [l_{i,1}, l_{i,2}, ..., l_{i,M}] represents the linguistic feature vector of conversion function f_i, where M is the total number of linguistic features. The cosine measure calculates the similarity according to the angle between two vectors; two similar vectors need not have the same values in the corresponding dimensions. Each element l_{i,m}, 1 ≤ m ≤ M, is given in a form similar to the term-frequency-and-inverse-document-frequency (tf · idf) weighting used in the field of information retrieval [19] as

$$l_{i,m} = \begin{cases} \left(1 + \log(\mathrm{freq}_{i,m})\right) \log\dfrac{K}{N_m}, & \text{if } \mathrm{freq}_{i,m} \ge 1 \\ 0, & \text{if } \mathrm{freq}_{i,m} = 0, \end{cases} \qquad (11)$$

where freq_{i,m} denotes the number of appearances of the mth linguistic feature in the training data of the conversion function f_i, K is the total number of conversion functions, and N_m denotes the number of functions in which the mth linguistic feature appears. If the mth linguistic feature appears in the database, the element l_{i,m} is assigned by the first clause of (11); otherwise, it is set to zero. The formula log(K/N_m) = log K - log N_m gives full weight to linguistic features that appear in one conversion function (log K - log N_m = log K - log 1 = log K). A linguistic feature that appears in all conversion functions has zero weight (log K - log N_m = log K - log K = 0).
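A compact Python sketch of (10) and (11) follows (not the authors' code); `freq` is a hypothetical K x M count matrix in which freq[i][m] is the number of times the mth linguistic feature occurs in the training data of conversion function f_i.

```python
import numpy as np

def linguistic_vectors(freq):
    """Eq. (11): tf-idf-like weighting of a K x M matrix of linguistic feature counts."""
    freq = np.asarray(freq, dtype=float)
    K = freq.shape[0]
    Nm = np.count_nonzero(freq, axis=0)            # number of functions containing feature m
    idf = np.log(K / np.maximum(Nm, 1))            # guard against features never observed
    return np.where(freq >= 1, (1.0 + np.log(np.maximum(freq, 1.0))) * idf, 0.0)

def linguistic_similarity(li, Lj):
    """Eq. (10): cosine measure between two linguistic feature vectors."""
    denom = np.linalg.norm(li) * np.linalg.norm(Lj)
    return float(np.dot(li, Lj)) / denom if denom > 0 else 0.0
```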
3.3 Spectral Similarity

Since the conversion functions are derived from the joint distribution of source and target spectral feature vectors, the spectral similarity between conversion functions f_i and F_j is estimated by the KL divergence [14] on their corresponding joint probability density functions g_i and G_j, respectively, and normalized using the sigmoid function as

$$S_{\mathrm{spectral}}(g_i, G_j) = 1 - \frac{1}{1 + \exp\left(-\gamma\, D_{\mathrm{KL}}(g_i, G_j)\right)}, \qquad (12)$$

where γ is a slope parameter. D_KL(g_i, G_j) represents the symmetric KL divergence between the two distributions g_i and G_j and is given by

$$D_{\mathrm{KL}}(g_i, G_j) = \left[\mathrm{KL}(g_i \| G_j) + \mathrm{KL}(G_j \| g_i)\right] / 2. \qquad (13)$$

In the above equation, KL(g_i || G_j) denotes the KL divergence between two distributions. As the mixture model is adopted, the KL divergence can be approximated by

$$\mathrm{KL}(g_i \| G_j) \approx \sum_{n} \pi_n \min_{m} \mathrm{KL}(g_{i,n} \| G_{j,m}), \qquad (14)$$

where g_i = Σ_n π_n g_{i,n} and G_j = Σ_m ω_m G_{j,m} represent two mixture models with mixture weights π_n and ω_m, respectively. When a Gaussian distribution is adopted for each component, the KL divergence can be calculated as

$$\mathrm{KL}\left(N(\mu_1, \Sigma_1) \,\|\, N(\mu_2, \Sigma_2)\right) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} + \mathrm{Tr}\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_1 - \mu_2)^{\mathrm{T}}\Sigma_2^{-1}(\mu_1 - \mu_2) - d\right], \qquad (15)$$

where N(μ, Σ) denotes the Gaussian distribution with mean vector μ and covariance matrix Σ in a d-dimensional space.
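The following Python sketch (not the authors' code) implements (12)-(15) for mixtures given as (weights, means, covariances) triples. The slope symbol `gamma` and the sign convention inside the sigmoid follow the reconstruction of (12) above and should be read as assumptions.

```python
import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    """Eq. (15): KL( N(mu1, S1) || N(mu2, S2) ) for d-dimensional Gaussians."""
    d = len(mu1)
    S2_inv = np.linalg.inv(S2)
    diff = np.asarray(mu1) - np.asarray(mu2)
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1))
                  + np.trace(S2_inv @ S1)
                  + diff @ S2_inv @ diff - d)

def kl_mixture(g, G):
    """Eq. (14): component-matching approximation of KL(g || G) for two Gaussian mixtures."""
    w_g, mu_g, S_g = g
    w_G, mu_G, S_G = G
    return sum(w_g[n] * min(kl_gauss(mu_g[n], S_g[n], mu_G[m], S_G[m])
                            for m in range(len(w_G)))
               for n in range(len(w_g)))

def spectral_similarity(g, G, gamma=1.0):
    """Eqs. (12)-(13): sigmoid-normalized symmetric KL between joint distributions."""
    d_kl = 0.5 * (kl_mixture(g, G) + kl_mixture(G, g))
    return 1.0 - 1.0 / (1.0 + np.exp(-gamma * d_kl))
```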
3.4 Selection Process

Fig. 3. Block diagram for conversion function selection.

Fig. 3 shows the function selection process. The spectral feature vectors for each new speech segment are extracted by the STRAIGHT algorithm, as detailed in Section 4. Each candidate conversion function F_j trained from the source-target parallel training corpus is described by its linguistic feature vector L_j and its source-target joint distribution G_j, both obtained from the training data. The similarity between the input speech segment X and the candidate conversion function F_j is measured as a weighted sum of spectral and linguistic similarities, as in (9). Since the spectral similarity is measured on joint distributions, the candidate converted target feature vectors Ỹ_j are first derived for the input speech segment based on the candidate conversion function F_j. The joint distribution of the input source X and the converted target feature vectors Ỹ_j is then estimated by the EM algorithm and is used to calculate the spectral similarity to the joint distribution G_j belonging to F_j. The linguistic similarity is estimated between the linguistic feature vector of the source X and the linguistic feature vector L_j belonging to F_j. The conversion function with the largest weighted sum of spectral and linguistic similarities is selected as the conversion function for voice conversion.
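A high-level sketch of the selection loop in Fig. 3 is given below under stated assumptions: `convert` applies a candidate conversion function to the input segment, `fit_joint_gmm` estimates the joint distribution of the source and candidate-converted vectors by EM, and `ling_vec` extracts the linguistic feature vector of the input; all three are hypothetical callables, and the similarity functions are those sketched above.

```python
def select_conversion_function(X, candidates, convert, fit_joint_gmm, ling_vec,
                               spectral_similarity, linguistic_similarity, alpha):
    """Pick the conversion function with the largest weighted similarity of Eq. (9)."""
    lx = ling_vec(X)                                    # linguistic features of the input segment
    best_j, best_score = None, float("-inf")
    for j, (F_j, L_j, G_j) in enumerate(candidates):
        Y_j = convert(X, F_j)                           # candidate converted target vectors
        g_xy = fit_joint_gmm(X, Y_j)                    # joint distribution of (X, Y_j) by EM
        score = (alpha * spectral_similarity(g_xy, G_j)
                 + (1 - alpha) * linguistic_similarity(lx, L_j))
        if score > best_score:
            best_j, best_score = j, score
    return best_j
```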
4 EXPERIMENTS AND RESULTS
Several subjective and objective evaluations were conducted using statistical hypothesis testing. Three phonetically balanced, small-sized speech databases, each for one emotion, were designed and collected to train the voice conversion models. Natural speech, rather than synthetic speech, was collected and used as the training data.
TABLE 2 Linguistic Features Used to Calculate Linguistic Similarity
TABLE 3 Descriptive Information of the Emotional Speech Databases
TABLE 4 Acoustic Properties of the Emotional Speech Databases (Number of Sentences: 300; Unit: Syllable)
For feature extraction, the mel-frequency cepstral coefficients (MFCCs) were calculated from the smoothed spectrum extracted by the STRAIGHT algorithm. The analysis window was 23 ms with a window shift of 8 ms. The order of the cepstral coefficients was set to 45. Since the proposed method was employed as a postprocessing module of the TTS system, the speech synthesized by the TTS system was used as the input of the voice conversion model. The basic units for synthesizing neutral speech comprised 1,410 distinct tonal syllables. Each syllable in Mandarin speech can be phonetically represented as an INITIAL part followed by a FINAL part. The experiment was performed with a speech database consisting of 5,613 sentences, pronounced by a female native speaker with a neutral expression in a professional recording studio. The speaker was a radio announcer and was familiar with our study. All utterances were recorded at a sampling rate of 22.05 kHz with 16-bit resolution. The duration of the collected speech database was 5.46 hours. The intelligibility and mean opinion score (MOS) of the neutral speech synthesizer were 99.7 percent and 3.94, respectively. Table 2 shows the linguistic features adopted to calculate the linguistic similarity, which include features at the subsyllable, syllable, and word levels. At the subsyllable
level, the subsyllable types of the current, preceding, and succeeding subsyllables were considered. The INITIAL part can be categorized into six broad types according to the manner of consonant articulation, and the FINAL part can be categorized into 17 types according to the constituent vowel nucleus and nasal ending [20]. At the syllable level, the tone type of a syllable in Mandarin has five categories: tones 1-4 and the neutral tone. The position in a word was classified into four categories. At the word level, the 44 Part-of-Speech (POS) types used in this study are a subset of the complete POS set described in [21].
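As an illustration of how the Table 2 features could be counted per training segment, the sketch below uses a purely hypothetical dictionary representation of a segment's context; the key names and feature strings are illustrative only and, when accumulated over a conversion function's training data, play the role of the counts freq_{i,m} in (11).

```python
from collections import Counter

def linguistic_feature_counts(segment):
    """segment: dict with assumed keys describing one subsyllable's context."""
    feats = [
        f"cur_initial={segment['initial_type']}",   # one of 6 broad INITIAL types
        f"cur_final={segment['final_type']}",       # one of 17 FINAL types
        f"prev_subsyl={segment['prev_subsyl']}",    # preceding subsyllable type
        f"next_subsyl={segment['next_subsyl']}",    # succeeding subsyllable type
        f"tone={segment['tone']}",                  # tones 1-4 or the neutral tone
        f"word_pos={segment['word_position']}",     # one of 4 positions in a word
        f"pos={segment['pos']}",                    # one of the 44 POS types
    ]
    return Counter(feats)

# Counter objects from all segments of a conversion function can be summed with `+`
# to obtain that function's feature counts.
```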
4.1 Speech Database Design and Collection

Three target emotions commonly defined and used in emotional TTS synthesis systems, namely, happiness, sadness, and anger, were chosen for the experiments. A context-independent, subsyllable-balanced script was designed for each emotion type. The script was selected from a large sentence pool for each emotion type. Table 3 summarizes the statistical information of the selected text scripts. The same speaker pronounced all of the sentences for the TTS system. All utterances were recorded at a sampling rate of 22.05 kHz and a resolution of 16 bits.
Table 4 summarizes the acoustic properties of the collected speech data. The sad speech had the lowest mean F0 and the smallest dynamic range. This finding supports those of previous studies [2], [3]. Moreover, happiness had the shortest syllable duration and anger had the largest dynamic range of the root-mean-square (RMS) energy. Twenty sentences other than those in the collected speech database for each emotion type were also randomly selected as the test set. These sentences were also pronounced by the same speaker and were recorded in the same environment as the target utterances for comparison with the converted utterances. Three subjects listened to the recorded sentences and were required to classify each sentence as spoken with happiness, sadness, or anger. They perceived all sentences correctly. The boundaries of each subsyllable speech segment were aligned by an HMM-based speaker-dependent speech recognizer [22] and then further refined manually. Speaker-independent HMMs were adopted as the initial models to train the speaker-dependent HMMs for the speaker who provided the speech databases. The recognition rate of the adopted speaker-dependent speech recognizer was 87.3 percent.

4.2 Objective Test

Every sentence in the test set was initially synthesized by the TTS system. The synthesized utterances were further converted using a GMBM-based conversion function clustered by the K-Means algorithm (the so-called K-Means-based GMBM). The conversion function in (6) for the GMBM was simplified such that all of the covariance and cross-covariance matrices were diagonal. The conversion models were built for each of the 150 subsyllables, as described in Section 1. The EM algorithm was adopted to train the conversion functions. All of the 300 parallel utterances were used as the training and test data for each emotion type. Fig. 4 illustrates the relative errors of the subsyllable /ien/ as a function of the number of clusters using the GMBM-based conversion function with a mixture number of 4 and a weighting factor of 0.4. These results reveal that the relative error decreases with an increasing number of clusters. Fig. 5 shows the relative errors of the subsyllable /ien/ as a function of the mixture number using the GMBM-based conversion function with the number of clusters set to 6. The number of clusters and mixtures for each subsyllable was manually fine-tuned. Table 5 summarizes the results of the K-Means training, including the number of vectors per group and the number of GMM classes per group.

Fig. 4. The relative errors of the subsyllable /ien/ as a function of the number of clusters.

Fig. 5. The relative errors of the subsyllable /ien/ as a function of the mixture number.

Fig. 6. Relative error for the K-Means-based GMBM with different weighting factors.

TABLE 5 Results of the K-Means Training

To determine the weighting factor α, Fig. 6 depicts the average relative error of the log-spectral distortion of the MFCCs derived from the smoothed STRAIGHT spectrum. The performance index used for testing is given by

$$\text{relative error} = \frac{1}{M}\sum_{m=0}^{M-1}\frac{D(\tilde{y}_m, y_m)}{D(x_m, y_m)}, \qquad (16)$$
where M is the total frame number of the source speech; x_m, y_m, and ỹ_m denote the mth frames of the source, aligned target, and converted speech, respectively; and D(·) represents the log-spectral distortion. As shown in Fig. 6, the relative errors achieved their maximum values when only the spectral similarity was used (α = 1). In the range of 0.1 ≤ α ≤ 0.5, the happiness style resulted in the worst performance and the sadness style produced the lowest relative error. The weighting factors for the K-Means-based GMBM were set to 0.3, 0.4, and 0.3 for happiness, sadness, and anger, respectively, in the following experiments. The weighting factors for the K-Means-based GMM were set to 0.4, 0.3, and 0.3 for happiness, sadness, and anger, respectively.

Fig. 7. Relative error for the number of training sentences.

Fig. 7 shows the average relative error as a function of the number of training sentences. Incorporating temporal information in the conversion process yields a lower relative error than the GMM-based method. The proposed K-Means-based framework also resulted in lower distortion than the CART-based methods. Fig. 7 also illustrates the results of the K-Means-based GMBM when the real target was used as y_{t-1}, which outperformed the proposed method when the transformed target was used as y_{t-1}. The results indicate that the transformed features were still far from the real target features.
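The relative-error index in (16) is straightforward to compute once the source, aligned-target, and converted frame sequences are available. The following Python sketch (not the authors' code) uses a plain Euclidean distance between cepstral frames as a stand-in for the log-spectral distortion D(·), which is an assumption rather than the exact measure used in the paper.

```python
import numpy as np

def relative_error(source, target, converted):
    """Eq. (16): source, target, converted are (M x d) arrays of time-aligned frames."""
    d_conv = np.linalg.norm(converted - target, axis=1)          # D(y_tilde_m, y_m)
    d_src = np.maximum(np.linalg.norm(source - target, axis=1),  # D(x_m, y_m)
                       1e-12)                                    # guard against zero distance
    return float(np.mean(d_conv / d_src))
```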
4.3 Subjective Test

To evaluate the performance of spectral conversion as a postprocessing module of the TTS system, a GMM-based prosody conversion model [23] along with a pitch target model [24] were adopted to convert the pitch contour of each syllable. Let the syllable boundary denote [0, D]. The adopted pitch target model is illustrated as follows:

$$T(t) = at + b, \qquad y(t) = \beta \exp(-\lambda t) + at + b, \qquad 0 \le t \le D, \qquad (17)$$

where T(t) denotes the underlying pitch target and y(t) denotes the surface pitch contour. A pitch target model of one syllable can be represented by a set of parameters (a, b, β, λ). The Levenberg-Marquardt algorithm [25] was utilized for parameter estimation in a nonlinear regression process.
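For illustration, the following Python sketch fits the pitch target model of (17) to one syllable's F0 contour with the Levenberg-Marquardt algorithm, as in [24], [25]. It assumes the reconstructed form y(t) = β exp(-λt) + at + b; the function name and the initial guess are hypothetical, and scipy's least_squares is used here merely as a convenient LM implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_pitch_target(t, f0):
    """Fit (a, b, beta, lam) of Eq. (17) to sample times t in [0, D] and observed pitch f0."""
    t, f0 = np.asarray(t, float), np.asarray(f0, float)

    def residuals(p):
        a, b, beta, lam = p
        return beta * np.exp(-lam * t) + a * t + b - f0   # surface contour minus observation

    # crude starting point: flat target at the final value, exponential term carrying the onset
    init = np.array([0.0, f0[-1], f0[0] - f0[-1], 5.0])
    sol = least_squares(residuals, init, method="lm")      # Levenberg-Marquardt [25]
    return sol.x
```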
Fig. 8. Identification results for different conversion methods. (a) Single GMM, (b) single GMBM, (c) CART-based GMBM spectral conversion functions, (d) K-Means-based GMBM, (e) prosody conversion, (f) CART-based GMBM + prosody conversion, (g) K-Means-based GMBM + prosody conversion, and (h) K-Means-based GMBM with real target + prosody conversion.
The prosody conversion model was built as the regression on the joint GMM distribution of the source and target aligned prosody parameters. The prosodic feature vector of each syllable includes four parameters for the pitch contour, one parameter for energy, and one for duration. The STRAIGHT algorithm was used to alter the prosodic features. Each sentence in the test set was synthesized by the TTS system and further converted using the following conversion methods for each subsyllable in each emotion type:

1. single GMM spectral conversion function,
2. single GMBM spectral conversion function,
3. CART-based GMBM spectral conversion functions,
4. K-Means-based GMBM spectral conversion functions,
5. prosody conversion,
6. CART-based GMBM spectral conversion functions + prosody conversion,
7. K-Means-based GMBM spectral conversion functions + prosody conversion, and
8. K-Means-based GMBM spectral conversion functions with real target + prosody conversion.

All 300 sentence pairs were used to train the spectral conversion models and the GMM-based prosody conversion models for each emotion type. The total number of utterances presented to each listener was 420 (3 × 7 × 20). A double-blind experiment was conducted in the subjective study [26]. For each test sentence randomly selected from the test set, 20 converted utterances processed by each conversion method for each emotion type were randomly output to the human subjects. Twenty adult subjects, aged 22-32, were asked to classify each utterance as one of the three emotion types. The subjects were familiar with our study. Fig. 8 shows the identification results and indicates that the K-Means-based GMBM performed better than the CART-based method. Although prosody controls most of the emotional cues in speech, spectral conversion is still helpful in determining the emotional expression. The naturalness of the converted utterances was also evaluated according to a five-scale scoring method (5 = excellent; 1 = very poor). Fig. 9 compares the various conversion methods in terms of the MOS and its standard deviation. Analysis of variance (ANOVA) evaluations were conducted and revealed significant differences between methods at a significance level of p < 0.05 [26].
Fig. 9. MOS and standard deviation for different conversion methods. (a) Single GMM, (b) single GMBM, (c) CART-based GMBM spectral conversion functions, (d) K-Means-based GMBM, (e) prosody conversion, (f) CART-based GMBM + prosody conversion, (g) K-Means-based GMBM + prosody conversion, and (h) K-Means-based GMBM with real target + prosody conversion.
Fig. 10 gives an example of the converted spectrum and pitch contour of the sentence “chiau2, ta1 wei2 jiuan3 de0 nung2 hei1 chang2 fa3, hau3 mi2 ren2 o1.”
5 CONCLUSIONS
This study presents a conversion function clustering and selection framework that incorporates linguistic information into the design of the spectral conversion process. Based on the estimated linguistic and spectral similarities, the K-Means algorithm is adopted to build a multiple-function conversion model. Linguistic feature vectors are quantified based on the context of each conversion function, and the linguistic similarity is estimated by the cosine measure on the feature vectors. Spectral similarity is measured using the KL divergence on the joint distributions of the conversion functions. The GMBM is adopted for joint distribution modeling to characterize the temporal information. Unlike the GMM, the GMBM considers the temporal evolution using cross-covariance matrices between the current and previous feature vectors; however, the GMBM needs more training data and computational power than the GMM. Results of the objective experiments confirm that the proposed method outperforms the CART-based method in reducing the distortion between the converted and target emotional speech. The inclusion of linguistic information improves the modeling of conversion functions. Subjective tests reveal that more accurate spectral conversion improves the expression of emotional speech. This study thus provides a voice conversion framework for emotional speech synthesis with small-sized speech data. The system performance appears to depend on the size of the training data; accordingly, the addition of more training data would be expected to enhance system performance.
ACKNOWLEDGMENTS

The authors would like to thank the Ministry of Economic Affairs of the Republic of China, Taiwan, for financially supporting this research under Contract 92-EC-17-A-02-S1024. The authors also thank Dr. Kawahara for providing the STRAIGHT analysis/synthesis program.
Fig. 10. Example of the converted spectrum and pitch contour of the sentence “chiau2, ta1 wei2 jiuan3 de0 nung2 hei1 chang2 fa3, hau3 mi2 ren2 o1.” (a) Source speech, (b) converted speech, and (c) target speech.
REFERENCES

[1] I.R. Murray and J.L. Arnott, "Towards the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion," J. Acoustic Soc. Am., vol. 93, no. 2, pp. 1097-1108, 1993.
[2] M. Schröder, "Emotional Speech Synthesis—A Review," Proc. European Conf. Speech Comm. and Technology (EUROSPEECH '01), vol. 1, pp. 561-564, 2001.
[3] A. Iida, F. Higuchi, N. Campbell, and M. Yasumura, "A Corpus-Based Speech Synthesis System with Emotion," Speech Comm., vol. 40, nos. 1-2, pp. 161-187, 2003.
[4] H. Kawanami, Y. Iwami, T. Toda, H. Saruwatari, and K. Shikano, "GMM-Based Voice Conversion Applied to Emotional Speech Synthesis," Proc. European Conf. Speech Comm. and Technology (EUROSPEECH '03), pp. 2401-2404, 2003.
[5] K.E. Cummings and M.A. Clements, "Application of the Analysis of Glottal Excitation of Stressed Speech to Speaking Style Modification," Proc. Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP '93), vol. 2, pp. 207-210, Apr. 1993.
[6] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice Conversion through Vector Quantization," Proc. Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP '88), pp. 655-658, 1988.
[7] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous Probabilistic Transform for Voice Conversion," IEEE Trans. Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, 1998.
[8] A. Kain and M.W. Macon, "Spectral Voice Conversion for Text-to-Speech Synthesis," Proc. Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP '98), 1998.
[9] T. Toda, A.W. Black, and K. Tokuda, "Spectral Conversion Based on Maximum Likelihood Estimation Considering Global Variance of Converted Parameter," Proc. Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP '05), vol. 1, pp. 9-12, Mar. 2005.
[10] H. Duxans, A. Bonafonte, A. Kain, and J. van Santen, "Including Dynamic and Phonetic Information in Voice Conversion Systems," Proc. Int'l Conf. Speech and Language Processing (ICSLP '04), pp. 5-8, 2004.
[11] E.K. Kim, S. Lee, and Y.H. Oh, "Hidden Markov Model Based Voice Conversion Using Dynamic Characteristics of Speaker," Proc. European Conf. Speech Comm. and Technology (EUROSPEECH '97), vol. 5, pp. 2519-2522, Sept. 1997.
[12] C.H. Wu, C.C. Hsia, T.H. Liu, and J.F. Wang, "Voice Conversion Using Duration-Embedded Bi-HMMs for Expressive Speech Synthesis," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1109-1116, 2006.
[13] W.H. Tsai and W.W. Chang, "Discriminative Training of Gaussian Mixture Bigram Models with Application to Chinese Dialect Identification," Speech Comm., vol. 36, nos. 3-4, pp. 317-326, 2002.
[14] S. Kullback and R.A. Leibler, "On Information and Sufficiency," Annals of Math. Statistics, vol. 22, no. 1, pp. 79-86, Mar. 1951.
[15] C.H. Wu and J.H. Chen, "Automatic Generation of Synthesis Units and Prosodic Information for Chinese Concatenative Synthesis," Speech Comm., vol. 35, nos. 3-4, pp. 219-237, 2001.
[16] H. Kawahara, "Speech Representation and Transformation Using Adaptive Interpolation of Weighted Spectrum: Vocoder Revisited," Proc. Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP '97), vol. 2, pp. 1303-1306, 1997.
[17] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring Speech Representations Using a Pitch-Adaptive Time-Frequency Smoothing and an Instantaneous-Frequency-Based F0 Extraction: Possible Role of a Repetitive Structure in Sounds," Speech Comm., vol. 27, nos. 3-4, pp. 187-207, 1999.
[18] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Statistical Soc. B, vol. 39, pp. 1-38, 1977.
[19] C.D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[20] S.H. Chen, S.H. Hwang, and Y.R. Wang, "An RNN-Based Prosodic Information Synthesis for Mandarin Text-to-Speech," IEEE Trans. Speech and Audio Processing, vol. 6, no. 3, pp. 226-239, 1998.
[21] L.L. Chang et al., "Part-of-Speech (POS) Analysis on Chinese Language," technical report, Inst. of Information Science, Academia Sinica, 1989.
[22] C.H. Wu and Y.J. Chen, "Recovery of False Rejection Using Statistical Partial Pattern Trees for Sentence Verification," Speech Comm., vol. 43, pp. 71-88, 2004.
[23] J. Tao, Y. Kang, and A. Li, "Prosody Conversion from Neutral Speech to Emotional Speech," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1145-1154, 2006.
[24] X. Sun, "The Determination, Analysis, and Synthesis of Fundamental Frequency," PhD dissertation, Northwestern Univ., 2002.
[25] K. Levenberg, "A Method for the Solution of Certain Problems in Least Squares," Quarterly Applied Math., vol. 2, pp. 164-168, 1944.
[26] S. Shott, Statistics for Health Professionals. W.B. Saunders, 1990.
Chi-Chun Hsia received the BS degree in computer science from National Cheng Kung University, Tainan, Taiwan, in 2001. Currently, he is a PhD student in the Department of Computer Science and Information Engineering at National Cheng Kung University, Tainan, Taiwan. His research interests include digital signal processing, text-to-speech synthesis, natural language processing, and speech recognition. He is a student member of the IEEE.
Chung-Hsien Wu received the BS degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1981 and the MS and PhD degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, Republic of China, in 1987 and 1991, respectively. Since August 1991, he has been with the Department of Computer Science and Information Engineering at National Cheng Kung University, Tainan, Taiwan. He became a professor in August 1997. From 1999 to 2002, he served as the chairman of the department. He also worked at the Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, Cambridge, in the summer of 2003 as a visiting scientist. He is the editor in chief of the International Journal of Computational Linguistics and Chinese Language Processing. His research interests include speech recognition, text-to-speech, multimedia information retrieval, spoken language processing, and sign language processing for the hearing impaired. He is a senior member of the IEEE and a member of the International Speech Communication Association (ISCA) and the ROC Computational Linguistics Society (ROCLING).

Jian-Qi Wu received the BS degree in information engineering from I-Shou University, Kaohsiung, Taiwan, in 2004 and the MS degree in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2006. His research interests include digital signal processing, natural language processing, and text-to-speech synthesis.