IMPROVING HMM BASED SPEECH SYNTHESIS BY REDUCING OVER-SMOOTHING PROBLEMS

Meng Zhang(1), Jianhua Tao(2), Huibin Jia(3), Xia Wang(4)

(1)(2)(3) National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
(4) Nokia Research Centre, China

{(1)mzhang, (2)jhtao, (3)hbjia}@nlpr.ia.ac.cn
(4) [email protected]

HTS and pointed out that the quality of synthetic speech is degraded by three main factors: the vocoder, modeling accuracy, and over-smoothing. Many research efforts have been carried out to upgrade the quality of HTS. To improve model structure accuracy, methods such as hidden semi-Markov models (HSMMs) [8] and trajectory HMMs [9] have been investigated. New training frameworks have also been derived, e.g. the minimum generation error (MGE) criterion [10]. Other works focus on the over-smoothing problem, widely considered to be caused by the parameter generation algorithm, which generates smooth speech parameter trajectories; examples include the conditional speech parameter generation algorithm [11] and the speech parameter generation algorithm considering global variance [12]. Although many works have been proposed to improve the quality of synthetic speech of HTS, the correlation between these factors and the performance of HTS is still not very clear. We still do not know how each factor affects the quality of synthesized speech. Quality degradation after the vocoding process is a problem that every parameter-generation-based speech synthesis method needs to overcome; its alleviation depends on the progress of speech coding techniques, so this factor is not considered in this paper. As for modeling accuracy and over-smoothing, the first is more like a cause and the second more like a result: poor modeling accuracy may produce over-smoothed model parameters, and further degrade the quality of synthesized speech. In our work, we analyze modeling accuracy by separating it into two parts: model structure accuracy and training algorithm accuracy. Over-smoothing is likewise classified into two types: over-smoothing in the time domain and over-smoothing in the frequency domain [13].
This paper uses two methods to reduce the influence of model structure accuracy and training algorithm accuracy respectively. By comparing the results, we investigate which factors most influence the performance of HTS. The experiments show that over-smoothing in the frequency domain is the main factor
ABSTRACT
Although Hidden Markov Model based speech synthesis has been proved to have good performance, some factors still degrade the quality of synthesized speech: the vocoder, model accuracy and over-smoothing. This paper analyzes these factors separately, and modifications for removing different factors are proposed. Experimental results show that over-smoothing in the frequency domain mainly affects the quality of synthesized speech, whereas over-smoothing in the time domain can nearly be ignored. Time domain over-smoothing is generally caused by the model structure accuracy problem, and frequency domain over-smoothing is caused by the training algorithm accuracy problem. The currently used model structure is capable of representing speech without quality degradation, while the ML-estimation based parameter training algorithm causes perceptual distortion in synthesized speech. Modification of the parameter training algorithm is therefore more likely to improve the synthesis performance.

Index Terms— Hidden Markov Model, speech synthesis

1. INTRODUCTION
In the last few years, a statistical parametric speech synthesis method has been proposed and has received wide attention [1]. In this method, hidden Markov models (HMMs) are used to model the spectrum, pitch and duration of speech simultaneously in a unified framework [2]. Smoothed parameter sequences are generated from the HMMs, taking dynamic features into account, to produce synthetic speech [3]. HMM based speech synthesis (HTS) has been proved to have good performance [4][5][6] and several advantages. For instance, it can generate flexible characteristics of the synthetic voice, which can be easily modified, and HTS systems are almost language-independent with a small footprint compared to unit selection synthesis systems. However, they still suffer from buzzy and muffled synthesized speech. In [7], A. Black made a detailed analysis of the advantages and disadvantages of
978-1-4244-2942-4/08/$25.00 (c) 2008 IEEE
which influences the quality of synthesized speech, and that it is generally caused by the training algorithm accuracy problem. The rest of this paper is organized as follows. Section 2 analyzes the factors which degrade the quality of synthesized speech, and two methods for reducing the influences of model structure accuracy and training algorithm accuracy are proposed. In Section 3, we conduct experiments with the two methods; by comparing the synthetic results of the different types of modification, the influence of each factor on quality degradation is studied. Finally, the conclusion is presented in Section 4.
Fig. 1 The over-smoothing problem in frequency domain: (a) real spectrum, (b) synthesized spectrum
2. REDUCING OVER-SMOOTHING PROBLEMS

2.1. Over-smoothing problems and modeling accuracy

In our former work, we classified the over-smoothing problems of HMMs into two types according to their presentation: over-smoothing in the time domain and over-smoothing in the frequency domain [13], as shown in Figure 1 and Figure 2. Comparing the LSF trajectories of synthesized and real speech shows that frequency domain over-smoothing distorts the formant structure within each frame of synthetic speech. Over-smoothing in the frequency domain is mainly caused by an inappropriate parameter training algorithm, which leads to spectrum distortion: the most frequently used training algorithm for HMMs is ML-estimation, which can only generate parameters from the mean and covariance of the model states. Time domain over-smoothing may make the synthetic speech lose the detailed spectrum presentation of phoneme structures. In HMM based speech synthesis, the model structure is usually three or five effective states, left-to-right with no skip [5][6]. Over-smoothing in the time domain is mainly caused by this limited model structure together with the parameter generation algorithm, which takes account of the constraints between static and dynamic features and makes the trajectory of the spectral parameters less expressive in the time domain.
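Frequency domain over-smoothing shows up as reduced variance in each spectral dimension, which is the quantity the global variance method [12] targets. The following sketch (not part of the paper; the data and the 9-tap smoothing filter are illustrative assumptions) shows one simple way to diagnose it, by comparing per-dimension variance between a natural and a smoothed LSF trajectory:

```python
import numpy as np

def global_variance(lsf: np.ndarray) -> np.ndarray:
    """Per-dimension variance of an LSF trajectory (frames x order).

    Over-smoothed synthetic parameters typically show a smaller
    variance than natural speech in every dimension.
    """
    return lsf.var(axis=0)

def smoothing_ratio(natural: np.ndarray, synthesized: np.ndarray) -> np.ndarray:
    """Ratio below 1 in a dimension indicates over-smoothing there."""
    return global_variance(synthesized) / global_variance(natural)

# Toy data: the "synthesized" trajectory is the natural one low-pass
# filtered, mimicking the variance loss of ML-generated parameters.
rng = np.random.default_rng(0)
natural = rng.normal(size=(600, 24))      # 600 frames, 24th-order LSF
kernel = np.ones(9) / 9.0
synthesized = np.apply_along_axis(
    lambda dim: np.convolve(dim, kernel, mode="same"), 0, natural)

ratio = smoothing_ratio(natural, synthesized)
print(ratio.mean())   # well below 1: variance was smoothed away
```

The same ratio computed on real MS_speech or TA_speech parameters would separate the two over-smoothing types quantitatively rather than by visual inspection of Figures 1 and 2.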
Fig. 2 The over-smoothing problem in time domain: (a) real LSP trajectory, (b) synthesized LSP trajectory

2.2.1. Combine continuous HMMs with discrete HMMs
In this part of the work, we try to reduce the training accuracy problems by combining continuous HMMs with discrete HMMs [13]. In this method, we replace the output Gaussian function of the continuous HMM with the probability of a codevector in discrete HMMs, which resolves the over-smoothing problem in the frequency domain. The continuous spectrum parameters are represented as discrete codevector indexes by vector quantization (VQ). Discrete HMMs are constructed to obtain the output probability of each codevector, and multi-space probability distribution HMMs (MSD-HMMs) are constructed to generate the formant trajectory; both are important criteria for codevector selection. In addition, some statistics, such as the concatenation probability of codevectors, are also taken into account. A more detailed description of this work can be found in [13]. In this paper, since we are trying to find the influence of the training algorithm, the spectrum distortion introduced by the codevectors has to be avoided. We therefore construct a very large codebook which contains all speech frames of the full training corpus, and replace the mean of each state generated by the continuous HMM with the nearest spectrum from real speech. The method is illustrated in Figure 3. We believe the training accuracy problem is minimal in this synthetic speech, which is marked as MS_speech in the later experiments.
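The mean-replacement step described above can be sketched as a nearest-neighbour lookup over a codebook of natural frames. This is an illustrative reconstruction, not the authors' code; the shapes and random data are assumptions, and a real system would search all training-corpus frames:

```python
import numpy as np

def replace_means_with_codebook(state_means: np.ndarray,
                                codebook: np.ndarray) -> np.ndarray:
    """For every ML-estimated state mean, substitute the nearest real
    speech frame from the codebook (Euclidean distance), so the output
    spectrum is always a natural frame rather than a statistical average.
    """
    # Pairwise distances: (num_states, num_codewords)
    d = np.linalg.norm(state_means[:, None, :] - codebook[None, :, :], axis=2)
    return codebook[d.argmin(axis=1)]

rng = np.random.default_rng(1)
codebook = rng.normal(size=(1000, 24))   # stand-in for all training frames
means = rng.normal(size=(10, 24))        # stand-in for ML-estimated state means
replaced = replace_means_with_codebook(means, codebook)
print(replaced.shape)
```

Because every output row is copied verbatim from the codebook, the formant structure of each state comes from real speech, which is why MS_speech largely removes the frequency domain over-smoothing.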
2.2. Modification of HMMs

To analyze which part of modeling accuracy has more influence on the over-smoothing problems and the general quality of synthesized speech, we use two methods that reduce the influence of model structure accuracy and training accuracy respectively. By comparing their synthesized results, we can learn the relationship between speech quality and the different factors.
2.2.2. Increase the amount of HMM states

If we could construct a model able to contain the detailed spectrum distribution of each frame, there would be
Fig. 4 Preference score
Fig. 3 Overview of the HTS system combining continuous HMMs with discrete HMMs
no problem with over-smoothing in the time domain. Unfortunately, we can only use a limited number of HMM states to model a phoneme. Although the unit selection method can solve the time domain over-smoothing problem, it requires a very large speech corpus. What we can do is increase the number of states per HMM to reduce the influence of model structure accuracy. The typical number of HMM states for Chinese phonemes is three or five: the first is usually used for single vowels while the second is mainly used for compound vowels. This kind of model structure works very well for most speech recognition systems; however, its accuracy is not enough to simulate the detailed spectrum distribution of a phoneme in the time domain. To measure the influence of model structure accuracy, we increase the state count to five (for single vowels) and seven (for compound vowels). In the later experiments, we examine whether there is any difference in the synthesized results between the traditional model (with three HMM states) and this revised model (with five and seven states). We mark the results from the traditional HMM as HTS_speech, and the results from the revised model as TA_speech.
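The topology change above amounts to a simple per-phoneme rule. The helper below is a hypothetical sketch of that rule (the Pinyin vowel set is illustrative and incomplete); state counts match the paper: 3/5 for the baseline HTS_speech model, 5/7 for the revised TA_speech model, left-to-right with no skip in both cases:

```python
# Illustrative subset of Pinyin single vowels; compound vowels (ai, ao,
# ei, ou, ...) fall through to the larger state count.
SINGLE_VOWELS = {"a", "o", "e", "i", "u", "v"}

def num_states(phoneme: str, revised: bool = True) -> int:
    """Number of emitting HMM states for a vowel phoneme.

    Baseline: 3 (single vowel) or 5 (compound vowel).
    Revised model (TA_speech): 5 or 7, i.e. two extra states each.
    """
    base = 3 if phoneme in SINGLE_VOWELS else 5
    return base + 2 if revised else base

print(num_states("a"))                  # 5 in the revised model
print(num_states("ai"))                 # 7 in the revised model
print(num_states("a", revised=False))   # 3 in the baseline
```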
Fig. 5 LSF trajectory of MS_speech and TA_speech
3.2. Experimental results
Preference scores are shown in Figure 4. From the results we can see that the four sets of speech generally divide into three groups: (1) RL_speech; (2) MS_speech; (3) TA_speech and HTS_speech. Group (1) has a much higher preference score than groups (2) and (3), and within group (3), TA_speech is nearly the same as HTS_speech. Figure 5 shows the LSF trajectories of MS_speech and TA_speech. Considering Figure 4 and Figure 1 together, it is interesting that, compared to RL_speech and HTS_speech, MS_speech is only slightly over-smoothed in the frequency domain but more so in the time domain, whereas TA_speech is only slightly over-smoothed in the time domain but much more in the frequency domain.
3. EXPERIMENTS FOR STUDYING INFLUENCE ON QUALITY DEGRADATION

3.1. Experiment setup
To analyze the relationship between the quality of synthetic speech and each factor, four sets of speech are generated: RL_speech (real speech), MS_speech, TA_speech and HTS_speech (HTS generated speech). Each set has 42 test sentences; sentences across sets have parallel context and prosody information but different speech quality. Ten subjects participated in the test. They were presented with pairs of parallel speech from two sets in random order and asked which one had better quality. In order to remove the vocoder factor from the subjective test, all real speech (RL_speech) was also passed through the vocoding process without changing any spectral feature.
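A pairwise preference test of this kind can be scripted straightforwardly. The sketch below is an assumed reconstruction of the trial design (the paper does not publish its scripts): every unordered pair of the four systems is presented for every sentence, with left/right order randomized so listeners cannot learn a positional bias.

```python
import itertools
import random

def make_ab_trials(systems, num_sentences, seed=0):
    """Build AB preference trials: one trial per sentence per system
    pair, with presentation order shuffled within and across trials."""
    rng = random.Random(seed)
    trials = []
    for sent in range(num_sentences):
        for a, b in itertools.combinations(systems, 2):
            pair = [a, b]
            rng.shuffle(pair)                  # randomize left/right
            trials.append((sent, pair[0], pair[1]))
    rng.shuffle(trials)                        # randomize trial order
    return trials

systems = ["RL_speech", "MS_speech", "TA_speech", "HTS_speech"]
trials = make_ab_trials(systems, num_sentences=42)
print(len(trials))   # 42 sentences x C(4,2) = 6 pairs = 252 trials
```

Each of the 10 subjects would then judge the 252 trials, and the per-system win counts give the preference scores plotted in Figure 4.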
3.3. Discussion

Figure 5 shows that MS_speech has only a little over-smoothing in the frequency domain but much in the time domain compared to HTS_speech and RL_speech. This implies that frequency domain over-smoothing is mainly caused by the training algorithm accuracy problem, since MS_speech alleviates this factor. Similarly, from the comparison between TA_speech and HTS_speech, time domain over-smoothing is mainly caused by the model structure accuracy problem. This conclusion agrees with intuition.
representing speech without quality degradation. ML-estimation based training causes perceptual distortion in synthesized speech. Modification of the parameter training algorithm is more likely to improve the synthesis performance. These conclusions can guide further modifications and improvements of HTS.

Fig. 6 Illustration of the influences affecting the quality of synthesized speech

5. ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation of China (No. 60575032) and the 863 Programs (No. 2006AA01Z138, No. 2006AA01Z194).
Furthermore, the preference score results show that although the model structure accuracy problem causes time domain over-smoothing, it hardly affects the quality of synthesized speech. The main reason for quality degradation is over-smoothing in the frequency domain, which is induced by the training algorithm accuracy problem. The final conclusion is summarized in Figure 6. The model parameter training algorithm for the HMMs used in HTS is based on ML-estimation, which was originally applied in speech recognition. For speech recognition, the parameters of an HMM need to be "medial" enough to represent the data distribution of a unit such as a phoneme; it is unnecessary to consider the perception or quality of the model parameters. But for speech synthesis, parameters that are too "medial" may harm the quality of synthesized speech and induce perceptual distortion; parameters for synthesis sometimes need to be "individual" to generate expressive speech. This may be the reason that ML-estimation based training causes frequency domain over-smoothing and degrades the quality of synthesized speech. The improvement brought by minimum generation error (MGE) training for HTS [10], which modifies the training algorithm, is further proof of this conclusion; conversely, this conclusion also explains the success of MGE. The high preference score of MS_speech compared to TA_speech demonstrates that the model structure generally used in HTS is capable of representing speech without quality degradation. Currently, the parameter generation algorithm ignores the dependency between the dimensions of the spectral parameters, so it may introduce spectral distortion and affect the speech quality.
6. REFERENCES
[1] K. Tokuda, T. Kobayashi, and S. Imai, "Speech parameter generation from HMM using dynamic features," in Proc. of ICASSP, 1995, pp. 660–663.
[2] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in Proc. of Eurospeech, 1999, vol. 5, pp. 2347–2350.
[3] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc. of ICASSP, 2000, vol. 3, pp. 1315–1318.
[4] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, "Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005," IEICE Trans. Inf. & Syst., vol. E90-D, no. 1, pp. 325–333, 2007.
[5] H. Zen, T. Toda, and K. Tokuda, "The Nitech-NAIST HMM-based speech synthesis system for the Blizzard Challenge 2006," in Blizzard Challenge Workshop, 2006.
[6] Z.-H. Ling, Y.-J. Wu, Y.-P. Wang, L. Qin, and R.-H. Wang, "USTC system for Blizzard Challenge 2006: an improved HMM-based speech synthesis method," in Blizzard Challenge Workshop, 2006.
[7] A. Black, H. Zen, and K. Tokuda, "Statistical parametric speech synthesis," in Proc. of ICASSP, 2007, pp. 1229–1232.
[8] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Hidden semi-Markov model based speech synthesis," in Proc. of Interspeech, 2004, pp. 1185–1180.
[9] H. Zen, K. Tokuda, and T. Kitamura, "An introduction of trajectory model into HMM-based speech synthesis," in Proc. of ISCA SSW5, 2004.
[10] Y.-J. Wu and R.-H. Wang, "Minimum generation error training for HMM-based speech synthesis," in Proc. of ICASSP, 2006, pp. 89–92.
[11] T. Masuko, K. Tokuda, and T. Kobayashi, "A study on conditional parameter generation from HMM based on maximum likelihood criterion," in Autumn Meeting of ASJ, 2003, pp. 209–210.
[12] T. Toda and K. Tokuda, "Speech parameter generation algorithm considering global variance for HMM-based speech synthesis," in Proc. of Eurospeech, 2005, pp. 2801–2804.
[13] J. Yu, M. Zhang, J. Tao, and X. Wang, "A novel HMM-based TTS system using both continuous HMMs and discrete HMMs," in Proc. of ICASSP, 2007.
4. CONCLUSION

This paper analyzes the factors which affect the quality of synthesized speech in HTS: the vocoder, modeling accuracy and over-smoothing. Modeling accuracy is separated into two parts, model structure accuracy and training algorithm accuracy, which respectively cause time domain over-smoothing and frequency domain over-smoothing. Modifications for reducing the different factors are proposed. Experimental results show that over-smoothing in the frequency domain mainly affects the quality of synthesized speech, whereas over-smoothing in the time domain can nearly be ignored. The currently used model structure is capable of