Using Hybrid HMM-based Speech Segmentation to Improve Synthetic Speech Quality

Iosif Mporas, Alexandros Lazaridis, Todor Ganchev and Nikos Fakotakis
Artificial Intelligence Group, Wire Communications Laboratory
Dept. of Electrical and Computer Engineering, University of Patras
26500 Rion, Greece

Abstract-The automatic phonetic time-alignment of speech databases is essential for the development cycle of a Text-to-Speech (TTS) system. Furthermore, the quality of the synthesized speech signals is strongly related to the precision of the produced alignment. In the present work we study the performance of a new HMM-based speech segmentation method. The method is based on hybrid embedded and isolated-unit trained models, and has been shown to improve phonetic segmentation accuracy on the multiple-speaker task. Here it is employed on the single-speaker segmentation task, utilizing a Greek speech database. The evaluation of the method showed significant improvement in terms of phonetic segmentation accuracy, as well as in the perceptual quality of the synthetic speech, when compared to the baseline system.

I. INTRODUCTION

In recent years, corpus-based concatenative speech synthesis has been the most widely used method in Text-to-Speech (TTS) systems [1]. This method has become popular due to the high quality of the synthetic voice that it provides, as well as the improvement in naturalness and intelligibility it offers. Corpus-based concatenative speech synthesis shows supremacy over other TTS methods as a result of the use of corpus-based prosodic rules and efficient search algorithms for unit selection.
The main characteristic of corpus-based TTS methods is the use of large databases. Specifically, the use of large databases with annotated transcriptions allows the selection and concatenation of the appropriate sequences of units for the construction of the synthetic speech signal. In this way, the original speech signals are not significantly modified or corrupted, resulting in synthetic speech signals close to the original ones.
The main problem of the corpus-based approaches is the need for an annotated database. In particular, large databases are required to ensure that an appropriate occurrence of the unit we are looking for during the selection process exists in the database, i.e. a specific unit with a specific left and right context and specific prosody. Thus, the larger the database, the more likely it is that an appropriate unit occurrence can be found and selected to produce natural synthetic speech. Another drawback is that in order to create new synthetic voices or different speaking styles, or to adapt an existing voice to a new domain, new recordings have to be made.
Much of the effort needed to build a unit-selection TTS system is spent in the preparation of the database. Most speech synthesis systems utilize phones as units during the selection

process. Thus, the speech database has to be annotated at the phonetic level. The phonetic transcriptions can be derived from the word transcriptions, which are usually available from the recording prompts, utilizing a pronunciation dictionary or letter-to-sound rules. The most difficult part of the database preparation is the phonetic time-alignment of the speech recordings.
Phonetic time-alignment, or explicit segmentation of speech [2], is usually performed manually by expert phoneticians, since this is the most precise way to detect phonetic boundaries. However, manual phonetic segmentation of speech waveforms is a tedious, expensive and time-consuming task. Thus, several methods for the automatic segmentation of speech signals have been proposed. Depending on the availability of word/phonetic transcriptions, one can choose between linguistically constrained and unconstrained segmentation methods. Since the database of a TTS system under construction consists of prompted speech, here we consider the linguistically constrained case (explicit segmentation).
Several explicit segmentation approaches have been proposed in the literature. In [3], Malfrere et al. proposed the alignment of synthetic speech against natural speech, using the dynamic time warping (DTW) algorithm. Keshet et al. [4] introduced a phonetic alignment algorithm based on discriminative learning. In [5], Torkkola described a method for the automatic alignment of speech waveforms using neural networks, followed by boundary refinement using heuristic speech-specific knowledge. The most commonly and successfully used approach for the task of explicit segmentation is based on hidden Markov models (HMMs), which are well established and widely used in speech processing. In [6], Pellom and Hansen examine HMM-based segmentation performance under noise conditions. In [7], Brugnara et al. present an HMM-based architecture for speech segmentation. In [8], Adell et al. present a comparative study of automatic phone segmentation methods for text-to-speech. Finally, in [9] Mporas et al. introduced a hybrid HMM-based method for speech segmentation, consisting of iterative isolated-unit training of phone recognizers, initialized from embedded training. The hybrid HMM-based method has been shown to significantly improve speech segmentation performance in the case of the multiple-speaker TIMIT [10] database.
In the present work we examine the performance of this method for the purpose of building a high-quality synthetic voice.

[Figure 1 depicts the pipeline: the SPEECH DATA passes through FEATURE EXTRACTION to produce the feature sequence, while the TEXT TRANSCRIPTION passes through a LETTER TO SOUND CONVERTER to produce the phone sequence; both feed the HMM PHONE RECOGNIZER and VITERBI TIME ALIGNMENT, which output the PHONETIC LABELS.]

Fig. 1. Block diagram of the baseline HMM-based speech segmentation method.

Specifically, we comparatively evaluate the performance of the baseline HMM-based method and the hybrid HMM-based one, both in terms of segmentation performance and in terms of the perceptual quality of the synthetic speech as judged by human listeners.
The outline of this paper is as follows. In Section II we provide an overview of the two evaluated segmentation methods, i.e. the baseline and the hybrid HMM-based time-alignment approaches. Section III provides a detailed description of the experimental setup that was followed and the speech database that was used. The experimental results of the evaluation are presented in Section IV. Section V concludes this work.

II. METHODS DESCRIPTION

In the present section we review the baseline HMM-based phonetic segmentation method, as well as the hybrid HMM-based method initially proposed in [9].

A. Baseline HMM-based speech segmentation

A typical structure of the HMM-based automatic phonetic segmentation method is illustrated in Fig. 1. In HMM-based segmentation, a phone recognizer is employed for segmenting the speech signals. In detail, the word-level transcription of the speech utterance is typically converted to the corresponding phonetic sequence, using a letter-to-sound converter. A speech parameterization technique is used to decompose the speech signals into sequences of feature vectors. Subsequently, the Viterbi algorithm [11] is utilized to force-align the parametric vector sequences against the corresponding phonetic transcriptions, with respect to the HMM models of the phone recognizer, i.e. the HMM models corresponding to the phonetic sequence are concatenated into a unified HMM network, and each feature vector is mapped to one of the network's HMM states.
Depending on the availability of bootstrap training data with phonetic time-alignments, two major approaches for the training of the HMM-based phone models exist: isolated-unit and embedded training. When bootstrap data are available, isolated-unit training is performed, using the Viterbi algorithm, where each HMM model is trained exclusively on the speech segments of the corresponding phone and all HMMs are trained independently of each other. When bootstrap data are not available, embedded training is applied. Embedded training does not require any prior knowledge of the phone boundaries for the bootstrap data set.
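To make the forced-alignment step concrete, the following is a minimal numpy sketch of Viterbi alignment over a single concatenated left-to-right state chain. It assumes the per-frame emission log-likelihoods are already available and uses fixed placeholder transition costs; it is illustrative only, not the HTK-based implementation used in this work.

```python
# Viterbi forced alignment over a linear (left-to-right) HMM state chain.
# log_likes: (T frames, S states) emission log-likelihoods for the chain;
# only self-loops and single-step forward transitions are allowed.
import numpy as np

def force_align(log_likes, trans_self=-0.1, trans_next=-2.3):
    T, S = log_likes.shape                  # requires T >= S
    delta = np.full((T, S), -np.inf)        # best path score ending in state s
    back = np.zeros((T, S), dtype=int)      # backpointers
    delta[0, 0] = log_likes[0, 0]           # alignment must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s] + trans_self
            move = delta[t - 1, s - 1] + trans_next if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1
            delta[t, s] = max(stay, move) + log_likes[t, s]
    path = np.zeros(T, dtype=int)           # backtrace from the final state
    path[-1] = S - 1
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Toy usage: 3 phones x 3 states = 9 chain states, 30 frames of random scores.
rng = np.random.default_rng(0)
states = force_align(rng.normal(size=(30, 9)))
# Phone boundaries are the frames where the state index enters a new phone.
boundaries = np.nonzero(np.diff(states // 3))[0] + 1
print(states, boundaries)
```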

Fig. 2. Block diagram of the hybrid HMM-based speech segmentation method.

Instead, the models are flat-started on the training data and simultaneously re-estimated through the Baum-Welch algorithm [12]. It has been shown that isolated-unit training results in more accurate segmentation compared to embedded training [13]. However, when, for example, we build a new voice, new speech data are recorded and phonetic time-alignment is needed. In such cases, the alternatives are (i) to manually annotate the whole database, (ii) to manually annotate a part of the database in order to construct a bootstrap training set and automatically segment the rest of the recordings (a problem that may arise here is how large this part should be), or (iii) to rely on embedded techniques to train the phone models.

B. Hybrid HMM-based speech segmentation

The above-mentioned drawbacks are alleviated by the hybrid HMM training method proposed in [9]. The block diagram of the hybrid HMM training method for speech segmentation is illustrated in Fig. 2. The method takes advantage of the capability of embedded techniques to train HMMs without requiring any information about the phonetic time-alignment. On the other hand, it exploits the capability of isolated-unit training to more accurately model the spectral characteristics of the phones.
As Fig. 2 illustrates, a first set of HMM models is initially trained using the embedded HMM training described above. This first phone recognizer is used to segment the speech waveforms of interest into the corresponding phones. Afterwards, the extracted phone boundary predictions are used as reference boundary positions. An iterative process then begins, in which isolated-unit training of the HMM models is performed, and the predictions of the previous iteration serve as reference phone boundary positions for the next iteration. The iterative training is terminated when the overall boundary shift between two successive iterations reaches a predefined threshold. The process described so far, involving both embedded and isolated-unit techniques, leads to automatically estimated phone labels, whose annotations are refined iteratively. This hybrid architecture can be applied directly to the training and time-alignment of speech data, or alternatively, to training HMM-based phone models on a bootstrap subset and then exploiting them to segment other speech waveforms.
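The control flow of the hybrid scheme can be summarized in a few lines. In the sketch below, the four callables (embedded_train, isolated_train, segment, boundary_shift) are hypothetical placeholders standing in for the corresponding HTK-based steps of [9] and must be supplied by the user; the 1 ms convergence threshold is likewise an assumed value, not one taken from the paper.

```python
# Hybrid HMM training: embedded bootstrap followed by iterative
# isolated-unit refinement of the phone boundaries.
from typing import Callable, Sequence, Tuple

def hybrid_train(
    waveforms: Sequence, phone_seqs: Sequence,
    embedded_train: Callable,   # Baum-Welch training, no alignments needed
    isolated_train: Callable,   # Viterbi-style training from given boundaries
    segment: Callable,          # forced alignment with the current models
    boundary_shift: Callable,   # mean |shift| (ms) between two boundary sets
    threshold_ms: float = 1.0,
) -> Tuple[object, list]:
    # Stage 1: embedded training bootstraps models without any alignments.
    models = embedded_train(waveforms, phone_seqs)
    boundaries = segment(models, waveforms, phone_seqs)
    # Stage 2: iterate isolated-unit training, feeding each iteration's
    # predicted boundaries back in as the next iteration's references.
    while True:
        models = isolated_train(waveforms, phone_seqs, boundaries)
        new_boundaries = segment(models, waveforms, phone_seqs)
        if boundary_shift(boundaries, new_boundaries) < threshold_ms:
            return models, new_boundaries   # boundaries have converged
        boundaries = new_boundaries
```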

III. EXPERIMENTAL SETUP

To evaluate the performance of the baseline and the hybrid HMM-based segmentation methods, as well as their effect on the perceptual quality of the synthetic speech signals, we utilized the HTK hidden Markov model toolkit [14] and the WCL-1 prosodic database [15].

A. The Speech Database

The present performance evaluation was carried out utilizing the WCL-1 prosodic database [15]. The WCL-1 database is considered a linguistically and prosodically rich corpus, appropriate for training speech synthesis systems. The contents of the corpus were extracted from passages, literature and newspapers, and/or were composed by a professional linguist. The WCL-1 database consists of 5500 words distributed over 500 paragraphs, each of which may be a single word, a short sentence, a long sentence, or a sequence of sentences. The speech waveforms consist of recordings of a 30-year-old Greek female professional radio actress. The final corpus includes 390 declarative sentences, 44 exclamatory sentences, 36 decision questions and 24 wh-questions. All speech recordings were sampled at a frequency of 44.1 kHz and were further downsampled to 16 kHz for the needs of our evaluation.

B. The Phone Models

During pre-processing, all speech waveforms were frame-blocked every 5 milliseconds, utilizing a 20-millisecond Hamming window. Pre-emphasis with a factor equal to 0.97 was performed, employing a first-order FIR filter. For every speech frame, we computed the first 12 Mel frequency cepstral coefficients [14] and the 0-th cepstral coefficient. The delta and double-delta coefficients of the 13 static parameters referred to above were appended. Thus, we consider a feature vector composed of 39 speech parameters.
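For illustration, this 39-dimensional front-end can be approximated with librosa as a stand-in for the HTK front-end actually used here; the exact filterbank, liftering and coefficient-ordering conventions differ from HTK's, and the input file name is hypothetical.

```python
# Approximate the front-end: pre-emphasis, 20 ms Hamming window, 5 ms shift,
# 13 cepstral coefficients plus deltas and double deltas (39 dimensions).
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical recording
y = librosa.effects.preemphasis(y, coef=0.97)     # first-order pre-emphasis
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,                        # C0 + 12 cepstral coeffs
    hop_length=int(0.005 * sr),                   # 5 ms frame shift
    win_length=int(0.020 * sr),                   # 20 ms window
    window="hamming")
features = np.vstack([mfcc,
                      librosa.feature.delta(mfcc),            # delta
                      librosa.feature.delta(mfcc, order=2)])  # double delta
assert features.shape[0] == 39                    # 13 static + 13 + 13
```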


All words of the evaluation corpus were converted to their corresponding phone sequences, utilizing a set of 35 phones. This phone set is a modification of the SAMPA alphabet [16] for the Greek language. All speech waveforms were manually annotated at the phonetic level by an expert in Greek phonetics.
Each phone was modelled by a left-to-right HMM, with 3 states and without skip transitions. The states of the HMMs were modelled by 1, 2, 4, and 8 continuous Gaussian distributions. It has been shown in the literature that context-independent HMMs achieve higher segmentation accuracy in the task of phone segmentation [17, 18]. Therefore, in the present evaluation we consider context-independent HMMs.

IV. EXPERIMENTAL RESULTS

In all experiments we used the whole speech data set both to train the HMM phone models and to time-align the corresponding sequences of phonetic labels. As a first step, we measured the performance of the two HMM-based methods in terms of segmentation accuracy; afterwards, we examined the effect of the automatically produced phone boundaries on the perceptual quality of the synthetic speech signals.

A. Speech Segmentation Performance

The segmentation accuracy was evaluated in terms of tolerance, i.e. the percentage of the boundaries which were predicted within a distance smaller than t milliseconds from the hand-labelled boundaries. Additionally, we measured the performance of the two evaluated phone segmentation methods in terms of mean absolute error (MAE). The obtained results are shown in Table I. In each case, the score of the best performing segmentation method is marked with an asterisk (*).
As can be seen in Table I, the hybrid method outperformed the baseline HMM-based segmentation, similarly to [9], where the method had been tested on recordings from multiple speakers. As the experimental results indicate, the best performance in terms of MAE was achieved with 1 mixture for both methods. The results are consistent with the case of 20 milliseconds of tolerance, which is considered an acceptable limit for producing good-quality synthetic speech [19, 20]. The superior performance of the HMM models with fewer Gaussian mixtures is due to the inherent variance of the spectrum in the vicinity of a phonetic transition, which could make a simpler model more adequate [17]. Another reason could be the amount of data, i.e. some phones might not have enough occurrences to robustly train HMM state distributions with many Gaussians.
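As an aside, both accuracy measures reduce to a few lines of code. The following is a minimal sketch, assuming the predicted and reference boundaries (in milliseconds) are paired one-to-one, as forced alignment against a fixed phone sequence guarantees; the sample values are illustrative only.

```python
# Tolerance: percentage of predicted boundaries within t ms of the
# hand-labelled references. MAE: mean absolute boundary error in ms.
import numpy as np

def tolerance(pred_ms, ref_ms, t=20.0):
    errors = np.abs(np.asarray(pred_ms) - np.asarray(ref_ms))
    return 100.0 * np.mean(errors <= t)   # percentage within t ms

def mae(pred_ms, ref_ms):
    return float(np.mean(np.abs(np.asarray(pred_ms) - np.asarray(ref_ms))))

pred = [103.0, 215.0, 298.0]
ref  = [100.0, 220.0, 310.0]
print(tolerance(pred, ref, t=10.0), mae(pred, ref))  # 66.67 and 6.67
```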

TABLE I
SPEECH SEGMENTATION ACCURACY FOR THE TWO EVALUATED METHODS

Method    # mixtures   t≤5ms    t≤10ms   t≤15ms   t≤20ms   t≤25ms   t≤30ms   MAE (ms)
Baseline  1            21.40    41.10    66.25    84.46    93.50    98.07    11.8137
Baseline  2            22.79    44.54    65.90    81.29    90.37    97.69    13.1437
Baseline  4            21.05    42.26    62.89    77.40    86.90    94.22    15.0019
Baseline  8            21.19    42.63    65.95    82.27    90.91    95.34    12.1529
Hybrid    1            32.44    59.03    76.68*   90.96*   95.62*   98.32    9.3625*
Hybrid    2            26.99    53.21    75.47    88.72    94.29    98.77*   9.3832
Hybrid    4            36.35*   60.94*   74.10    87.77    94.27    97.74    9.6704
Hybrid    8            28.34    54.70    73.56    86.65    94.17    98.16    9.8739

TABLE II
QUALITY RATING SCALE FOR THE MOS TEST

Rating   Quality
5        Excellent
4        Good
3        Fair
2        Poor
1        Bad

B. Perceptual Quality of Synthetic Speech

In order to investigate whether the improvement in the accuracy of the automatic speech segmentation methods has an impact on the overall quality of the synthetic speech, a subjective test must be performed. The necessity of subjective tests lies in the fact that they are considered more reliable than objective ones, since they are directly based on ratings coming from human listeners. One of the best known and most widely used subjective tests is the mean opinion score (MOS) [21], a procedure based on absolute category rating (ACR) [22]. In this subjective test, the quality of every utterance is rated by human listeners according to the 5-point scale shown in Table II.
For the needs of our evaluation tests, 20 different sentences were synthesized using each of the 8 segmentation results of Table I. This resulted in 160 sentences, to which 20 more sentences using the manual phonetic time-alignment were appended. A corpus-based unit-selection TTS system was built based on the Festival speech synthesis framework [23]. In order for the MOS test to be able to capture the differences in synthetic speech quality among the 8 segmentation results, the test set was selected with great care so as to contain speech units for which the segmentation methods produce significant differences in the segmentation boundaries [24]. Furthermore, a second constraint must be taken into consideration, according to [25]: in each test sentence, one of the diphones must not appear in the training set, incorporating in this way at least one "difficult" part in the test sentence with respect to the segmentation task.
Fifteen listeners participated in this test. The listeners were native Greek speakers. A training phase, consisting of listening to 3 sentences of each case, took place in order for the listeners to become familiar with the synthetic speech synthesized in all cases. The listeners were allowed to listen to each test sentence as many times as they wished in order to rate it according to the MOS scale. Finally, it should also be noted that the sentences of the test set were chosen from the same categories as the training sentences, and no overlap between the two sets existed.
The results of the MOS test are presented in Table III. As can be seen, the hybrid method outperforms the baseline method for all numbers of mixtures (1, 2, 4, 8). The hybrid method using 1 mixture (marked with an asterisk in Table III) was rated 3.29, which is the highest MOS score among all the automatic segmentation methods, rating higher than the best baseline case (also using 1 mixture) by 0.52.
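Aggregating the listener ratings into a per-system MOS is straightforward; the sketch below also attaches a normal-approximation 95% confidence interval, which the paper itself does not report, and uses illustrative ratings rather than the actual listener data.

```python
# Mean opinion score per system, with a 95% confidence interval on the mean.
import numpy as np

ratings = {"baseline_1mix": [3, 3, 2, 3, 3],   # illustrative 1-5 ratings
           "hybrid_1mix":   [4, 3, 3, 4, 3]}
for system, scores in ratings.items():
    s = np.asarray(scores, dtype=float)
    mean = s.mean()
    ci = 1.96 * s.std(ddof=1) / np.sqrt(len(s))   # 95% CI half-width
    print(f"{system}: MOS = {mean:.2f} +/- {ci:.2f}")
```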

TABLE III
RESULTS OF THE MOS TEST

Method     # mixtures   MOS
Baseline   1            2.77
Baseline   2            2.53
Baseline   4            2.37
Baseline   8            2.61
Hybrid     1            3.29*
Hybrid     2            3.13
Hybrid     4            3.06
Hybrid     8            2.98
Manually   -            3.47

Consequently, the synthetic speech synthesized using the database segmented with the hybrid method and 1 mixture was found by the listeners to be of better quality. In addition, it must be pointed out that the MOS score of this method was very close to the score obtained for the synthetic speech produced using the manually segmented labels, indicating how close in quality the two synthetic speech signals were. Furthermore, comparing the subjective results with the objective ones (i.e. the segmentation accuracy), it can be seen that the quality of the synthetic speech is strongly correlated with the segmentation accuracies shown in Table I, especially for the cases of 15, 20 and 25 milliseconds of tolerance.
The experimental results described above indicate the dependence of the quality of the produced synthetic speech on the precision of the database's phonetic time-alignments. This is in agreement with previous studies [24, 26], where human evaluation tests showed an improvement in the quality of the synthetic speech signals with the introduction of more effective segmentation methods.

V. CONCLUSION

The development of corpus-based Text-to-Speech systems is strongly related to the availability of speech databases with phonetic labels and time-alignments. Since manual phonetic segmentation of speech is difficult, automatic phonetic segmentation methods are needed in order to shorten the development cycle of speech synthesis systems. In the present work we studied the performance of a new HMM-based speech segmentation method. The method is based on hybrid embedded and isolated-unit trained models. It was tested on a database of a single Greek female speaker. The experimental results showed that the hybrid HMM-based method offered a significant improvement in speech segmentation accuracy compared to the baseline HMM-based method. Furthermore, the human evaluation tests showed that using the time-alignments produced by the hybrid method to build a speech synthesizer results in synthetic speech of significantly better quality than that of the baseline method. Finally, the use of the hybrid HMM-based method leads to synthetic speech signals of quality close to that produced when using manual time-alignments. We deem that this method can offer an advantage in the development of Text-to-Speech systems, since it reduces the development time while preserving satisfying quality of the synthesized speech signals.

ACKNOWLEDGMENT

This work was supported by the MoveOn project (IST-2005-034753), which is co-funded by the FP6 of the European Commission.

REFERENCES

[1] A.J. Hunt and A.W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," Proceedings of IEEE ICASSP'96, pp. 373-376, 1996.
[2] J.P. van Hemert, "Automatic segmentation of speech," IEEE Trans. Signal Processing, vol. 39(4), pp. 1008-1012, 1991.
[3] F. Malfrere, O. Deroo, T. Dutoit, and C. Ris, "Phonetic alignment: speech synthesis-based vs. Viterbi-based," Speech Communication, vol. 40, pp. 503-515, 2003.
[4] J. Keshet, S.S. Shwartz, Y. Singer, and D. Chazan, "Phoneme alignment based on discriminative learning," Proceedings of Interspeech'05, pp. 2961-2964, 2005.
[5] K. Torkkola, "Automatic alignment of speech with phonetic transcription in real time," Proceedings of IEEE ICASSP'88, pp. 611-614, 1988.
[6] B.L. Pellom and J.H. Hansen, "Automatic segmentation of speech recorded in unknown noisy channel characteristics," Speech Communication, vol. 25, pp. 97-116, 1998.
[7] F. Brugnara, D. Falavigna, and M. Omologo, "Automatic segmentation and labeling of speech based on hidden Markov models," Speech Communication, vol. 12, pp. 357-370, 1993.
[8] J. Adell, A. Bonafonte, J.A. Gomez, and M.J. Castro, "Comparative study of automatic phone segmentation methods for TTS," Proceedings of IEEE ICASSP'05, pp. 309-312, 2005.
[9] I. Mporas, T. Ganchev, and N. Fakotakis, "A hybrid architecture for automatic segmentation of speech waveforms," Proceedings of IEEE ICASSP'08, pp. 4457-4460, 2008.
[10] J. Garofolo, "Getting started with the DARPA-TIMIT CD-ROM: an acoustic phonetic continuous speech database," National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA, 1988.
[11] G.D. Forney, "The Viterbi algorithm," Proceedings of the IEEE, vol. 61, no. 3, pp. 268-278, 1973.
[12] L.E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Annals of Mathematical Statistics, vol. 41(1), pp. 164-171, 1970.

[13] F. Brugnara, D. Falavigna, and M. Omologo, "Automatic segmentation and labeling of speech based on hidden Markov models," Speech Communication, vol. 12, pp. 357-370, 1993.
[14] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, "The HTK Book (for HTK Version 3.4)," Cambridge University Engineering Department, 2006.
[15] P. Zervas, N. Fakotakis, and G. Kokkinakis, "Development and evaluation of a prosodic database for Greek speech synthesis and research," Journal of Quantitative Linguistics, vol. 15(2), pp. 154-184, 2008.
[16] J.C. Wells, "SAMPA computer readable phonetic alphabet," in D. Gibbon, R. Moore, and R. Winski (eds.), Handbook of Standards and Resources for Spoken Language Systems, Berlin and New York: Mouton de Gruyter, 1997, Part IV, Section B.
[17] D.T. Toledano, L.A.H. Gomez, and L.V. Grande, "Automatic phonetic segmentation," IEEE Trans. on Speech and Audio Processing, vol. 11, no. 6, pp. 617-625, 2003.
[18] A. Ljolje and M.D. Riley, "Automatic speech segmentation for concatenative inventory selection," in Progress in Speech Synthesis, Springer, pp. 305-311, 1997.
[19] J. Matousek, D. Tihelka, and J. Psutka, "Automatic segmentation for Czech concatenative speech synthesis using statistical approach with boundary-specific correction," Proceedings of Eurospeech'03, pp. 301-304, 2003.
[20] L. Wang, Y. Zhao, M. Chu, J. Zhou, and Z. Cao, "Refining segmental boundaries for TTS database using fine contextual-dependent boundary models," Proceedings of IEEE ICASSP'04, vol. 1, pp. 641-644, 2004.
[21] ITU-T Recommendation P.800.1, "Mean opinion score (MOS) terminology," 2003.
[22] ITU-T Recommendation P.910, "Subjective video quality assessment methods for multimedia applications," 2008.
[23] A. Black and P. Taylor, "The Festival speech synthesis system: system documentation," Technical Report HCRC/TR-83, Human Communication Research Centre, University of Edinburgh, Scotland, UK, 1997.
[24] S. Jarifi, D. Pastor, and O. Rosec, "A fusion approach for automatic speech segmentation of large corpora with application to speech synthesis," Speech Communication, vol. 50, no. 1, pp. 67-80, 2008.
[25] J. Kominek and A.W. Black, "A family-of-models approach to HMM-based segmentation for unit selection speech synthesis," Proceedings of Interspeech'04, pp. 1385-1388, 2004.
[26] K.S. Lee, "MLP-based phone boundary refining for a TTS database," IEEE Trans. Audio, Speech, and Language Processing, vol. 14(3), pp. 981-989, 2006.