SUPPORT VECTOR REGRESSION FUSION SCHEME IN PHONE DURATION MODELING

Alexandros Lazaridis, Iosif Mporas, Todor Ganchev and Nikos Fakotakis

Artificial Intelligence Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rion 26500, Greece
Tel. +30 2610 996496, Fax. +30 2610 997336
{alaza, imporas, fakotaki}@upatras.gr, [email protected]

ABSTRACT

A fusion scheme of phone duration models (PDMs) is presented in this work. Specifically, a support vector regression (SVR) fusion model is fed with the predictions of a group of independent PDMs operating in parallel. The American-English KED TIMIT and the Greek WCL-1 databases are used for evaluating the PDMs and the fusion scheme. The fusion scheme improves the accuracy over the best individual model, achieving a relative reduction of the mean absolute error (MAE) and the root mean square error (RMSE) of 1.9% and 2.0% on KED TIMIT, and of 2.6% and 1.8% on WCL-1, respectively. Moreover, a perceptual evaluation test was performed to assess the impact of this accuracy improvement on synthetic speech. The test showed that the accuracy improvement achieved by the SVR fusion would contribute to the naturalness of synthetic speech.

Index Terms— Phone duration modeling, statistical modeling, speech synthesis, text-to-speech

1. INTRODUCTION

In speech synthesis, accurate modeling of prosody is mandatory for producing synthetic speech of high quality. The main aspects of prosody are the phone durations, the fundamental frequency and the loudness of speech. Phone durations control the rhythm and the tempo of speech [1]. Consequently, flattening the prosody of synthetic speech leads to monotonous, rhythmless speech, degrading its quality and especially its naturalness. In recent decades, phone duration modeling has been applied in speech synthesis through two main categories of models: rule-based methods [2], which are based on manually produced rules, and data-driven or statistical methods [1,3,4], which are based on statistical and artificial neural network (ANN)-based techniques. The main drawback of rule-based methods is the need for manual rule extraction, which is a time-consuming procedure. This can

978-1-4577-0539-7/11/$26.00 ©2011 IEEE


be overcome with the use of data-driven approaches and large databases for the automatic production of these rules. The related features can be extracted from several levels of speech representation, such as the phonetic, prosodic, linguistic and morphosyntactic levels.

This work is an extension of the study presented in [5], where a fusion-based phone duration modeling scheme relying on multiple independent predictors was proposed. These models operate on common inputs and their predictions are combined through a fusion model. Since different algorithms perform better under different conditions, we assume that an appropriate combination of their outputs results in more precise phone duration predictions. In this work we briefly present the fusion scheme and evaluate the individual models and the fusion model using objective tests. Furthermore, since the phone duration model is a component of TTS systems (e.g. through the target function in unit selection, in hidden Markov model approaches, or in emotional speech synthesis) [6-8], evaluating the impact of the accuracy improvement of the fusion scheme on the naturalness of speech is of practical interest. Thus, a subjective evaluation is performed to investigate whether the improvement in accuracy of the fusion scheme over the individual models translates into improved naturalness of synthetic speech.

The remainder of this paper is organized as follows. In Section 2 we outline the proposed fusion scheme. In Section 3 we briefly describe the individual phone duration modeling algorithms, the speech databases and the experimental setup used in the evaluation. The experimental results are presented and discussed in Section 4, and the work is concluded in Section 5.

2. SVR-FUSION MODELING

The goal of phone duration modeling is the accurate prediction of the durations of the uttered phones. The proposed fusion scheme is presented in Fig. 1.
Multiple independent phone duration models (PDMs) operating on a common input predict the durations of the phones. The scheme relies on the combination of these predictions through the fusion stage, where a machine learning algorithm uses them for

ICASSP 2011

obtaining more precise phone duration predictions. In the proposed scheme, the SVR algorithm is used as the fusion model. The training and operational phases of the proposed fusion scheme are presented next.

Fig. 1. Block diagram of the fusion scheme using multiple predictions.

In the training phase of the proposed scheme, which is a two-step procedure, two non-overlapping datasets are used: the training set and the development set. In the first step, the individual PDMs are built using the training dataset. In the second step, these models process the development dataset, producing a set of phone duration predictions. The fusion algorithm is then trained on these predictions, together with the ground-truth labels (manually annotated tags). This procedure can be formalized as follows. Let us define a set of N individual PDMs, PDM_n, with 1 ≤ n ≤ N. Moreover, let X_j^p be an M-dimensional feature vector, consisting of numeric and non-numeric features, for the jth instance (1 ≤ j ≤ J) of the phone p:

    X_j^p = [θ_1, θ_2, ..., θ_m, ..., θ_M]^T,   j = 1, 2, ..., J,        (1)

where θ_m is the mth feature (1 ≤ m ≤ M) of the feature vector X_j^p, which is used for training the N individual PDMs, PDM_n. The trained individual PDMs then process the development dataset in order to produce a set of phone duration predictions, y_j^{p,n}, where y_j^{p,n} is the prediction of the nth duration model for the jth instance of the phone p:

    y_j^{p,n} = f_{PDM_n}^p(X_j^p),   j = 1, 2, ..., J,        (2)

where y_j^{p,n} ∈ ℝ. The vector Y_j^p is constructed by stacking together the predictions of the individual PDMs:

    Y_j^p = [y_j^{p,1}, y_j^{p,2}, ..., y_j^{p,N}]^T,   j = 1, 2, ..., J,        (3)

where 1 ≤ n ≤ N, for the jth instance of the phone p. This vector, together with the ground-truth labels, is used for training the fusion model.

In the operational phase, the test dataset is used for evaluating the proposed scheme. Specifically, the input vector X_j^p for the jth instance of the phone p is fed to the N individual PDMs, PDM_n, with 1 ≤ n ≤ N (refer to Fig. 1). The outputs of the individual PDMs, y_j^{p,n}, computed as in Eq. (2), are appended together, creating the vector of predictions Y_j^p. This vector serves as input to the fusion model, which computes the final phone duration prediction for the jth instance as:

    O_j^p = g^p(Y_j^p),   j = 1, 2, ..., J,        (4)

with O_j^p ∈ ℝ.

Based on the observation that predictors relying on different machine learning algorithms err in dissimilar ways, we expect the proposed fusion scheme to improve the accuracy of phone duration modeling. A fusion model capable of learning the proper mapping between a set of predictions and the true phone durations should therefore lead to more accurate duration predictions.
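The two-step training procedure and the operational phase described above can be sketched as follows. This is a minimal illustration using scikit-learn with synthetic data: `SVR` stands in for the Weka SMOreg model used in the paper, and the choice of base regressors, split sizes and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
# Sketch of the SVR-fusion (stacking) scheme on synthetic data.
# Assumptions: sklearn.svm.SVR replaces Weka's SMOreg; three arbitrary
# base regressors replace the paper's eight PDMs.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))                  # stand-in feature vectors X_j
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=600)  # stand-in durations

# Non-overlapping splits: train (individual PDMs), dev (fusion), test.
X_tr, y_tr = X[:360], y[:360]
X_dev, y_dev = X[360:540], y[360:540]
X_te, y_te = X[540:], y[540:]

# Step 1: train the N individual duration models on the training set.
pdms = [LinearRegression(),
        DecisionTreeRegressor(max_depth=5, random_state=0),
        BaggingRegressor(random_state=0)]
for m in pdms:
    m.fit(X_tr, y_tr)

# Step 2: stack the PDM predictions on the dev set into vectors Y_j (Eq. 3)
# and train the SVR fusion model g(.) on them.
Y_dev = np.column_stack([m.predict(X_dev) for m in pdms])
fusion = SVR(kernel="rbf", C=10.0)
fusion.fit(Y_dev, y_dev)

# Operational phase: stack test-set predictions and apply the fusion model (Eq. 4).
Y_te = np.column_stack([m.predict(X_te) for m in pdms])
y_hat = fusion.predict(Y_te)
print(y_hat.shape)
```

The key design point is that the fusion model is trained on the development set, which the individual PDMs never saw during their own training; training it on the PDMs' training-set predictions would let it learn from overfitted outputs.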

3. EXPERIMENTAL SETUP

We evaluate the proposed scheme by employing eight different phone duration modeling algorithms, which have been reported successful on the duration modeling task. Specifically, these algorithms are: linear regression (Linear Regression) [3], regression (m5pR) and model (m5p) trees [9], additive regression (Add. Reg. m5pR and Add. Reg. REPTrees) [10] and bagging (Bagging m5pR and Bagging REPTrees) [11] algorithms. The latter four are meta-learning algorithms using regression trees as base learners. Finally, the support vector regression (SVR) model (SMOreg) [12], which employs the sequential minimal optimization (SMO) algorithm for training a support vector classifier [13], was used. The SVR algorithm was also used as the fusion algorithm of the proposed fusion scheme (SMOreg-fusion). Since these algorithms belong to different categories of numerical prediction methods, we expect them to err in different ways.

In the evaluation experiments two databases were used: the American-English speech database CSTR US KED TIMIT [14] and the Modern Greek speech database WCL-1 [15]. KED TIMIT consists of 453 phonetically balanced sentences (approximately 3400 words) uttered by a male native speaker of American English. The phone set provided with the database [14], consisting of 44 phones, was adopted for the experiments on the KED TIMIT database. The WCL-1 database consists of 5500 words distributed over 500 paragraphs, each of which may be a single word, a short sentence, a long sentence, or a sequence of sentences, uttered by a female professional radio actress. Respectively, the phone set provided with the database [15], consisting of 34 phones, was adopted for the experiments using the WCL-1

database. The manually labeled phone durations provided with the databases were used in our experiments, together with 33 other features. In brief, eight phonetic, three phone-level, thirteen syllable-level, two word-level, one phrase-level and six accentual features were used. In addition to these features, which concern the current instance, for some of them the corresponding information about the one or two previous and next instances (temporal context information) was also used. The overall size of the feature vector is 93 [5].

Furthermore, the 10-fold cross-validation technique was utilized. In each fold the training data were split into two portions, the training dataset and the development dataset. The first set, amounting to approximately 60% of the full dataset, was used for training the individual PDMs. The second set, amounting to approximately 30% of the full dataset, was used for training the fusion model. The test dataset, amounting to approximately 10% of the full dataset, was used for evaluating the performance of the eight individual PDMs as well as that of the fusion scheme. The two most commonly used figures of merit, namely the mean absolute error (MAE) and the root mean squared error (RMSE) [1,4] between the predicted duration and the actual duration of each phone, were used.
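The two figures of merit can be computed directly from the per-phone predictions; a small worked example with made-up durations in milliseconds (not values from the paper):

```python
# MAE and RMSE between actual and predicted phone durations (ms).
# The four durations below are illustrative, not taken from the databases.
import numpy as np

y_true = np.array([55.0, 80.0, 120.0, 60.0])   # actual durations
y_pred = np.array([50.0, 85.0, 110.0, 62.0])   # predicted durations

mae = np.mean(np.abs(y_true - y_pred))          # mean of |5, -5, 10, -2| = 5.5
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2)) # sqrt(154 / 4) ≈ 6.2
print(mae, rmse)
```

Note that RMSE penalizes large errors more heavily than MAE, which is why the two metrics can rank models slightly differently.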

4. EXPERIMENTAL RESULTS

4.1. Objective evaluation

First we examined the performance of the eight individual models, and subsequently the proposed fusion scheme was evaluated. The MAE and the RMSE for all the individual models and for the fusion scheme are shown in Table 1: Table 1(a) and Table 1(b) present the results for the KED TIMIT and WCL-1 databases, respectively. In each part of the table, the fusion scheme is listed first, followed by the individual models in order of decreasing accuracy. As can be seen in Table 1, the individual SVR model (SMOreg) outperformed all the other individual models on both databases. Specifically, on the KED TIMIT database, the individual SMOreg outperformed the second-best model, Add. Reg. m5pR, by 5.5% and 3.7% in terms of MAE and RMSE, respectively. On the WCL-1 database, the SMOreg model outperformed the second-best model, Linear Regression, by approximately 6.8% and 3.7% in terms of MAE and RMSE, respectively. The improvement in accuracy achieved by the SMOreg models on both databases is explained by the ability of support vector machines (SVMs) to cope better with high-dimensional feature spaces [16] than the other algorithms. Moreover, the proposed fusion scheme outperformed the best individual PDM (SMOreg) on both databases.


Table 1. Mean absolute error (MAE) and root mean square error (RMSE), in milliseconds, for the individual models and the fusion scheme on: (a) the KED TIMIT database, and (b) the WCL-1 database.

(a) KED TIMIT database
Model                 MAE    RMSE
SMOreg-fusion         14.66  20.14
SMOreg                14.95  20.56
Add. Reg. m5pR        15.82  21.35
Add. Reg. REPTrees    16.29  22.19
Bagging m5pR          16.51  22.14
m5p                   16.62  22.23
Bagging REPTrees      16.69  23.04
m5pR                  16.93  22.72
Linear Regression     17.15  22.89

(b) WCL-1 database
Model                 MAE    RMSE
SMOreg-fusion         16.35  24.76
SMOreg                16.78  25.21
Linear Regression     18.00  26.19
Add. Reg. REPTrees    18.08  26.94
Add. Reg. m5pR        18.13  26.38
Bagging m5pR          18.14  26.72
m5p                   18.31  27.17
Bagging REPTrees      18.93  27.77
m5pR                  19.07  27.71

Specifically, the SMOreg-fusion model outperformed the individual SMOreg model by 1.9% and 2.0% in terms of MAE and RMSE on KED TIMIT, and by 2.6% and 1.8% on WCL-1, respectively. We used the Wilcoxon test [17] to verify the statistical significance of the accuracy improvement obtained with the fusion scheme over the best individual model (SMOreg). The test was performed at a confidence level of 95% on the results obtained on both databases (KED TIMIT and WCL-1), and the accuracy improvements were found statistically significant (p-values of 5.77e-9 and 3.5e-11, respectively).
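A paired significance check of this kind can be reproduced with `scipy.stats.wilcoxon`, which performs the Wilcoxon signed-rank test on paired samples. The per-phone absolute errors below are synthetic stand-ins, not the paper's data:

```python
# Wilcoxon signed-rank test on paired per-phone absolute errors.
# The error arrays are synthetic; in the paper they would come from the
# best individual model (SMOreg) and the SMOreg-fusion on the test set.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
err_best = np.abs(rng.normal(loc=15.0, scale=4.0, size=200))            # best individual PDM
err_fusion = err_best - np.abs(rng.normal(loc=0.4, scale=0.3, size=200))  # fusion: slightly lower errors

stat, p = wilcoxon(err_best, err_fusion)  # paired test on the differences
print(p < 0.05)  # significant at the 95% confidence level for this synthetic data
```

Because the test is paired per phone instance, it is sensitive to small but consistent improvements, which is exactly the situation with the 1.9-2.6% relative gains reported above.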

4.2. Subjective evaluation

Fifteen subjects, 6 females and 9 males, participated in the perceptual evaluation test. The mean opinion score (MOS) [18] was used for the perceptual evaluation. In a MOS test the quality of every utterance is rated by human listeners on a 5-point scale (1 - Unsatisfactory, 2 - Poor, 3 - Fair, 4 - Good, 5 - Excellent). For each database, 10 sentences of the test dataset were used. The subjects were presented with one stimulus of unmodified original speech (Original) and three stimuli of speech modified according to the phone durations predicted by three models: the second-best individual model (Second-best), the best individual model (SMOreg) and the fusion scheme (SMOreg-fusion). The PRAAT software [19] was used to modify the phone durations of the original speech stimuli according to the predicted durations, and the Pitch Synchronous Overlap Add (PSOLA) [20] method was used for synthesizing these stimuli.

In Table 2 the results are presented on the MOS scale, along with the standard deviation of the scores (STD), for the KED TIMIT and WCL-1 databases. As expected, for both databases the original speech (Original) achieved the best scores: 4.75 and 4.82 on the MOS scale for KED TIMIT and WCL-1, respectively. Among the duration models, the SMOreg-fusion models achieved the highest scores, 3.63 and 3.77 for KED TIMIT and WCL-1 respectively, followed by the individual SMOreg models with 3.51 and 3.65 respectively, both outperforming the second-best individual models on both databases. The results of the MOS tests are thus in line with the results of the objective tests. Finally, we applied the Wilcoxon test [17] to the results of the perceptual test, verifying that the differences in the MOS scores are statistically significant. The test was performed at a confidence level of 95%. Specifically, for the individual SMOreg model with respect to the second-best model, the p-values were 2.58e-8 for KED TIMIT and 1.97e-9 for WCL-1; for the SMOreg-fusion model with respect to the individual SMOreg model, the p-values were 1.31e-5 for KED TIMIT and 2.21e-5 for WCL-1.

Table 2. Perceptual evaluation test results (MOS scale).

(a) KED TIMIT database
Stimuli         MOS   STD
Original        4.75  0.48
Second-best     3.30  0.83
SMOreg          3.51  0.81
SMOreg-fusion   3.63  0.71

(b) WCL-1 database
Stimuli         MOS   STD
Original        4.82  0.42
Second-best     3.41  0.84
SMOreg          3.65  0.74
SMOreg-fusion   3.77  0.68

5. CONCLUSIONS

In this work we presented a fusion scheme for accurate phone duration modeling, based on the combination of multiple predictions estimated by independent duration models. The SVR-fusion scheme improved the accuracy over the best individual model (SVR). Furthermore, the perceptual evaluation, in terms of a mean opinion score test, showed a corresponding improvement in the naturalness of the speech. Consequently, the improvement in phone duration modeling achieved with the fusion scheme, when implemented in TTS systems, would contribute to the improvement of the quality of synthetic speech.

6. REFERENCES


[1] J. Yamagishi, H. Kawai, and T. Kobayashi, "Phone duration modeling using gradient tree boosting", Speech Communication, vol. 50, no. 5, pp. 405-415, 2008.
[2] D.H. Klatt, "Linguistic uses of segmental duration in English: Acoustic and perceptual evidence", Journal of the Acoustical Society of America, vol. 59, pp. 1209-1221, 1976.
[3] K. Takeda, Y. Sagisaka, and H. Kuwabara, "On sentence-level factors governing segmental duration in Japanese", Journal of the Acoustical Society of America, vol. 86, no. 6, pp. 2081-2087, 1989.
[4] J.P.H. van Santen, "Contextual effects on vowel durations", Speech Communication, vol. 11, pp. 513-546, 1992.
[5] A. Lazaridis, I. Mporas, T. Ganchev, G. Kokkinakis, and N. Fakotakis, "Improving phone duration modeling using support vector regression fusion", Speech Communication (in press).
[6] O. Goubanova and S. King, "Bayesian networks for phone duration prediction", Speech Communication, vol. 50, no. 4, pp. 301-311, 2008.
[7] K. Tokuda, H. Zen, and A. Black, "An HMM-based speech synthesis system applied to English", Proc. IEEE Speech Synthesis Workshop, 2002.
[8] V. Strom, R. Clark, and S. King, "Expressive prosody for unit-selection speech synthesis", Proc. Interspeech, 2006.
[9] R.J. Quinlan, "Learning with continuous classes", Proc. 5th Australian Joint Conference on Artificial Intelligence, Singapore, pp. 343-348, 1992.
[10] J.H. Friedman, "Stochastic gradient boosting", Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367-378, 2002.
[11] L. Breiman, "Bagging predictors", Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[12] J. Platt, "Fast training of support vector machines using sequential minimal optimization", in B. Scholkopf, C. Burges, and A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, pp. 185-208, 1999.
[13] A.J. Smola and B. Scholkopf, "A tutorial on support vector regression", NeuroCOLT Tech. Rep. TR-1998-030, Royal Holloway College, London, U.K., 1998.
[14] CSTR, "CSTR US KED TIMIT", University of Edinburgh, 2001, http://www.festvox.org/dbs/dbs_kdt.html.
[15] P. Zervas, N. Fakotakis, and G. Kokkinakis, "Development and evaluation of a prosodic database for Greek speech synthesis and research", Journal of Quantitative Linguistics, vol. 15, no. 2, pp. 154-184, 2008.
[16] V. Vapnik, "Statistical Learning Theory", Wiley, New York, 1998.
[17] F. Wilcoxon, "Individual comparisons by ranking methods", Biometrics, vol. 1, pp. 80-83, 1945.
[18] ITU-T Recommendation P.800.1, "Mean opinion score (MOS) terminology", 2003.
[19] P. Boersma and D. Weenink, "Praat: doing phonetics by computer", http://www.fon.hum.uva.nl/praat/.
[20] E. Moulines and F. Charpentier, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones", Speech Communication, vol. 9, no. 5-6, pp. 453-467, 1990.