Speech Communication 48 (2006) 45–56 www.elsevier.com/locate/specom

An evaluation of cost functions sensitively capturing local degradation of naturalness for segment selection in concatenative speech synthesis

Tomoki Toda a,b,*, Hisashi Kawai b,c, Minoru Tsuzaki b,d, Kiyohiro Shikano a

a Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
b ATR Spoken Language Communication Research Laboratories, 2-2-2, Hikaridai, "Keihanna Science City", Kyoto 619-0288, Japan
c KDDI R&D Laboratories, 2-1-15 Ohara, Kamifukuoka, Saitama 356-8502, Japan
d Kyoto City University of Arts, 13-6 Kutsukake-cho, Oe, Nishikyo-ku, Kyoto 610-1197, Japan

Received 11 May 2004; received in revised form 12 February 2005; accepted 27 May 2005

Abstract

In this paper, we evaluate various cost functions for selecting a segment sequence in terms of the correspondence between the cost and perceptual scores of the naturalness of synthetic speech. The results demonstrate that the conventional average cost, which shows the degradation of naturalness over the entire synthetic utterance, has better correspondence to the perceptual scores than the maximum cost, which shows the worst local degradation of naturalness. Furthermore, it is shown that the root mean square (RMS) cost, which takes into account both the average cost and the maximum cost, has the best correspondence. We also show that the naturalness of synthetic speech can be improved by using the RMS cost for segment selection. Then, we investigate the effects of applying the RMS cost to segment selection in comparison to those of applying the average cost. Experimental results show that in segment selection based on the RMS cost, a larger number of concatenations causing slight local degradation are performed so that concatenations causing greater local degradation are avoided.

© 2005 Elsevier B.V. All rights reserved.

PACS: 43.72.Ja

Keywords: Segment selection; Cost function; Perceptual evaluation; RMS cost

* Corresponding author. Present address: Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan. Tel.: +81 743 72 5282; fax: +81 743 72 5289. E-mail addresses: [email protected] (T. Toda), [email protected] (H. Kawai), [email protected] (M. Tsuzaki), [email protected] (K. Shikano).

0167-6393/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.specom.2005.05.011


T. Toda et al. / Speech Communication 48 (2006) 45–56

1. Introduction

In corpus-based concatenative text-to-speech (TTS), speech synthesis based on segment selection (Sagisaka, 1988) has recently become the focus of much synthesis-related work (Syrdal et al., 2000). In segment selection, an optimum sequence of segments is selected from a speech corpus by minimizing a cost that captures the degradation of naturalness. Therefore, in order to synthesize natural-sounding speech, it is important to use a cost that corresponds to perceptual characteristics (Lee, 2001; Peng et al., 2002). In the process of designing cost functions, however, it is doubtful whether this correspondence is preserved, because approximations and assumptions are involved, e.g., using acoustic measures that are not accurate enough to capture perceptual characteristics (Klabbers and Veldhuis, 2001; Stylianou and Syrdal, 2001), and assuming independence among various factors. Moreover, the validity of the commonly used average cost (Campbell and Black, 1997; Hunt and Black, 1996), which shows the degradation of naturalness over the entire synthetic utterance, has not always been well investigated, so it is possible that local degradation of naturalness has a considerable effect on the naturalness of synthetic speech. Therefore, optimization of the cost function in terms of correspondence of the cost to perceptual measures is still worthwhile.

In this paper, we first investigate the correspondence between our cost and the perceptual scores. Then various functions for integrating local costs of individual segments into a global cost over the entire segment sequence are evaluated in terms of correspondence between the cost and the mean opinion score (MOS) as determined from the results of perceptual experiments. As a result, we show that the root mean square (RMS) cost has the best correspondence to the perceptual scores. The RMS cost is sensitive both to the global trend of the local costs (as captured by the average cost) and to particularly large local costs (as captured by the maximum cost). We also clarify that the naturalness of synthetic speech can be improved by using the RMS cost in segment selection. Furthermore, we compare segment selection based on the RMS cost with that based on the average cost. Selected segment sequences are analyzed from various points of view to clarify how the local degradation of naturalness can be alleviated by using the RMS cost.

The paper is organized as follows: In Section 2, our cost function for the segment selection is described. In Section 3, we present a perceptual evaluation of the costs. In Section 4, the effectiveness of using the RMS cost in segment selection is analyzed. Finally, we summarize this paper in Section 5.

2. Cost function for segment selection

The cost function for segment selection is viewed as a mapping, as shown in Fig. 1, of objective features, e.g., acoustic measures and linguistic information, into a perceptual measure. A cost is considered to be the predicted perceptual measure that is expected to capture the degradation of synthetic speech naturalness. In this paper, only phonetic information is directly used as a predictor variable of the cost. Other linguistic information is used in an indirect manner: it is first used to predict targets of prosodic parameters, which are then used as predictor variables of the cost.

Fig. 1. Schematic diagram of cost function. The cost function maps observable features into a perceptual measure.

The components of the cost function should be determined on the basis of results of perceptual experiments. However, it is almost impossible in practice to experimentally map acoustic measures into a perceptual measure except when the acoustic features have simple structures, as in the case of F0 or phoneme duration. In most cases, acoustic features have such complex structures that this kind of brute force experiment is infeasible. Although various studies have been conducted to search for an acoustic measure that can capture perceptual characteristics, nothing satisfactory has been found so far (Klabbers and Veldhuis, 2001; Stylianou and Syrdal, 2001; Ding et al., 1998; Wouters and Macon, 1998). On the other hand, phonetic information can be mapped into perceptual measures from perceptual experiments (Kawai and Tsuzaki, 2002). However, acoustic measures that can represent the characteristics of instances of waveform segments are still necessary because phonetic information can only evaluate the difference between phonetic categories.

2.1. Local cost

The local cost shows the degradation of naturalness caused by using an individual candidate segment. In our system, the cost function is defined as a weighted sum of the five sub-cost functions shown in Table 1. Each sub-cost reflects either source information or vocal tract information. We have not yet performed perceptual experiments for some sub-costs. For such sub-costs, we define the sub-cost functions using acoustic measures.

Table 1
Sub-cost functions

Source information
  Prosody (F0, duration)        C_pro
  F0 discontinuity              C_F0
Vocal tract information
  Phonetic environment          C_env
  Spectral discontinuity        C_spec
  Phonetic inappropriateness    C_app

The C_pro sub-cost captures the degradation of naturalness caused by the difference in prosodic parameters (F0 contour and phoneme duration) between a candidate segment and the target. This sub-cost function is estimated from the results of perceptual experiments.

The C_F0 sub-cost captures the degradation of naturalness caused by an F0 discontinuity at a segment boundary. This sub-cost is calculated as the distance based on the log-scaled F0 at the boundary.

The C_env sub-cost captures the degradation of naturalness caused by the mismatch of phonetic environments between a candidate segment and the target. This sub-cost is determined from the results of perceptual experiments. Even if there is no mismatch of phonetic environments, the sub-cost does not always become 0 because it also reflects the difficulty of concatenation caused by the uncertainty of segmentation.

The C_spec sub-cost captures the degradation of naturalness caused by the spectral discontinuity at a segment boundary. This sub-cost is calculated as the weighted sum of a mel-cepstral distortion between frames of a segment and those of the preceding segment around the boundary. Mel-cepstral coefficients are calculated from the smoothed spectrum obtained by the STRAIGHT analysis-synthesis method (Kawahara et al., 1999).

The C_app sub-cost denotes the phonetic inappropriateness and captures the degradation of naturalness caused by using outlying segments. The sub-cost is calculated as the mel-cepstral distortion between the mean vector of a candidate segment and that of the target.

The local cost LC(u_i, t_i) at a candidate segment u_i for a target phoneme t_i is given by

\[ LC(u_i, t_i) = w_{pro} \cdot C_{pro}(u_i, t_i) + w_{F_0} \cdot C_{F_0}(u_i, u_{i-1}) + w_{env} \cdot C_{env}(u_i, u_{i-1}) + w_{spec} \cdot C_{spec}(u_i, u_{i-1}) + w_{app} \cdot C_{app}(u_i, t_i), \tag{1} \]

\[ w_{pro} + w_{F_0} + w_{env} + w_{spec} + w_{app} = 1, \tag{2} \]

where w_pro, w_F0, w_env, w_spec, and w_app denote weights for the individual sub-costs. In this paper, these weights are equal, i.e., 0.2. All sub-costs are normalized so that they have positive values with the same mean. The preceding segment u_{i-1} is a candidate segment for the (i-1)th target phoneme t_{i-1}. When the candidate segments u_{i-1} and u_i are connected in the corpus, no concatenation is performed between them, and the boundary sub-costs C_F0, C_env, and C_spec become 0.

In order to represent the costs in a simpler form, we divide the five sub-costs into two commonly used costs, i.e., a target cost C_t and a concatenation cost C_c (Campbell and Black, 1997; Hunt and Black, 1996). These costs are given by

\[ C_t(u_i, t_i) = \frac{w_{pro}}{w_t} \cdot C_{pro}(u_i, t_i) + \frac{w_{app}}{w_t} \cdot C_{app}(u_i, t_i), \tag{3} \]

\[ C_c(u_i, u_{i-1}) = \frac{w_{env}}{w_c} \cdot C_{env}(u_i, u_{i-1}) + \frac{w_{spec}}{w_c} \cdot C_{spec}(u_i, u_{i-1}) + \frac{w_{F_0}}{w_c} \cdot C_{F_0}(u_i, u_{i-1}), \tag{4} \]

\[ w_t = w_{pro} + w_{app}, \tag{5} \]

\[ w_c = w_{F_0} + w_{env} + w_{spec}, \tag{6} \]

and then the local cost is written as

\[ LC(u_i, t_i) = w_t \cdot C_t(u_i, t_i) + w_c \cdot C_c(u_i, u_{i-1}), \tag{7} \]

\[ w_t + w_c = 1. \tag{8} \]

2.2. Integrated cost

In segment selection, the optimum set of segments for an utterance is selected from a speech corpus. Therefore, we need to integrate local costs for individual segments into a cost for a segment sequence. This cost is referred to as an integrated cost in this paper. The average cost (AC) is often used as the integrated cost (Campbell and Black, 1997; Hunt and Black, 1996), and it is given by

\[ AC = \frac{1}{N} \sum_{i=1}^{N} LC(u_i, t_i), \tag{9} \]

where N denotes the number of targets in the utterance. The target t_0 and the candidate u_0 show the silences before the utterance, and t_N and u_N show the silences after the utterance. The sub-costs C_pro and C_app are set to 0 for the pause. Minimizing the average cost is equivalent to minimizing the sum of the local costs in the selection.

Because the average cost shows the degradation of naturalness over the entire synthetic utterance, a segment with a large cost can be included in the output sequence of segments even if that sequence is optimal in terms of the average cost. It might be assumed that the largest cost in the sequence, i.e., the local degradation of naturalness, would have considerable effect on the degradation of naturalness in synthetic speech. To investigate this assumption, let us define the maximum cost (MC) as the integrated cost given by

\[ MC = \max_i \{ LC(u_i, t_i) \}, \quad 1 \le i \le N. \tag{10} \]

In order to further evaluate various integrated costs between these two types of integration methods, let us use the norm cost, NC_p, given by

\[ NC_p = \left[ \frac{1}{N} \sum_{i=1}^{N} \{ LC(u_i, t_i) \}^{p} \right]^{1/p}. \tag{11} \]

When the coefficient p is set to 1, the norm cost is equal to the average cost. When p is set to infinity, the norm cost is equal to the maximum cost. Thus, this norm cost takes into account both the mean value and the maximum value by varying p. In this paper, we try to find an optimum value of p based on perceptual experiments.

2.3. Segment selection

In our Japanese concatenative TTS, which is now under development, segment selection is basically performed with syllable units (CV or V, C: consonant, V: vowel), i.e., concatenations at C–V boundaries are prohibited (Toda et al., 2002). However, concatenations at certain phoneme centers, i.e., vowel centers preceding voiced phonemes and unvoiced fricative centers, are also allowed in order to alleviate audible discontinuity. This algorithm allows the utilization of both phoneme units and diphone units and is similar to the AT&T NextGen TTS system (Syrdal et al., 2000; Conkie, 1999), which performs segment selection based on half-phoneme units.
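The integrated costs of Section 2.2 and their use in segment selection can be illustrated with a small sketch. It is not the search procedure of the system described here, which the text does not specify; it assumes a standard Viterbi search, which is exact for any finite p because minimizing NC_p is equivalent to minimizing the sum of LC^p over the sequence. The function names, the toy lattice, and its cost values are hypothetical, and the first unit is assumed to carry no concatenation cost.

```python
import math
from typing import List, Sequence


def norm_cost(local_costs: Sequence[float], p: float) -> float:
    """Norm cost NC_p of Eq. (11): p=1 is AC, p=2 is RMS, p=inf is MC."""
    if math.isinf(p):
        return max(local_costs)
    return (sum(c ** p for c in local_costs) / len(local_costs)) ** (1.0 / p)


def local_costs_of(path, target, concat, w_t=0.4, w_c=0.6):
    """Local costs LC = w_t*C_t + w_c*C_c (Eq. (7)) along one candidate path."""
    costs = [w_t * target[0][path[0]]]  # assumption: no concatenation cost at i=0
    for i in range(1, len(path)):
        c_c = concat[i][path[i - 1]][path[i]]
        costs.append(w_t * target[i][path[i]] + w_c * c_c)
    return costs


def select_segments(target, concat, p, w_t=0.4, w_c=0.6) -> List[int]:
    """Viterbi search minimizing sum_i LC_i**p for a finite p >= 1.

    target[i][k]    : target cost of candidate k at position i
    concat[i][j][k] : concatenation cost between candidate j at position i-1
                      and candidate k at position i (concat[0] is unused)
    """
    best = [(w_t * t) ** p for t in target[0]]
    back = []
    for i in range(1, len(target)):
        new_best, pointers = [], []
        for k, t in enumerate(target[i]):
            scores = [best[j] + (w_t * t + w_c * concat[i][j][k]) ** p
                      for j in range(len(best))]
            j_min = min(range(len(scores)), key=scores.__getitem__)
            new_best.append(scores[j_min])
            pointers.append(j_min)
        best = new_best
        back.append(pointers)
    # Trace the best path back from the cheapest final candidate.
    k = min(range(len(best)), key=best.__getitem__)
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    return path[::-1]


# Hypothetical toy lattice: 3 target positions, 2 candidates each.
target = [[0.0, 0.75], [0.0, 0.0], [0.0, 0.0]]
concat = [
    [],                        # unused: no predecessor at position 0
    [[0.0, 2.0], [2.0, 0.6]],  # concat[1][j][k]
    [[1.5, 2.0], [2.0, 0.6]],  # concat[2][j][k]
]

for p in (1, 2):
    path = select_segments(target, concat, p)
    lc = local_costs_of(path, target, concat)
    print(p, path, [round(c, 2) for c in lc])
```

In this toy lattice the p = 1 search keeps candidate 0 throughout and accepts a single local cost of 0.9, whereas the p = 2 search switches to candidate 1 and spreads the degradation over three smaller local costs (0.30, 0.36, 0.36), even though their sum is larger.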

3. Perceptual evaluation of cost

3.1. Correspondence of cost to a perceptual score

We performed a perceptual test on the naturalness of synthetic speech. In order to select a proper set of test stimuli, a large number of utterances were synthesized by varying the corpus size from 0.5 to 32 h in 13 steps on a logarithmic scale. The average cost was used for segment selection. Each utterance consisted of a part of a sentence that was divided by pauses. We synthesized 14,926 utterances that were not included in the corpus, from which we selected a set of 140 stimuli so that the set covers a wide range in terms of both average cost and maximum cost. This selection was performed under the restriction that the number of phonemes in an utterance, the duration of an utterance, and the number of concatenations are roughly equal among the selected stimuli. Specifically, the following selection procedure was performed.

• We observed frequency distributions of the number of phonemes, the duration, and the number of concatenations for all synthetic samples. Then, we removed samples in which any of those features was outside the range from the first quartile (Q1) to the third quartile (Q3) of the individual distributions.
• We plotted a distribution of the average cost and the maximum cost for the remaining samples on a two-dimensional plane. Then, we divided the distribution into a dozen cells. Finally, we selected stimulus samples at random from each cell.

The distributions of the average cost and the maximum cost for all synthetic utterances and for the selected test stimuli are shown in Figs. 2 and 3, respectively.

Fig. 2. Distribution of average cost and maximum cost for all synthetic utterances. The correlation coefficient between the average cost and the maximum cost is 0.726.

Fig. 3. Scatter chart of selected test stimuli.
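The two-step stimulus selection above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary keys, the rectangular equal-width grid, and the parameters `n_bins` and `per_cell` are hypothetical, since the text specifies neither the cell layout nor how many samples were drawn per cell.

```python
import random
import statistics
from collections import defaultdict


def iqr_filter(samples, features):
    """Keep samples whose every listed feature lies within [Q1, Q3]."""
    bounds = {}
    for f in features:
        q1, _, q3 = statistics.quantiles([s[f] for s in samples], n=4)
        bounds[f] = (q1, q3)
    return [s for s in samples
            if all(bounds[f][0] <= s[f] <= bounds[f][1] for f in features)]


def sample_by_cell(samples, n_bins, per_cell, rng):
    """Grid the (average cost, maximum cost) plane and draw from each cell."""
    ac = [s["ac"] for s in samples]
    mc = [s["mc"] for s in samples]

    def bin_index(x, lo, hi):
        if hi == lo:
            return 0
        return min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)

    cells = defaultdict(list)
    for s in samples:
        key = (bin_index(s["ac"], min(ac), max(ac)),
               bin_index(s["mc"], min(mc), max(mc)))
        cells[key].append(s)

    chosen = []
    for key in sorted(cells):  # empty cells simply contribute nothing
        members = cells[key]
        chosen.extend(rng.sample(members, min(per_cell, len(members))))
    return chosen
```

Drawing a fixed number of samples per cell, rather than sampling the whole set uniformly, keeps sparsely populated regions of the cost plane represented in the stimulus set.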

Natural prosody and the mel-cepstrum sequences extracted from the original utterances were used as input information for segment selection to remove all factors affecting naturalness other than cost. In the waveform synthesis, signal processing for prosody modification was not performed except for power control. Eight listeners participated in the experiment. They evaluated the naturalness on a scale of seven levels, namely 1 (very bad) to 7 (very good). In order to equalize the score range among listeners, the perceptual score, here the MOS, was calculated as an average of the normalized score defined as a Z-score (mean = 0, variance = 1) for each listener.

Fig. 4 shows the correlation coefficient between the norm cost and the perceptual score as a function of the coefficient p. The average cost (p = 1) has better correspondence to the perceptual scores (correlation coefficient = 0.808) than does the maximum cost (correlation coefficient = 0.685). These results show that the naturalness of synthetic speech is better predicted by the degradation of naturalness over the entire synthetic utterance than by using only the local degradation of naturalness. Fig. 5 shows the relationship between the average cost and the perceptual score, and Fig. 6 shows the relationship between the maximum cost and the perceptual score.

Fig. 4. Correlation coefficient between norm cost and perceptual score as a function of the coefficient p.

Fig. 5. Relationship between average cost and perceptual score.

Fig. 6. Relationship between maximum cost and perceptual score.

When p is set to 2, the norm cost, called the root mean square (RMS) cost, has the best correspondence to the perceptual scores (correlation coefficient = 0.840). The RMS cost can simultaneously take account of large peaks and the overall bias of the local cost. The difference in the absolute values of the correlation coefficient between the RMS cost and the average cost is statistically significant (t = 2.4696, df = 137, p < 0.05). Fig. 7 shows the relationship between the RMS cost and the perceptual score.

Fig. 7. Correlation between RMS cost and perceptual score.

In order to clarify how well the RMS cost corresponds to the perceptual scores, we compared the correlation between the RMS cost and the perceptual scores of each listener with the inter-listener correlation, i.e., the correlation between the scores of each listener and the mean scores of the other seven listeners, which would imply a sort of upper bound on predicting the perceptual scores. The correlations for individual listeners are shown in Table 2. As a reference, we also show results for the average cost and the maximum cost. Although the performance of the RMS cost for predicting the perceptual scores does not reach that of the inter-listener correlation, again we can see that the RMS cost has the best performance among the norm costs.

Table 2
Comparison between inter-listener correlations and absolute values of correlation coefficients between three norm costs and the perceptual scores of individual listeners

Listener index   Average cost   Maximum cost   RMS cost   Inter-listener
1                0.725          0.669          0.778      0.854
2                0.682          0.647          0.731      0.852
3                0.696          0.520          0.694      0.826
4                0.679          0.665          0.742      0.823
5                0.697          0.588          0.721      0.821
6                0.765          0.573          0.759      0.810
7                0.723          0.592          0.745      0.803
8                0.552          0.425          0.572      0.666
Average          0.690          0.585          0.718      0.807

Even though the RMS cost is the best individual norm cost, it might still be possible for the perceptual scores to be more accurately represented by some combination of contributions from the various norm costs. In order to test this possibility, we also performed multiple linear regression analysis using the maximum cost and 10 kinds of norm cost, NC_1 through NC_10, as predictor variables. As a result, the correspondence to the perceptual scores was not improved statistically (multiple correlation coefficient = 0.846) compared with the correspondence of the RMS cost. We also checked the resulting weights and partial correlation coefficients for individual costs. We did not observe that the RMS cost made the largest contribution to the estimation of MOS. Because the weight values differed greatly from each other and varied over a large range, it is possible that this result was caused by over-training of the multiple linear regression model. Since the RMS cost is simpler to calculate than the multiple linear regression, we adopt the RMS cost as the best cost.

3.2. Correspondence of RMS cost to perceptual score in lower range of RMS cost

So far we have shown a correspondence between the RMS cost and the perceptual scores for a wide range of costs. However, in order to have the best chance of consistently synthesizing natural-sounding speech, our TTS system uses a large-sized corpus with a high coverage of both phonetic environments and prosody. Some systems, e.g., Chu et al. (2001), also use a large-sized corpus for segment selection. In the case of a large-sized corpus, the RMS costs are expected to be distributed not over a wide range but in a lower range because segments causing only a slight degradation of naturalness can usually be selected.
If we are to eventually use cost functions to estimate the naturalness of synthetic speech in the lower range, that is, speech which may be presumed to be already fairly natural, it will be necessary to have cost functions which are sensitive to finer distinctions among candidate segments. Thus, it is worthwhile to investigate the correspondence to the perceptual scores in a range of lower RMS costs.

For this purpose we performed an opinion test on the naturalness of the synthetic speech. Utterances whose RMS costs were less than 0.4 were selected as test stimuli from a large number of utterances synthesized with the RMS cost by varying the corpus size. The number of phonemes in an utterance, the duration of an utterance, and the number of concatenations were restricted as in the previous experiment. The number of selected stimuli was 160. Eight Japanese listeners participated in the experiment. They evaluated the naturalness on a scale of seven levels. We asked listeners to use scores as widely as possible. The normalized perceptual score was calculated in the same way as described in Section 3.1. Please note that the normalized score in this test does not correspond to that in the previous test.

The relationship between the RMS cost and the perceptual scores is shown in Fig. 8. The correspondence is much worse (correlation coefficient = 0.400) than that in the case of using stimuli that cover a wide range of the cost (correlation coefficient = 0.840). Because the quality differences among the synthetic stimuli are much smaller than those among the stimuli used in the previous test, the variation of perceptual scores among individual listeners becomes large (inter-listener correlation = 0.708). Although predicting perceptual scores in this case is apparently a difficult task, the resolution of the RMS cost is clearly still insufficient compared with the inter-listener correlation. Therefore, we should further improve the cost function.

Fig. 8. Relationship between RMS cost and perceptual score in a lower range of RMS cost.

3.3. Preference test on naturalness of synthetic speech

In order to clarify which of the average cost and the RMS cost could select better segment sequences, we performed a preference test on the naturalness of synthetic speech using stimulus pairs synthesized from two segment sequences for the same target utterance, with one sequence chosen to minimize the average cost and the other to minimize the RMS cost. The corpus size was 32 h, and utterances used as test stimuli were not included in the corpus. Natural prosody and the mel-cepstrum sequences extracted from the original utterances were used as input information for segment selection. Signal processing for prosody modification was not performed except for power control.

It is possible to make the preference test more sensitive by using stimulus pairs with large differences in some feature. Therefore, for the listening test we used pairs of segment sequences that had greater cost differences. We selected the pairs with larger differences in RMS cost between the two segment sequences. Moreover, in order to fairly compare the performance of these two costs, we also selected the pairs with larger differences in the average cost. Consequently, we constructed a test set consisting of two sub-sets, a sub-set A including pairs with larger differences in the RMS cost and a sub-set B including pairs with larger differences in the average cost. There were 20 pairs in each sub-set, and the total number of pairs was 35, since 5 pairs were included in both sub-sets. Eight Japanese listeners participated in the experiment. In each trial, stimuli were presented in random order, and listeners were asked to choose which of the two types of synthetic speech sounded more natural.

Fig. 9. Preference score of RMS cost. ‘‘Sub-set A’’ includes stimulus pairs with large differences in the RMS cost and ‘‘Subset B’’ includes those with large differences in the average cost. ‘‘All stimuli’’ includes all stimulus pairs in both sub-sets.

The results in Fig. 9 show that segment selection based on the RMS cost can synthesize speech more naturally than that based on the average cost in all cases: using all stimuli, stimuli in sub-set A only, and stimuli in sub-set B only. Although the differences are statistically significant, they are marginal in practice.
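The score normalization and correlation analysis used in Sections 3.1 and 3.2 can be sketched as below. The function names and the example ratings are hypothetical, and the use of the population standard deviation in the Z-score is an assumption, since the text only states that each listener's scores are normalized to mean 0 and variance 1.

```python
import math
import statistics


def z_normalize(scores):
    """Map one listener's raw scores to mean 0 and variance 1."""
    mu = statistics.fmean(scores)
    sd = statistics.pstdev(scores)  # assumes the listener did not rate everything equally
    return [(s - mu) / sd for s in scores]


def mos(ratings):
    """ratings[l][s]: raw score of listener l for stimulus s.

    Returns the normalized MOS of each stimulus, i.e. the mean Z-score
    over listeners.
    """
    normalized = [z_normalize(r) for r in ratings]
    return [statistics.fmean(col) for col in zip(*normalized)]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def inter_listener(ratings, listener):
    """Correlation between one listener and the mean of all other listeners."""
    normalized = [z_normalize(r) for r in ratings]
    others = [r for i, r in enumerate(normalized) if i != listener]
    mean_others = [statistics.fmean(col) for col in zip(*others)]
    return pearson(normalized[listener], mean_others)


# Hypothetical data: 3 listeners, 4 stimuli, 7-level scale.
ratings = [[1, 4, 7, 2], [2, 5, 6, 3], [1, 3, 7, 4]]
rms_costs = [0.9, 0.5, 0.2, 0.7]
print(pearson(rms_costs, mos(ratings)))
print([inter_listener(ratings, l) for l in range(len(ratings))])
```

A cost that tracks degradation should correlate negatively with the MOS, which is why the comparisons in Table 2 are stated in terms of absolute values.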

4. Segment selection based on RMS cost

In order to clarify the effects caused by using the RMS cost, we compared segment selection based on the RMS cost with that based on the average cost. For this we used an evaluation set consisting of 1131 utterances not included in the corpus used for segment selection.

4.1. Effect of RMS cost on local costs, and target/concatenation costs

Fig. 10 shows an example of the local costs of a segment sequence selected by the average cost and those of another segment sequence selected by the RMS cost. Some large local costs surrounded by circles are found in the case of the average cost. On the other hand, such large local costs are alleviated in the case of the RMS cost.

Fig. 10. Examples of local costs of segment sequences selected by the average cost and by the RMS cost. "Av." and "RMS" show the average and the root mean square of local costs, respectively. Phonemes /sil/ and /pau/ show a silence before an utterance and a pause in an utterance, respectively, /oo/ shows a long vowel. /I/ and /U/ show unvoiced vowels. /N/ shows a syllabic nasal. This Japanese phoneme sequence means "something felt like larger than myself is . . .".

We statistically investigated the effect of the RMS cost on the local cost using a number of synthetic samples. Fig. 11 shows frequency distributions of the local cost. The corpus size is 32 h. By the definition of the average cost, it is reasonable that the mean value of the local cost for sequences selected with minimum average cost is smaller than that for those selected with the other criterion, including minimum RMS cost. Although the argument is a little more complicated, it is also reasonable that using the RMS cost for selecting sequences makes the standard deviation of the local cost small. From the figure, it is observed that the frequency of selecting sequences with large local costs decreases by using the RMS cost. This is a consequence of the large penalty imposed on a segment with a large local cost in the case of the RMS cost.

Fig. 11. Frequency distributions of local cost. "Av." and "S.D." show the mean and standard deviation, respectively.

In order to clarify the relative contributions of the target cost and the concatenation cost to the decrease of the frequency of selecting sequences with large local costs, we investigated the effects of the RMS cost on each cost. Frequency distributions of the target cost are shown in Fig. 12 and those of the concatenation cost are shown in Fig. 13. The corpus size is 32 h. The mean of the target cost is degraded, and the standard deviation increases slightly by using the RMS cost. On the other hand, as for the concatenation cost, the mean value decreases slightly and the standard deviation becomes smaller by using the RMS cost. In particular, we can see that the RMS cost causes a decrease of the number of times that sequences with large concatenation costs are selected. These results demonstrate that the effectiveness of decreasing the frequency of selecting sequences with large local costs is attributable to the concatenation cost. On the other hand, the mean of the local cost is slightly worse as a consequence of the degradation of the target cost. The tendencies described here were observed for every corpus size.

Fig. 12. Frequency distributions of target cost. "Av." and "S.D." show the mean and standard deviation, respectively.

Fig. 13. Frequency distributions of concatenation cost. "Av." and "S.D." show the mean and standard deviation, respectively.

However, it might be assumed that these results would be influenced by the weights for sub-costs because we used a weight set in which the weight for the target cost was smaller than that for the concatenation cost, i.e., w_t = 0.4, w_c = 0.6 in Eq. (7). Therefore, in order to analyze the effects of weights, we also compared the frequency distributions of the local, target, and concatenation costs when the ratios of the target cost to the concatenation cost were set to 1:2, 1:1.5, 1:1, 1.5:1, and 2:1. We obtained the same results for all weight sets. Therefore, the effectiveness mentioned above depends not on the weights for sub-costs but on the function used to integrate the local costs.

4.2. Effect of RMS cost on segment size and concatenation points

In order to clarify the mechanism that causes the decrease of the frequency of selecting sequences with large concatenation cost, we analyzed characteristics of the segments selected by minimizing the RMS cost. The mean segment length in the number of syllables is shown in Fig. 14 as a function of corpus size. In calculating the mean segment length, the number of syllables for half-phonemes in a diphone unit is set to 0.5 when the syllable is V, or 0.25 when the syllable is comprised of CV. The mean segment length is shorter in the selection with the RMS cost compared with that in the case of the average cost, while the standard deviation is nearly equal.
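The segment-length bookkeeping described above can be sketched as follows. The `Unit` record and its fields are hypothetical: the text only states the syllable weights for half-phonemes (0.5 for a V syllable, 0.25 for a CV syllable), so the sketch assumes that a whole syllable unit counts as 1 and that a segment is a maximal run of selected units that are contiguous in the corpus.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Unit:
    kind: str             # "syllable" or "half" (half-phoneme of a diphone unit)
    syllable_type: str    # "V" or "CV"
    joined_to_prev: bool  # True if contiguous with the previous unit in the corpus


def unit_length(u: Unit) -> float:
    """Length of one selected unit in syllables."""
    if u.kind == "syllable":
        return 1.0
    return 0.5 if u.syllable_type == "V" else 0.25


def segment_lengths(units: List[Unit]) -> List[float]:
    """Accumulate unit lengths over runs without a concatenation point."""
    lengths: List[float] = []
    for u in units:
        if lengths and u.joined_to_prev:
            lengths[-1] += unit_length(u)   # same corpus segment continues
        else:
            lengths.append(unit_length(u))  # a concatenation starts a new segment
    return lengths


# Hypothetical selection result with two concatenation points.
units = [
    Unit("syllable", "CV", False),
    Unit("syllable", "CV", True),
    Unit("half", "V", False),
    Unit("half", "V", True),
    Unit("syllable", "CV", False),
]
lengths = segment_lengths(units)
print(lengths, sum(lengths) / len(lengths))
```

Here the five units form three segments, so every additional concatenation point lowers the mean segment length even though the utterance itself is unchanged.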
Moreover, it can be seen that the mean segment length is unexpectedly short, less than 1.35 syllables even when the corpus size is set to more than 10 h. It is reasonable to assume that using short segments is useful for reducing

Fig. 14. Mean segment length in the number of syllables as a function of corpus size.

the variance of the local cost because short segments cause an increase in the number of candidate segments. In general, segment selection does not always have to select the longest segments, since such segments often cause large peaks in the local cost due both to a decrease in the number of allowable concatenation points and to an increase of an error in prosodic parameters. Fig. 14 also shows that the mean segment length increases with corpus size up to 20 h, but starts to decrease beyond this point. This result is caused by the pruning of candidate segments that we use in order to reduce the computational complexity of segment selection. We perform the pruning process, called pre-selection (Conkie et al., 2000), by using the target cost and the mismatch of phonetic environments without taking into account whether segments are connected in the corpus. Therefore, when we use a large-sized corpus that includes many candidate segments with the desired target phonetic environments, the remaining candidate segments are not always connected in the corpus even if these segments have appropriate phonetic environments. Furthermore, in order to clarify which types of concatenation are preferred to when using the RMS cost, we show the concatenation cost and the number of concatenations in each type of concatenation in Fig. 15. The corpus size is 32 h. The concatenation between any phoneme and an unvoiced consonant (‘‘*-unvoiced consonant’’) can often reduce the concatenation cost compared


Fig. 15. Concatenation cost and the number of concatenations in each type of concatenation. Mean and standard deviation are shown. Note that zero concatenation costs in the case of not performing concatenation at segment boundaries are not shown here.

with that between any phoneme and a voiced consonant (“*-voiced consonant”) because the former type of concatenation has no discontinuity caused by concatenating F0 contours at a segment boundary. It is also found that concatenation at the phoneme center tends to reduce the concatenation cost compared with the other types of concatenation, although such a concatenation is not frequent in our system. We can see that the RMS cost does not increase the number of concatenations for every type of concatenation. The types of concatenation that often reduce the concatenation cost are strongly encouraged, while the types of concatenation that tend to cause a large concatenation cost are discouraged. These results show that the RMS cost tends to suppress audible discontinuity by increasing the number of segments. Consequently, a larger number of shorter segments, which cause only slight local degradation, are selected by using the RMS cost. These results were also observed for every corpus size.

4.3. Estimating MOS from cost

We found in Section 3 that the RMS cost has good correspondence to the perceptual score, MOS, as long as a wide range of cost values is involved. This good correspondence makes it possible to estimate the MOS (Chu and Peng, 2001)


Fig. 16. Estimated perceptual score as a function of corpus size.

more accurately from the RMS cost than from the average cost. We converted the RMS cost into a perceptual score by using the regression line on the RMS cost shown in Fig. 7. The estimated perceptual score is shown in Fig. 16 as a function of corpus size. As the corpus size becomes larger, the estimated perceptual score is higher and its standard deviation is smaller. We can see that this improvement in the score is small but not saturated even when the corpus size is close to 30 h.
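The conversion from cost to score can be sketched as a simple linear regression. The snippet below is only an illustration under stated assumptions: the (cost, MOS) pairs are invented for the example, the helper names `fit_line` and `estimate_mos` are hypothetical, and the actual regression line of Fig. 7 is not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_mos(rms_cost, slope, intercept):
    """Map an RMS cost to an estimated perceptual score via the fitted line."""
    return slope * rms_cost + intercept


# Invented calibration points (RMS cost, MOS); NOT values from the paper.
rms_costs = [0.2, 0.4, 0.6, 0.8]
mos_scores = [4.0, 3.5, 3.0, 2.5]

slope, intercept = fit_line(rms_costs, mos_scores)

# A lower RMS cost maps to a higher estimated score because the slope is negative.
print(estimate_mos(0.5, slope, intercept))
```

Once the line has been fitted on utterances with known perceptual scores, the same mapping can be applied to the RMS cost of any selected segment sequence, which is how a score can be estimated for each corpus size without running a new listening test.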

5. Summary

In this paper, we evaluated the costs for segment selection on the basis of a comparison with perceptual scores. As a result, we clarified that the average cost, which captures the total degradation, has a better correspondence to the perceptual scores than the maximum cost, which captures the worst local degradation. Furthermore, we found that the root mean square (RMS) cost, which takes into account both the average cost and the maximum cost, has the best correspondence. We also clarified that the naturalness of synthetic speech could be slightly improved by using the RMS cost. We analyzed segment selection based on the RMS cost. From the results of experiments comparing this approach with segment selection based on the conventional average cost, it was found that segment selection based on the RMS cost performed a larger number of concatenations that



caused slight local degradation in order to avoid concatenations that would cause greater local degradation. In other words, the RMS cost tended to select a larger number of shorter segments. We also performed a perceptual evaluation of the RMS cost in a lower range of the RMS cost. The results clarified that the correspondence of the RMS cost to the perceptual scores is insufficient in this case. That is, the RMS cost is still not accurate enough for making comparisons between similar segment sequences, which is naturally a difficult problem. Therefore, we should further improve the cost function based on perceptual characteristics. In particular, it is necessary to determine the optimum weight set for the sub-costs. We will determine this weight set from the results of perceptual experiments on the naturalness of synthetic speech with a set of stimuli covering a wide range in terms of individual sub-costs.
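The trade-off summarized above can be seen in a small numerical sketch. The two candidate sequences and their local costs below are invented for illustration only, and the function names are hypothetical. Both candidates cover the same four target units, so the costs are averaged over the same number of local costs.

```python
import math


def average_cost(local_costs):
    """Mean of the local costs over the utterance."""
    return sum(local_costs) / len(local_costs)


def rms_cost(local_costs):
    """Root mean square of the local costs; large peaks are penalized more."""
    return math.sqrt(sum(c * c for c in local_costs) / len(local_costs))


# Invented local costs for two candidate segment sequences (not from the paper).
candidates = {
    # Long segments: few concatenations, but one causes a large local peak.
    "few_large": [0.1, 0.1, 0.9, 0.1],
    # Short segments: more concatenations, each with slight degradation.
    "many_small": [0.35, 0.35, 0.35, 0.35],
}

best_by_average = min(candidates, key=lambda name: average_cost(candidates[name]))
best_by_rms = min(candidates, key=lambda name: rms_cost(candidates[name]))

print("average cost selects:", best_by_average)
print("RMS cost selects:", best_by_rms)
```

In this toy case the average cost favors the sequence with the single large peak, because its mean is lower, whereas the RMS cost favors the sequence with uniformly small costs, because squaring makes the peak dominate.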

Acknowledgment

The research reported here was supported in part by a contract with the National Institute of Information and Communications Technology entitled “A study of speech dialogue translation technology based on a large corpus”. The authors are grateful to Dr. Hiroshi Saruwatari of Nara Institute of Science and Technology for useful discussions on the norm.

References

Campbell, W.N., Black, A.W., 1997. Prosody and the selection of source units for concatenative synthesis. In: Van Santen, J.P.H., Sproat, R.W., Olive, J.P., Hirschberg, J. (Eds.), Progress in Speech Synthesis. Springer, New York, pp. 279–292.
Chu, M., Peng, H., 2001. An objective measure for estimating MOS of synthesized speech. In: Proc. EUROSPEECH, Aalborg, Denmark, September 2001. pp. 2087–2090.
Chu, M., Peng, H., Yang, H., Chang, E., 2001. Selecting non-uniform units from a very large corpus for concatenative speech synthesizer. In: Proc. ICASSP, Salt Lake City, USA, May 2001. pp. 785–788.
Conkie, A., 1999. Robust unit selection system for speech synthesis. Joint Meeting of ASA, EAA, and DAGA, Berlin, Germany, March 1999. Available from: .
Conkie, A., Beutnagel, M., Syrdal, A.K., Brown, P.E., 2000. Preselection of candidate units in a unit selection-based text-to-speech synthesis system. In: Proc. ICSLP, Vol. 3, Beijing, China, October 2000. pp. 279–282.
Ding, W., Fujisawa, K., Campbell, N., 1998. Improving speech synthesis of CHATR using a perceptual discontinuity function and constraints of prosodic modification. In: Proc. 3rd ESCA/COCOSDA Internat. Workshop on Speech Synthesis, Jenolan Caves, Australia, November 1998. pp. 191–194.
Hunt, A.J., Black, A.W., 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In: Proc. ICASSP, Atlanta, USA, May 1996. pp. 373–376.
Kawahara, H., Masuda-Katsuse, I., de Cheveigné, A., 1999. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Comm. 27 (3–4), 187–207.
Kawai, H., Tsuzaki, M., 2002. Acoustic measures vs. phonetic features as predictors of audible discontinuity in concatenative speech synthesis. In: Proc. ICSLP, Denver, USA, September 2002. pp. 2621–2624.
Klabbers, E., Veldhuis, R., 2001. Reducing audible spectral discontinuities. IEEE Trans. Speech Audio Process. 9 (1), 39–51.
Lee, M., 2001. Perceptual cost functions for unit searching in large corpus-based concatenative text-to-speech. In: Proc. EUROSPEECH, Aalborg, Denmark, September 2001. pp. 2227–2230.
Peng, H., Zhao, Y., Chu, M., 2002. Perceptually optimizing the cost function for unit selection in a TTS system with one single run of MOS evaluation. In: Proc. ICSLP, Denver, USA, September 2002. pp. 2613–2616.
Sagisaka, Y., 1988. Speech synthesis by rule using an optimal selection of non-uniform synthesis units. In: Proc. ICASSP, New York, USA, April 1988. pp. 679–682.
Stylianou, Y., Syrdal, A.K., 2001. Perceptual and objective detection of discontinuities in concatenative speech synthesis. In: Proc. ICASSP, Salt Lake City, USA, May 2001. pp. 837–840.
Syrdal, A.K., Wightman, C.W., Conkie, A., Stylianou, Y., Beutnagel, M., Schroeter, J., Strom, V., Lee, K-S., Makashay, M.J., 2000. Corpus-based techniques in the AT&T NextGen synthesis system. In: Proc. ICSLP, Vol. 3, Beijing, China, October 2000. pp. 410–415.
Toda, T., Kawai, H., Tsuzaki, M., Shikano, K., 2002. Unit selection algorithm for Japanese speech synthesis based on both phoneme unit and diphone unit. In: Proc. ICASSP, Orlando, USA, May 2002. pp. 465–468.
Wouters, J., Macon, M.W., 1998. A perceptual evaluation of distance measures for concatenative speech synthesis. In: Proc. ICSLP, Sydney, Australia, December 1998. pp. 2747–2750.
