UNIT SELECTION SPEECH SYNTHESIS USING MULTIPLE SPEECH UNITS AT NON-ADJACENT SEGMENTS FOR PROSODY AND WAVEFORM GENERATION

Masatsune Tamura, Norbert Braunschweiler†, Takehiko Kagoshima, and Masami Akamine

Corporate Research and Development Center, Toshiba Corporation, 1, Komukai Toshiba-cho, Saiwai-ku, Kawasaki, 212-8582, Japan
† Toshiba Research Europe Limited, Cambridge Research Laboratory, 208 Cambridge Science Park, Milton Road, Cambridge, CB4 0GZ, UK

ABSTRACT

In this paper, we propose a speech synthesis method that combines a natural-waveform concatenation based speech synthesis method with our baseline plural unit selection and fusion method. The two main features of the proposed method are (i) prosody regeneration from the selected speech units and (ii) the use of multiple speech units at non-adjacent segments. A non-adjacent segment is a segment whose previous or following speech unit in the optimum speech unit sequence is not adjacent to it in the database. By using the prosody of the selected speech units, the original prosodic expression and sound of the recorded speech are retained, while discontinuities are reduced by using multiple speech units at the non-adjacent segments. MOS evaluations showed that the proposed method provides a clear improvement over the conventional unit selection method and our baseline method.

Index Terms— concatenative speech synthesis, unit selection, prosody generation, unit fusion

1. INTRODUCTION

Corpus-based speech synthesis [1]-[3] using unit selection and concatenation is one of the most widely used speech synthesis methods for both academic and commercial purposes. Some of these systems use the waveforms of the selected speech units without prosodic modification, while others generate waveforms by modifying the fundamental frequency (f0) contours and durations of the selected speech units according to the input prosody. With the former method, the quality of the synthetic speech is close to that of natural speech. However, even with very large speech databases, some discontinuity between speech units, or unnatural prosody caused by the mismatch between the target and the selected speech unit, is inevitable. The method that modifies f0 and duration can generate precise prosody for the input accents and intonation, but unnatural, robotic prosody or degradation of speech quality caused by the prosodic modification can occur.

We have previously proposed a speech synthesis method using the plural unit selection and fusion method [4][5]. This synthesis method modifies f0 and phone durations. The f0 contours used in the method are generated by an f0 codebook method [6], and the phone durations are predicted by a multiple linear regression method [7]. To reduce the degradation caused by
prosodic modification and discontinuity, this method selects multiple speech units for each half-phone segment and then generates the waveform from a speech unit that represents the selected multiple units, obtained by averaging their pitch-cycle waveforms. Although this method generates stable and human-like speech, it still lacks naturalness compared with recorded speech.

In this paper, we propose a speech synthesis method that combines the two types of speech synthesis methods just described. The two main features of the proposed method are (i) prosody regeneration from the selected speech units and (ii) the use of multiple speech units at non-adjacent segments. The input to the system is a phoneme sequence and prosody, from which the optimum speech unit sequence is selected using target costs and concatenation costs. If the previous and following speech units for a segment in the optimum speech unit sequence are also adjacent to it in the original speech utterance in the database, a single optimum speech unit is used for waveform generation. Otherwise, when the segment is discontinuous in the database (a non-adjacent segment), multiple speech units are selected and used for waveform generation. This procedure enables smooth concatenation at the non-adjacent segments while retaining the original speech quality for the other segments. The prosody is then regenerated from the selected single or multiple speech units to retain the prosodic expression of the original speech. Finally, the prosody of the speech units is modified according to the regenerated prosody, and the generated speech units are concatenated to synthesize speech. This paper gives an overview of the system (section 2), describes the proposed method (section 3), presents experimental results (section 4), and concludes (section 5).

2. THE SPEECH SYNTHESIS SYSTEM

Figure 1 depicts the proposed speech synthesis system. The speech unit database contains, for each speech unit, the waveform for the segment, pitch marks, prosodic attributes, phonetic context attributes, and grammatical context attributes. Adjacency information that points to the previous and following adjacent speech units in the original speech utterance is also included in the database. We use the half-phone as the smallest speech unit.
Fig. 1. Speech synthesis system: the phoneme sequence and prosody enter unit selection (optimum speech unit sequence search; for each segment, non-adjacent segment? yes: multiple speech unit selection, no: single speech unit), followed by prosody regeneration (prosody fusion or unit prosody, then prosody concatenation), waveform generation (waveform fusion or unit waveform), and unit concatenation into synthetic speech.

Fig. 2. Unit selection based on adjacent speech units: the word "valley" built from half-phones v-L, v-R, ae-L, ae-R (S01), l-L, l-R (S02), and ii-L, ii-R (S03), with multiple units drawn from S04-S10 at the non-adjacent segments (S01: Valerie's, S02: challenge, S03: early).
The phoneme sequence and the prosody generated in the prosody generation modules, along with attribute information for unit selection, are given as input to the system. In the unit selection part, speech units are selected for each half-phone segment. First, the optimum speech unit sequence is selected using a cost function consisting of target costs and concatenation costs. The target cost is the weighted sum of the f0 target cost, the duration target cost, the phonetic context cost, and the grammatical context cost. The concatenation cost consists of the f0 concatenation cost, the spectrum concatenation cost, the power concatenation cost, and the adjacency cost (set to 0 when two consecutive units adjoin in the speech unit database, otherwise 1). The weights of the cost function are determined manually.

For each segment, it is then checked whether it is a non-adjacent segment. This is done by checking, using the adjacency information, whether the neighbouring speech units for the segment also adjoin it in the original speech utterance in the database. If the speech unit selected for the previous or following segment is not adjacent in the original speech utterance, the segment is identified as non-adjacent. For a non-adjacent segment, multiple speech units are selected for each half-phone by taking the n-best speech units under a cost function consisting of the concatenation cost from the previous optimum speech unit, the concatenation cost to the following optimum speech unit, and the target cost. For units that are adjacent to both their previous and following units in the original speech corpus, the single optimum speech unit is used.
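To make the selection flow concrete, the following is a minimal Python sketch of the adjacency test and the n-best selection for non-adjacent segments. The Unit class, the cost callables, and the candidate inventory are illustrative stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    utterance: str  # source utterance id in the database, e.g. "S01"
    index: int      # position of the half-phone within that utterance

def is_adjacent(a, b):
    """True if unit b directly follows unit a in the original recording."""
    return a.utterance == b.utterance and b.index == a.index + 1

def mark_non_adjacent(seq):
    """Flag segments whose join to the previous or following optimum
    unit is discontinuous in the database (non-adjacent segments)."""
    flags = []
    for i, u in enumerate(seq):
        left_break = i > 0 and not is_adjacent(seq[i - 1], u)
        right_break = i + 1 < len(seq) and not is_adjacent(u, seq[i + 1])
        flags.append(left_break or right_break)
    return flags

def select_n_best(candidates, prev_opt, next_opt, target,
                  concat_cost, target_cost, n=10):
    """For a non-adjacent segment, rank candidates by the concatenation
    costs to the fixed neighbouring optimum units plus the target cost."""
    def cost(c):
        return (concat_cost(prev_opt, c)
                + concat_cost(c, next_opt)
                + target_cost(c, target))
    return sorted(candidates, key=cost)[:n]
```

The default n=10 here matches the number of units per segment used in the experiments of section 4.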
Then, in the prosody regeneration part, the phone durations and the f0 contour are regenerated from the selected speech units. For a non-adjacent segment, the phone durations and f0 contours of the selected multiple speech units are averaged to form the regenerated prosody (prosody fusion). For the other segments, the duration and f0 contour of the selected optimum speech unit are used. The averaged f0 contours and the speech units' f0 contours are smoothed and concatenated to generate an f0 contour for the input sentence. The f0 smoothing is done by offset addition, linear interpolation, and spline smoothing. The offset addition shifts an f0 contour to reduce the difference at the boundary.

In the waveform generation part, pitch-cycle waveforms generated from the selected multiple speech units are used in the non-adjacent segments. In the other segments, pitch-cycle waveforms extracted from the single speech unit are used. The pitch-cycle waveforms for a non-adjacent segment are generated by extracting sub-band waveforms (by applying band-pass filters to the pitch-cycle waveforms extracted from each of the multiple speech units), aligning the sub-band waveforms by searching for the time lag of maximum correlation, averaging the aligned sub-band waveforms, and summing the averages into a whole-band pitch-cycle waveform, as in the baseline method (waveform fusion). By overlap-adding the generated or extracted pitch-cycle waveforms, the waveforms for the segments are generated. Finally, the speech waveform is synthesized by concatenating these waveforms.

3. MULTIPLE UNIT SELECTION AND PROSODY REGENERATION

3.1. Unit selection based on adjacent speech units

Figure 2 illustrates the proposed unit selection method. It selects multiple speech units for non-adjacent segments and uses the single speech unit from the optimum speech unit sequence for the other segments. In Figure 2, the word "valley" is synthesized by the proposed algorithm. The speech unit database includes the utterances S01 to S10. In the optimum speech unit sequence, the speech units "v-L", "v-R", "ae-L", and "ae-R" are selected from S01, "l-L" and "l-R" from S02, and "ii-L" and "ii-R" from S03. "-L" and "-R" denote the left and right half-phones, respectively. In this result, the "ae-R", "l-L", "l-R", and "ii-L" speech units have neighbouring speech units that do not adjoin them in the source utterances in the database. For these non-adjacent segments, multiple speech units are selected from the other sentences.
Fig. 3. Regenerated prosody example: "Fifty miles from previous destination."

Prosody and waveforms for the non-adjacent segments are generated from these units. For the "v-L", "v-R", "ae-L", and "ii-R" segments, the prosody and waveforms of the optimum speech units are used. In this example, the number of multiple units per non-adjacent segment is set to 3. The waveform for the whole sentence is generated by concatenating the generated waveforms.
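The following is a rough Python sketch of the waveform fusion step described in section 2, assuming NumPy/SciPy and equal-length pitch-cycle waveforms. The band edges, filter order, and lag search range are illustrative values, not the paper's.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def sub_bands(x, fs, edges=(0.0, 1000.0, 3000.0, 8000.0)):
    """Split one pitch-cycle waveform into band-passed components
    (assumes fs >= 16 kHz so the top edge stays below Nyquist, and
    cycles long enough for filtfilt's edge padding)."""
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0.0:
            b, a = butter(4, hi / (fs / 2), btype="low")
        else:
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        bands.append(filtfilt(b, a, x))
    return bands

def align(ref, x, max_lag=20):
    """Shift x by the time lag that maximizes correlation with ref."""
    best = max(range(-max_lag, max_lag + 1),
               key=lambda lag: float(np.dot(ref, np.roll(x, lag))))
    return np.roll(x, best)

def fuse_pitch_cycles(cycles, fs):
    """Average aligned sub-band waveforms across the N selected units,
    then sum the averages back into one whole-band pitch-cycle."""
    per_unit = [sub_bands(c, fs) for c in cycles]
    fused = np.zeros_like(cycles[0])
    for k in range(len(per_unit[0])):              # for each band
        ref = per_unit[0][k]
        aligned = [align(ref, bands[k]) for bands in per_unit]
        fused += np.mean(aligned, axis=0)
    return fused
```

Aligning per band before averaging avoids the phase cancellation that would occur if raw waveforms with slightly different fine structure were averaged directly.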
3.2. Prosody regeneration using selected speech units

In the prosody regeneration part, durations and f0 contours are generated from the selected speech units. For the non-adjacent segments, the durations and f0 contours are generated from the selected multiple speech units. For the other segments, the duration and f0 of the selected single speech unit are used. Let N be the number of speech units selected for a non-adjacent segment. The duration for the non-adjacent segment is the average of the durations of the selected units,

$$d^i_{syn} = \frac{1}{N} \sum_{n=1}^{N} d^i_n,$$

where $d^i_{syn}$ and $d^i_n$ represent the generated duration and the duration of the $n$-th selected speech unit for the $i$-th segment, respectively. The f0 contour for the segment is generated by mapping each unit's f0 frames to the frames of the generated unit and averaging the mapped f0 values for each frame. That is, the f0 contour is calculated by

$$f^i_{0,syn}(t) = \frac{1}{N} \sum_{n=1}^{N} f^i_{0,n}\!\left(t \cdot \frac{d^i_n}{d^i_{syn}}\right), \qquad (1)$$

where $f^i_{0,syn}(t)$ and $f^i_{0,n}(t)$ represent the generated f0 and the f0 of the $n$-th selected speech unit for the $i$-th segment at time $t$, respectively. Even when the averaged f0 contour is used, a trembling f0 caused by discontinuities between the left and right half-phones, especially in vowels, can remain. To reduce this trembling sound, f0 smoothing by offset addition, interpolation, and spline smoothing is applied. Offset addition adds an offset value to the f0 contours of the speech units. It is applied when adjacent speech units are not consecutive in the database in the middle of a voiced phoneme. In this case, $\hat{f}^i_{0,syn}(t)$ for the $i$-th segment is generated by

$$\hat{f}^i_{0,syn}(t) = f^i_{0,syn}(t) + \mathrm{offset}^i. \qquad (2)$$
The average of the f0 value at the end point of the left half-phone, $f^i_{0,syn}(T)$, and the f0 value at the start point of the right half-phone, $f^{i+1}_{0,syn}(0)$, is calculated. The offset value $\mathrm{offset}^i$ is determined as follows for the left half-phone:
$$\mathrm{offset}^i = \frac{1}{2}\left(f^i_{0,syn}(T) + f^{i+1}_{0,syn}(0)\right) - f^i_{0,syn}(T), \qquad (3)$$

where $f^i_{0,syn}(T)$ represents the f0 value at the end point of the left half-phone and $f^{i+1}_{0,syn}(0)$ the f0 value at the start point of the right half-phone. For the right half-phone, it is calculated by

$$\mathrm{offset}^i = \frac{1}{2}\left(f^{i-1}_{0,syn}(T) + f^i_{0,syn}(0)\right) - f^i_{0,syn}(0). \qquad (4)$$

f0 smoothing by interpolation is also applied to the f0 contours. The smoothed value is a weighted sum of the end-point or start-point f0 of the adjacent unit and the f0 contour of the current speech unit, with the weight gradually decreasing away from the end or start point of the speech unit. Finally, spline smoothing is applied, which generates a smooth f0 contour by minimizing the error function

$$E = p\,\|\mathbf{y} - \mathbf{s}\|^2 + (1 - p)\,\mathbf{s}^{\top} D_k^{\top} D_k \mathbf{s}, \qquad (5)$$

where $\mathbf{y}$, $\mathbf{s}$, $p$, and $D_k$ represent the input f0 contour, the smoothed f0 contour, the smoothing parameter, and a matrix that gives the $k$-th order difference, respectively.

Figure 3 shows an example of f0 generation from the selected units. The text is "Fifty miles from previous destination.", and the f0 contour for each half-phone in the voiced phonemes is shown as segmented lines. Since the durations differ, the f0 contours are drawn aligned at the head of each phoneme segment. The dashed line represents the f0 contour generated by the f0 codebook method and the solid line the output f0 generated from the speech units. The figure shows that the f0 range becomes larger than that of the codebook-generated f0, a difference especially visible at the end of the sentence. Listening to this example, the input f0 is too low at the end of the sentence, whereas the end-of-sentence f0 contour generated by the proposed method is appropriate. As a result, the proposed method can generate a more natural f0 contour.
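Assuming frame-indexed f0 contours, equations (1)-(5) can be sketched in Python as follows. The nearest-neighbour frame mapping and the smoothing parameter p are illustrative choices, not the paper's settings.

```python
import numpy as np

def fuse_duration(durs):
    """d_syn^i: average of the N selected units' durations."""
    return float(np.mean(durs))

def fuse_f0(f0_curves, durs, n_frames):
    """Eq. (1): warp each unit's f0 onto the fused time axis (nearest
    frame) and average across the N units."""
    d_syn = fuse_duration(durs)
    t = np.arange(n_frames)
    mapped = []
    for f0, d in zip(f0_curves, durs):
        idx = np.minimum((t * d / d_syn).astype(int), len(f0) - 1)
        mapped.append(np.asarray(f0)[idx])
    return np.mean(mapped, axis=0)

def add_offset_left(f0_left, f0_right):
    """Eqs. (2)-(3): shift the left half-phone so the boundary meets
    the midpoint of the two contours; eq. (4) is the mirror image."""
    offset = 0.5 * (f0_left[-1] + f0_right[0]) - f0_left[-1]
    return f0_left + offset

def spline_smooth(y, p=0.98, k=2):
    """Eq. (5): minimize p*||y - s||^2 + (1-p)*s^T Dk^T Dk s over s;
    the minimizer solves a linear system in closed form."""
    n = len(y)
    D = np.diff(np.eye(n), n=k, axis=0)   # k-th order difference matrix
    A = p * np.eye(n) + (1 - p) * D.T @ D
    return np.linalg.solve(A, p * np.asarray(y, dtype=float))
```

Setting the gradient of (5) to zero gives $(pI + (1-p)D_k^{\top}D_k)\,\mathbf{s} = p\,\mathbf{y}$, which is what spline_smooth solves.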
3.3. Unit selection improvements

When the prosody of the selected speech units is used, unit selection failures in terms of accents and intonation strongly affect the synthetic speech. In conventional methods, since the (generated) input prosody is used for waveform generation, prosody mismatches of the selected speech units are not a large problem. For the proposed method, stable and natural
unit selection is required. Therefore, the cost function was refined to improve unit selection. First, the f0 concatenation cost calculation was modified. The conventional f0 concatenation cost calculation returns 0 if the previous speech unit is unvoiced, which means the f0 difference is not considered when an unvoiced speech unit lies between voiced speech units. This can cause an abrupt jump in the voiced f0. To prevent this, the f0 concatenation cost is calculated between the voiced phoneme after an unvoiced phoneme and the voiced phoneme before it. By backtracking the DP path from the previous candidate unvoiced speech units, the previous voiced speech unit for each candidate unit is determined, and the f0 concatenation cost is calculated between that previous voiced speech unit and the current voiced unit.

Other modifications concern the grammatical context cost. Whereas the effect of the grammatical context cost is small in the conventional method, it must be designed carefully for the proposed method. For the grammatical context cost, the distances in syllables from the beginning and end of the current sentence, breath group, and word are calculated, along with the distance in syllables to the accented syllable in the word. A sentence style cost that evaluates whether the sentence is a yes-no question, a wh-question, an exclamation, or a normal sentence is also added to the cost function.

4. EXPERIMENTS

To show the effectiveness of the proposed method, an MOS test was conducted. The test compared synthetic speech generated by four systems. The first system, "BASELINE", was our baseline plural unit selection and fusion method, with the units modified to match the predicted input prosody. The second system, "SINGLE", is a more conventional unit selection system that selects a single unit for each half-phone and uses the prosody of the selected units. The "MULTIPLE" system differs from "BASELINE" in that it regenerates the final prosody from the prosody of the multiple units selected for each half-phone. Finally, the "PROPOSED" system uses both single and multiple selected units based on adjacency and regenerates the prosody from the selected units. For all systems that use multiple speech units for half-phone segments, the number of speech units per segment was set to 10.

Sixteen subjects, all native American English speakers who were not speech researchers, took part in the evaluation. Their ages ranged from 20 to 60, and male (8) and female (8) participants were equally distributed across the age range. The evaluation included a total of 45 sentences, and each subject evaluated 180 stimuli (45 sentences × 4 systems). Subjects rated the quality of the stimuli on a 5-point scale: 5: excellent, 4: good, 3: fair, 2: poor, 1: bad. The speech unit database was created from an American English female speaker and included 2611 sentences (3.17 hours).

The results of the evaluation are shown in Figure 4. The proposed method received the highest MOS score. The p-values of paired t-tests between "PROPOSED" and "BASELINE", "PROPOSED" and "SINGLE", and "PROPOSED" and "MULTIPLE" are 0.00, 0.011, and 0.002, respectively.
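As an illustration of the significance test used here, a paired t-test over per-stimulus scores could be run as below (assuming SciPy; the arrays are synthetic placeholders, not the actual ratings).

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Placeholder ratings only; the real test pairs each listener's scores
# for the same stimulus under the two systems being compared.
proposed = np.clip(rng.normal(3.51, 0.9, size=180).round(), 1, 5)
baseline = np.clip(rng.normal(3.26, 0.9, size=180).round(), 1, 5)

t_stat, p_value = ttest_rel(proposed, baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # significant when p <= 0.05
```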
Fig. 4. MOS evaluation results: BASELINE 3.26, SINGLE 3.39, MULTIPLE 3.36, PROPOSED 3.51.
Thus, a significant difference (p-value ≤ 0.05) is observed between the proposed method and the conventional methods.

5. CONCLUSION

In this paper, we proposed a speech synthesis method that combines a natural-waveform concatenation based speech synthesis method with the plural speech unit selection and fusion method. The two main features of the proposed method are prosody regeneration from the selected speech units and the use of multiple speech units at non-adjacent segments. A listening test showed that the MOS score of the proposed method is significantly higher than those of the conventional methods. Our future work is to improve the stability of the proposed method and to create a small-footprint system.

6. REFERENCES

[1] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," Proc. ICASSP-96, pp. 373–376, 1996.
[2] T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano, "Unit selection algorithm for Japanese speech synthesis based on both phoneme unit and diphone unit," Proc. ICASSP 2002, pp. 465–468, 2002.
[3] A. K. Syrdal, C. W. Wightman, A. Conkie, Y. Stylianou, M. Beutnagel, J. Schroeter, V. Strom, K. Lee, and M. J. Makashay, "Corpus-based techniques in the AT&T NEXTGEN synthesis system," Proc. ICSLP 2000, pp. 410–415, 2000.
[4] T. Mizutani and T. Kagoshima, "Concatenative speech synthesis based on the plural unit selection and fusion method," IEICE Trans., vol. E88-D, no. 11, pp. 2565–2572, 2005.
[5] M. Tamura, T. Mizutani, and T. Kagoshima, "Fast concatenative speech synthesis using pre-fused speech units based on the plural unit selection and fusion method," IEICE Trans., vol. E90-D, no. 2, pp. 544–553, 2007.
[6] T. Kagoshima, M. Morita, S. Seto, and M. Akamine, "An f0 contour control model for totally speaker driven text to speech system," Proc. ICSLP'98, pp. 1975–1978, Dec. 1998.
[7] C. Hayashi, "On the quantification of qualitative data from the mathematico-statistical point of view (an approach for applying this method to the parole prediction)," Annals of the Institute of Statistical Mathematics, pp. 35–47, Dec. 1950.