DOMAIN ADAPTATION METHODS IN THE IBM TRAINABLE TEXT-TO-SPEECH SYSTEM

V. Fischer, J. Botella Ordinas, S. Kunzmann

IBM Voice Systems, European Voice Technology Development
Gottlieb-Daimler-Straße 12, D-68165 Mannheim, F.R. of Germany
[email protected]
ABSTRACT

This paper presents a comparison of domain adaptation techniques for a unit selection based text-to-speech system. The methods under investigation consider two different prerequisites, namely the absence and the existence of additional domain specific training prompts spoken by the original voice talent. Whereas in the first case we employ domain specific pre-selection, for the latter we compare a variety of methods that range from a simple extension of the segment inventory to a complete reconstruction of the system, which also includes the training of decision trees for the domain dependent prediction of prosody targets. An experimental evaluation of the methods under consideration reveals significant improvements (up to 1.1 on a 5-point MOS scale) over the baseline system for sentences from the target domain, while showing no significant degradation when synthesizing sentences from other than the adaptation domain.

1. INTRODUCTION

Natural sounding speech output is one of the key elements for a wide acceptance of natural language understanding systems, and it is indispensable for voice driven interfaces that cannot make use of other output modes, such as plain text or graphics. Recently, major progress in the field has been made by the development of corpus based text-to-speech systems, cf. [1, 2, 3], which make use of short intervals of stored speech that are retrieved, concatenated, and manipulated during synthesis. While this approach achieves almost natural speech output for a wide variety of applications, the best results are obtained for those domains that have been thoroughly covered by the training corpus [4]. However, since it is apparently impossible to adequately cover all potential applications in advance, there is a need for domain specific adaptation, a concept that has previously been applied with success in areas such as automatic speech recognition. Previous work in this direction deals with the fast and reliable construction of speech databases that are close to the target domain [4] and also concentrates on the creation of optimized adaptation scripts [5].

Based on the considerations above, the work presented here investigates various methods for the adaptation of a concatenative text-to-speech synthesizer towards a particular application or domain. Our study deals with two different prerequisites, namely the absence and the existence of additional domain specific training prompts spoken by the original voice talent, and it employs adaptation of the back-end database as well as of the front-end. In doing so, one particular goal of this work is to retain the flexibility of the unit selection approach: while sentences from the adaptation domain should be synthesized with improved quality, the modified system should produce speech in the quality of the baseline system when synthesizing text from domains other than the adaptation domain.

The remainder of this paper is organized as follows: Section 2 briefly describes the current status of the IBM trainable speech synthesis system that serves as a test bed for our investigations. Here we concentrate on system features that enable the use of data driven adaptation techniques, which are discussed and evaluated in Sections 3 and 4, respectively. Finally, Section 5 gives a conclusion and an outlook on future work.

2. SYSTEM OVERVIEW

The evolution of the text-to-speech system at IBM is documented in [2, 6], and the most recent version is described in some detail in [7]. The system's basic synthesis units are subphoneme-sized speech segments, which correspond to context-dependent Hidden-Markov-Model states that are identified by decision tree growing during the construction of the system. While previous versions of the system were based on a script of 1400-2000 phonetically balanced sentences, the training of the system now relies on a script of about 10K sentences (15 hours of speech, including silence) that were recorded in a professional recording studio. The extended recording script additionally includes a mixture of newspaper articles and emails about different topics, weather forecasts, proper names (cities, persons, company names), digits and natural numbers, dates and times, and a variety of prompts that are related to various popular voice driven applications (e.g. air travel information and finance).

In order to obtain the most accurate alignments for the creation of the acoustic unit inventory from the recorded speech, we used speaker specific baseforms for the construction of general purpose synthesizers, which, similar to the approach described in [8], includes extensions to the native phonologies for languages that exhibit a large number of English loan words. However, we found some benefit in the use of speaker independent baseforms (as produced by the rule-based front-end) for the adaptation of the synthesizer towards a particular domain, cf. Section 4.

During synthesis, an independent rule-based front-end is used to perform text normalization, text-to-phone conversion, and phrase boundary generation. Preprocessed phrases are passed to the back-end, which employs a Viterbi beam search to generate the synthetic speech. Recent algorithmic changes include a data driven approach to segment pre-selection [9] and the introduction of a decision tree based approach for the prediction of pitch and duration targets that has meanwhile replaced the rule-based component in the front-end [7].
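Conceptually, the back-end's segment search can be pictured with the following minimal sketch. This is an illustrative reconstruction under assumptions, not the IBM implementation: the beam width, the per-unit candidate lists, and the cost callbacks `target_cost` and `join_cost` are placeholders introduced for the example, and the pruning is a simplified beam over whole partial paths rather than the system's actual Viterbi recursion.

```python
import heapq

def unit_selection_search(targets, candidates, target_cost, join_cost, beam=8):
    """Beam search for a low-cost segment sequence.

    targets      -- one target specification per synthesis unit
    candidates   -- candidates[i] lists the inventory segments for targets[i]
    target_cost  -- target_cost(seg, tgt): deviation from pitch/duration targets
    join_cost    -- join_cost(prev, seg): concatenation cost (zero if contiguous)
    """
    hyps = [(0.0, [])]  # (accumulated cost, chosen segments so far)
    for tgt, cands in zip(targets, candidates):
        expanded = []
        for acc, path in hyps:
            for seg in cands:
                cost = acc + target_cost(seg, tgt)
                if path:
                    cost += join_cost(path[-1], seg)
                expanded.append((cost, path + [seg]))
        # prune to the `beam` cheapest partial paths
        hyps = heapq.nsmallest(beam, expanded, key=lambda h: h[0])
    return min(hyps, key=lambda h: h[0])[1]
```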
3. DOMAIN ADAPTATION

In many of today's voice driven applications, speech output is produced from text that is generated by some kind of answer generation module. The number of application specific prompts is therefore limited in principle, but may nevertheless be high, and, due to design modifications, new prompts may be needed during the lifetime of an application; consider a stock information system with hundreds of company names coming and going as an example. The sketched scenario requires the adapted synthesizer to be flexible enough to produce high quality speech for text not seen in the adaptation data (which makes the use of prerecorded prompts prohibitive), and it suggests data driven adaptation methods in order to easily cope with dynamically changing applications. Consequently, we consider decision tree growing and pre-selection as starting points for adaptation:

• Decision tree growing: Acoustic context definitions and prosody target prediction were adapted by growing the respective decision trees from a data set that is augmented with domain specific prompts. In order to achieve a stronger bias towards domain specific intonation, we also created pitch and duration prediction trees solely from the adaptation data.

• Alternatively, we also omitted the growing of a new acoustic context tree and enlarged the segment inventory by making use of the baseline synthesizer's context definitions.

• Pre-selection: In order to decrease the system size without reducing the speech output quality, segments are retained based on the number of times they were used in synthesizing a test corpus of several tens of thousands of sentences [9]. Adaptation is done by creating segment counts from a domain specific corpus and subsequently smoothing them with counts computed from the larger, domain independent corpus (a sketch of this counting scheme is given after this list). Note that this method is particularly inexpensive, since it can be applied without the recording of additional prompts.

Finally, we experimented with speaker independent (front-end) baseforms for the alignment of the adaptation data. Here, the idea is to enforce the creation of longer contiguous segments during synthesis by providing a better match between domain specific acoustic contexts in the segment database and those produced by the synthesizer's front-end.
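The count-based adaptation of the pre-selection can be pictured as follows. This is a minimal sketch under assumptions: the interpolation weight `alpha`, the retention budget `keep_top`, and the representation of segments as plain identifiers are all introduced for the example and are not taken from the paper.

```python
from collections import Counter

def preselect(domain_usage, general_usage, alpha=10.0, keep_top=50000):
    """Retain inventory segments according to smoothed usage counts.

    domain_usage  -- segment ids logged while synthesizing the domain corpus
    general_usage -- segment ids logged while synthesizing the large,
                     domain independent corpus
    """
    domain_counts = Counter(domain_usage)
    general_counts = Counter(general_usage)
    # smooth the domain specific counts with the domain independent ones;
    # alpha compensates for the much smaller domain corpus
    smoothed = {
        seg: alpha * domain_counts[seg] + general_counts[seg]
        for seg in domain_counts.keys() | general_counts.keys()
    }
    # keep the most heavily used segments
    ranked = sorted(smoothed, key=smoothed.get, reverse=True)
    return set(ranked[:keep_top])
```

The same weighted combination of domain specific and general counts also underlies the construction of system B in Section 4.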
4. EXPERIMENTS

Domain specific adaptation of both the segment inventory and the prosody targets was experimentally evaluated for a male German voice. The baseline system (system A in the figures below) was created from 5000 sentences (approximately 7 hours of speech). The recording script contained 2000 phonetically balanced sentences plus 3000 sentences randomly selected from the remainder of the recording script, cf. Sec. 2, that originate from all but the adaptation domain. After employing a text corpus of approx. 100k sentences for pre-selection, the baseline system was carefully tuned, and its parameters were fixed for all subsequent adaptation experiments.

The adaptation domain is defined by 250 sentences that represent the most frequent answers given by a dialog system that handles queries about the German train timetable; 200 of the 250 sentences served as adaptation data, while the remaining 50 sentences were used for testing. Since the sentences are constructed by a template-based answer generation module, each of the 50 test sentences contains at least one word, usually a proper name or a time or date expression, that is included neither in the adaptation script nor in the 5000 sentences used for the training of the baseline system. Based on the data described above, the baseline system was adapted as follows:
• For system B, no additional recordings were used, but the 200 adaptation sentences were added to the pre-selection corpus. Segment counts created from the adaptation sentences were multiplied by a heuristically determined constant in order to give more weight to the much smaller adaptation corpus.

• For system C, the 200 prompts were added to the training corpus, and the entire system was rebuilt from the augmented data set.

• System D uses the baseline system's acoustic context tree and domain independent prosody target prediction. Domain specific prompts were added to the segment inventory after pre-selection, which slightly increases the database size, but ensures that the entire adaptation data is available for synthesis.

• System E was created in the same fashion as system D, but used speaker independent front-end baseforms in order to achieve longer contiguous segments.

• Finally, system F additionally makes use of domain specific decision trees for prosody target prediction (a toy sketch of such prediction trees is given after this list). Those were trained solely from the 200 adaptation utterances, and no attempt at smoothing the targets was made.
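For system F, decision tree growing for prosody targets can be illustrated with off-the-shelf regression trees. This is a toy sketch only: the use of scikit-learn, the numeric context features, and the leaf-size setting are assumptions made for the example; the actual system grows its own trees over phonetic context questions [7].

```python
from sklearn.tree import DecisionTreeRegressor

def train_prosody_trees(contexts, pitch_targets, duration_targets):
    """Grow pitch and duration prediction trees from (adaptation) data.

    contexts -- numeric feature vectors encoding the phonetic context of each
                unit; targets are the per-unit pitch and duration values
    """
    pitch_tree = DecisionTreeRegressor(min_samples_leaf=20)
    duration_tree = DecisionTreeRegressor(min_samples_leaf=20)
    pitch_tree.fit(contexts, pitch_targets)
    duration_tree.fit(contexts, duration_targets)
    return pitch_tree, duration_tree

# For system F the trees would be fit on the 200 adaptation utterances only;
# at synthesis time the front-end queries them for each unit's targets.
```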
In addition to experiments with the adaptation test script (50 sentences, TEST in the figures below), we also re-synthesized the adaptation training script (200 sentences, TRAIN), and carried out experiments with a portion of our regular, domain independent test script (10 sentences, GENERAL). While the first reflects our primary interest in the synthesis of text from an unlimited, but restricted domain, re-synthesis of the training script enables us to evaluate adaptation methods under a closed domain assumption, and the latter allows us to measure the effects of adaptation on the synthesis of general text.

For a quick evaluation of the various systems we measured the average number of non-contiguous segment concatenations (the splice rate), the average length of contiguous segments in the synthetic speech, and the average synthesis costs. The latter are a weighted sum of three subcosts that capture spectral and pitch discontinuities at segment boundaries and the deviation of a candidate segment's pitch and duration from the required targets. All of these measures have recently been found to correlate with the perceived naturalness of synthetic speech under various experimental conditions, cf. [10, 11].
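The objective measures above can be computed along the following lines. This is a minimal sketch under stated assumptions: the contiguity predicate and the three subcost callbacks stand in for the system's actual (unpublished) cost components, and the weights are placeholders.

```python
def splice_rate(segments, contiguous):
    """Fraction of joins that are non-contiguous splices.

    contiguous(a, b) is True if segment b directly followed a in the recordings.
    """
    joins = list(zip(segments, segments[1:]))
    if not joins:
        return 0.0
    return sum(1 for a, b in joins if not contiguous(a, b)) / len(joins)

def average_run_length(durations, contiguous_flags):
    """Average duration (seconds) of maximal runs of contiguous segments.

    durations        -- per-segment durations in seconds (non-empty)
    contiguous_flags -- contiguous_flags[i] is True if segment i+1 is
                        contiguous with segment i
    """
    runs, current = [], durations[0]
    for dur, is_contiguous in zip(durations[1:], contiguous_flags):
        if is_contiguous:
            current += dur
        else:
            runs.append(current)
            current = dur
    runs.append(current)
    return sum(runs) / len(runs)

def synthesis_cost(segments, targets, spectral, pitch_jump, target_dev,
                   weights=(1.0, 1.0, 1.0)):
    """Weighted sum of spectral and pitch join subcosts plus target deviations."""
    w_s, w_p, w_t = weights
    join_cost = sum(w_s * spectral(a, b) + w_p * pitch_jump(a, b)
                    for a, b in zip(segments, segments[1:]))
    target_cost = sum(w_t * target_dev(seg, tgt)
                      for seg, tgt in zip(segments, targets))
    return join_cost + target_cost
```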
Fig. 1. Average splice rate in synthetic speech for three different scripts (see text). System A denotes the baseline system, B-F denote domain adapted synthesizers.
Fig. 2. Average length (in seconds) of contiguous segments in synthetic speech for three different scripts.
Figures 1-3 give results for the three different test cases considered in our experiments. While all adaptation methods affect the synthesis of general text only to a very small degree, we see small benefits from domain specific pre-selection (system B) only in the case of the closed domain assumption (script TRAIN). Results for systems C-F demonstrate the benefits of using additional prompts for domain adaptation. As expected, the effects of adaptation are more evident for the training corpus. The almost equal figures for systems C and D further suggest that domain specific acoustic contexts are already well represented in the baseline system. The use of front-end baseforms for the alignment of the adaptation data and of domain specific prosody targets (systems E and F) provides a further reduction of the average synthesis costs and the splice rate, but begins to slightly increase the average synthesis costs for out-of-domain text. As a reason for the latter, we found a strong increase in the pitch target costs.

For the case of restricted, but unlimited domain synthesis (script TEST) we also conducted a listening test on a 5-point scale, which included original recordings of the script as well. Mean opinion scores are given in Fig. 4, showing a relative increase of 51.6 percent (1.1 on the MOS scale) for system E. Systems B-F all perform statistically significantly better than the baseline, but the differences between the adapted systems are mostly insignificant.

Fig. 3. Average (normalized) costs for the synthesis of text from three different scripts.

Fig. 4. Listening test results (5-point scale) for the adaptation test script (TEST) for the baseline synthesizer (A), the adapted synthesizers (B-F), and original speech.

5. CONCLUSIONS

In this paper we have addressed the adaptation of a unit selection based speech synthesizer towards a particular domain. Using a medium-sized amount of adaptation data, we have found both a significant reduction of the synthesis costs and of the number of non-contiguous segments in the synthetic speech, and we have also obtained significantly improved MOS scores. Future work will focus on a reduction of the amount of adaptation data and on the use of data from a different speaker. The relatively poor correlation between MOS scores on the one hand and splice rate and synthesis costs on the other suggests that additional improvements can be obtained from a domain specific tuning of the cost function, which will also be the subject of further investigations.

Acknowledgements. The authors would like to thank all participants in the listening tests, and would like to acknowledge the many fruitful discussions with the members of the text-to-speech research and development group within IBM, in particular with Maria Smith, Raimo Bakis, and Jorge Gonzalez.

6. REFERENCES
[1] A. Hunt and A. Black, “Unit Selection in a Concatenative Speech Synthesis System using a Large Speech Database,” in Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Atlanta, 1996, vol. 1, pp. 373–376.

[2] R. Donovan and E. Eide, “The IBM Trainable Speech Synthesis System,” in Proc. of the 5th Int. Conf. on Spoken Language Processing, Sydney, 1998.

[3] M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A.K. Syrdal, “The AT&T Next-Gen TTS System,” in Proc. of the Joint Meeting of ASA, EAA, and DAGA, Berlin, Germany, 1999.
[4] A. Black and K. Lenzo, “Limited domain synthesis,” in Proc. of the 6th Int. Conf. on Spoken Language Processing, Beijing, 2000, pp. 411–414.
[5] M. Chu, C. Li, H. Peng, and E. Chang, “Domain Adaptation for TTS Systems,” in Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Orlando, Fl., 2002.
[6] R. Donovan, A. Ittycheriah, M. Franz, B. Ramabhadran, E. Eide, M. Viswanathan, R. Bakis, W. Hamza, M. Picheny, P. Gleason, T. Rutherfoord, P. Cox, D. Green, E. Janke, S. Revelin, C. Waast, B. Zeller, C. Guenther, and S. Kunzmann, “Current Status of the IBM Trainable Speech Synthesis System,” in Proc. of the 4th ISCA Tutorial and Research Workshop on Speech Synthesis, Edinburgh, Scotland, 2001.

[7] E. Eide, A. Aaron, R. Bakis, P. Cohen, R. Donovan, W. Hamza, T. Mathes, M. Picheny, M. Polkosky, M. Smith, and M. Viswanathan, “Recent Improvements to the IBM Trainable Speech Synthesis System,” in Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Hong Kong, 2003.

[8] M. Jilka and A.K. Syrdal, “The AT&T German Text-to-Speech System: Realistic Linguistic Description,” in Proc. of the 7th Int. Conf. on Spoken Language Processing, Denver, Colorado, 2002.

[9] W. Hamza and R. Donovan, “Data-driven Segment Preselection in the IBM Trainable Speech Synthesis System,” in Proc. of the 7th Int. Conf. on Spoken Language Processing, Denver, 2002.

[10] M. Chu and H. Peng, “An Objective Measure for Estimating MOS of Synthesized Speech,” in Proc. of the 7th Europ. Conf. on Speech Communication and Technology, Aalborg, Denmark, 2001.

[11] T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano, “Perceptual Evaluation of Cost for Segment Selection in Concatenative Speech Synthesis,” in Proc. of the IEEE 2002 Workshop on Speech Synthesis, Santa Monica, Ca., 2002.