HIGH-QUALITY SPEECH OUTPUT GENERATION THROUGH ADVANCED PHRASE CONCATENATION
E.A.M. Klabbers
IPO: Center for Research on User-System Interaction
P.O. Box 513, 5600 MB Eindhoven, the Netherlands
[email protected]
ABSTRACT
This paper describes a method for generating natural sounding speech, called phrase concatenation, which is used in a telephone inquiry system that provides train timetable information. The concatenation technique used combines pre-recorded words and phrases, but is new in that it involves the recording of several prosodically different versions of otherwise identical phrases. Although no formal evaluation has taken place yet, we feel confident in saying that the output meets high quality standards and approaches the quality of natural speech.
1. INTRODUCTION
During the last decade, the performance of spoken dialogue systems has improved substantially. At the moment it is possible to support a number of simple practical tasks in limited domains. As a result, many telephone-based information systems are being developed in different countries. The practical goal of the NWO-TST Priority Programme is to build a prototype of a Dutch train timetable information system. The system is called OVIS, a Dutch acronym for Openbaar Vervoer Informatie Systeem (Public Transport Information System). The scientific goal of the programme is to obtain a deeper basic knowledge in each of the research areas involved, viz. speech recognition, linguistic analysis, dialogue management, language generation and speech output generation. In this contribution, a high-quality speech output generation technique for OVIS is presented, which concatenates pre-recorded words and phrases taking prosodic properties into account. This research is carried out within the framework of the Priority Programme Language and Speech Technology (TST). The TST-Programme is sponsored by the Netherlands Organization for Scientific Research (NWO).
2. SPEECH OUTPUT GENERATION METHODS
In spoken dialogue systems, where human users interact with computers over the public telephone network, it is essential that the voice output interface be of high quality. Both the intelligibility and the naturalness should be sufficiently high. There are several methods for providing a system with speech output, each with its own advantages and disadvantages. Three methods will be distinguished here, viz. the use of pre-recorded speech, speech synthesis and speech concatenation.
2.1. The use of pre-recorded speech
A maximum degree of naturalness can be achieved by playing back digitally stored natural speech. The quality of the speech output is limited only by the medium, e.g. a standard telephone channel, through which it is transmitted. There are several disadvantages to this approach. Firstly, memory limitations will become a problem once the vocabulary of the system becomes moderately large. Secondly, the approach is highly inflexible in that entire messages have to be recorded once again to update the vocabulary. Thirdly, variation in accentuation, word order and phrasing cannot be dealt with.
2.2. Speech synthesis
An alternative that yields a maximum degree of flexibility is the use of synthetic speech. This method requires much less storage than stored-waveform techniques. State-of-the-art synthesis is most often based on diphone concatenation, which achieves high intelligibility in laboratory situations, but its intelligibility decreases significantly in telephone situations. A recent evaluation of three Dutch speech synthesizers (Rietveld, Kerkhoff, Emons, Meijer, Sanderman, and Sluijter 1997) has shown that in a PSTN (standard telephone) condition the average intelligibility for diphone synthesis is about 70%, whereas in a GSM condition the intelligibility decreases to about 57%. Furthermore, synthetic speech is still far from natural.
2.3. Speech concatenation
The key is to find a balance in the trade-off between naturalness and flexibility. In that respect, concatenating pre-recorded units like words and phrases appears to be a good alternative. With this approach, a large number of utterances can be pronounced on the basis of a limited set of pre-recorded phrases, saving memory space and increasing flexibility. This technique is practical only if the application domain is limited and remains rather stable, as is the case with train timetables. Moreover, storage capacities must be sufficiently large, but this is not expected to be a problem for current telecom servers.
The use of concatenative speech in limited-domain applications such as OVIS is quite common. It is used in commercial applications such as the speaking clock, telephone banking systems, market research teleservices and travel information services. But often the method is so straightforward that it is not even mentioned. In the German train timetable system (Aust, Oerder, Seide, and Steinbiss 1995) and in the first version of OVIS, which is based on this German system, rather natural speech output was obtained by simply recording the necessary words and phrases and playing back the concatenated sentences when required.
This approach has two major bottlenecks. Firstly, very careful control of the recordings is needed. Usually, this is not taken into account, so that differences in loudness, rhythm and pitch patterns can occur, leading to disfluent speech. Phrases seem to overlap in time and create the impression that several speakers are talking at the same time, at different locations in the room. In order to disguise these prosodic imperfections, pauses are often inserted, which are clearly audible and make the speech sound less natural. Secondly, the words containing variable information such as station names, times and dates are recorded in one prosodically neutral version only.
This makes it practically impossible to exploit the two most important functions of prosody, and especially intonation, namely:
1. highlighting information structure by means of accentuation, i.e. by accenting important and new information while deaccenting old or given information.
2. highlighting linguistic structure by means of prosodic phrasing, i.e. by melodically marking certain syntactic boundaries and by using pauses at the appropriate places.
The straightforward concatenation technique cannot deal properly with the variability in accentuation and phrasing and is thus not well suited to be used in combination with the language generation module which introduces this variability.
One method of concatenating speech in which the prosodic properties are taken into account is the word concatenation technique used in a computer-assisted language learning program called Appeal (de Pijper 1997). Here, the words have been recorded embedded in carrier sentences to do justice to the fact that words are shorter and often more reduced when spoken in context. The duration and pitch of the words are adapted to the context using the PSOLA technique (Charpentier and Moulines 1989). This ensures a natural prosody, but the coding algorithm will lead to deterioration of the quality of the output speech.
3. PHRASE CONCATENATION
Our approach to concatenating words and phrases requires no manipulation or coding of the recordings, so no loss in quality can occur at that point. A good speech output quality with natural intonation is achieved by using several prosodic variants of otherwise identical words and phrases. To determine which phrases and words have to be recorded and how many different prosodic realizations are required, a thorough analysis of the material to be generated by the system is a necessary phase in the development of a phrase database.
3.1. Analyzing content and prosodic properties
The generation of the messages, their content and prosodic properties, is the responsibility of the language generation module. The sentences are generated on the basis of templates in the form of syntactic trees which consist of fixed parts (carriers) and variable parts (slots). Usually, a fixed phrase that serves as a carrier can be recorded as a whole. Sometimes, it is more convenient to split the carrier into two or more phrases, if parts of the carrier occur in several other carrier sentences as well. The slots in the templates deserve special attention, because there the variable (and usually the most important) information is inserted. For the slot fillers, the computation of the prosodic properties by the language generation module is most important. Two important parameters are controlled, viz. accentuation and phrasing.
Accentuation: A word can be either accented or unaccented. In the text that is enriched with prosodic markers, accented words are marked by a preceding double quote ("). Deaccentuation rules are based on the given-new distinction (van Deemter 1994). As mentioned before, proper accentuation highlights the information structure of an utterance. Deaccentuation is necessary in the OVIS dialogues because accentuating given information gives unnatural results and can even lead to wrong interpretations. For instance, in the implicit verification sentence Hoe "laat wilt u naar "Brussel reizen?, 'What time do you wish to travel to Brussels?', accenting Brussels would imply to the caller that he has listed several destinations and Brussels is just one of them.
Phrasing: Three phrase boundary strengths are distinguished. The sentence-final boundary (indicated by three slashes, ///) is the strongest one. Words which are clause-final or which precede a punctuation symbol other than a comma are followed by a major boundary (//). A minor boundary (/) precedes a comma and certain syntactic constituents. In longer texts containing more complicated constructions, it might be desirable to distinguish more levels. Sanderman (1996) uses five levels to achieve more natural phrasing.
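The marker conventions of this section can be made concrete with a small parser. The following is a minimal sketch, not the OVIS implementation: it assumes that the accent marker is a double quote attached to the front of a word and that boundary markers (/, //, ///) appear as separate, whitespace-delimited tokens, as in the example sentences. The names Word and parse_enriched are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str        # the word without its accent marker
    accented: bool   # True if the word was preceded by a double quote
    boundary: int    # 0 = none, 1 = minor (/), 2 = major (//), 3 = final (///)


def parse_enriched(text: str) -> List[Word]:
    """Turn enriched text into words annotated with accent and boundary."""
    words: List[Word] = []
    for token in text.split():
        if re.fullmatch(r"/{1,3}", token):
            # A boundary marker applies to the word that precedes it.
            if words:
                words[-1].boundary = len(token)
            continue
        accented = token.startswith('"')
        words.append(Word(token.lstrip('"'), accented, 0))
    return words
```

Each parsed word then carries exactly the two parameters that the language generation module controls, accentuation and phrasing.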
3.2. Determining prosodic realizations
Once the content and the prosodic properties of the messages to be made audible are known, a phrase database can be developed. For the slot fillers, we chose to use six different prosodic realizations, which are depicted in Figure 1. They are stylizations of the most common realizations occurring in each of the contexts. In the Grammar of Dutch Intonation ('t Hart, Collier, and Cohen 1990), prosodic realizations are described in the form of contours, which in turn are made up of pitch movements, either rises or falls. The different pitch contours needed are explained below.

            BOUNDARIES
ACCENTS     NONE    MINOR / MAJOR    FINAL
YES         1       2                3
NO          4       5                6

Figure 1: Stylized examples of the pitch contours needed
1. An accented slot filler which does not occur before a phrase boundary is produced with the most frequently used pitch movement, the so-called hat pattern, which consists of a rise and fall on the same syllable. This contour often corresponds to the prosodically neutral version that is used in straightforward concatenation techniques. In some constructions, such as time expressions, the fall is delayed to fall on the last word in the expression. This creates the so-called flat hat, which in Figure 1 is obtained by combining the rise of (1) with the fall of (3).
2. An accented slot filler which occurs before a minor or a major phrase boundary is most often produced with a rise to mark the accent and an additional continuation rise to signal that there is a non-final boundary. A short pause follows the constituent.
3. An accented slot filler which occurs in final position receives a final fall. A longer pause follows. This contour co-occurs with a rise in a preceding word.
4. Unaccented slot fillers are pronounced in a neutral fashion without any pitch movement associated with them.
5. Unaccented slot fillers occurring before a minor or a major phrase boundary only receive a small continuation rise. This type of word does not occur very often in the OVIS domain. The language generation module usually puts a minor or major phrase boundary immediately after an accented word.
6. Unaccented slot fillers in a final position are produced with final lowering.
When recording the material for the phrase database, the slots in the carrier sentences are filled with dummy words so that the fixed phrases to be stored in the database can be excised easily. Coarticulation is never a problem. Fade-in and fade-out have been applied to all material in the phrase database to avoid clicks in concatenation. Besides, the slot fillers are surrounded by 50 ms pauses, which are not clearly audible, but make the speech sound less hasty.
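The choice among the six realizations depends only on whether the slot filler is accented and on the boundary that follows it. The sketch below illustrates that mapping under the numbering of Figure 1. The function name select_contour and the integer encoding of boundary strengths (0 = none, 1 = minor, 2 = major, 3 = final) are assumptions made for illustration, and the flat hat used in time expressions is not modelled.

```python
def select_contour(accented: bool, boundary: int) -> int:
    """Return the contour number (1-6) for a slot filler.

    boundary: 0 = no boundary, 1 = minor, 2 = major, 3 = sentence-final.
    Minor and major boundaries share one realization.
    """
    if boundary not in (0, 1, 2, 3):
        raise ValueError("boundary must be 0, 1, 2 or 3")
    if boundary == 0:
        column = 0          # no boundary follows
    elif boundary == 3:
        column = 2          # sentence-final position
    else:
        column = 1          # minor or major boundary
    first_in_row = 1 if accented else 4
    return first_in_row + column
```

For example, an accented station name followed by a minor boundary maps to contour 2, and an unaccented word in sentence-final position maps to contour 6.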
The slot fillers such as station names and time and date expressions are embedded in dummy sentences that provide the right prosodic context. The sentences are constructed in such a way as to make the speaker produce the right prosodic realization naturally. The speaker receives no instructions as to how to produce the sentences. The intonation in the fixed phrases is not so critical, so the speaker may use his own intuitions to determine how to pronounce them.
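The fade-in and fade-out and the 50 ms pauses around slot fillers mentioned in this section can be illustrated as follows. This is a rough sketch under several assumptions that the text does not specify: a linear fade shape, a fade length of 5 ms, and a sampling rate of 8000 Hz. Only the 50 ms pause duration is taken from the text. For brevity the fade is applied at concatenation time here, whereas in the phrase database it has been applied to the stored material beforehand.

```python
from typing import List, Sequence, Tuple

SAMPLE_RATE = 8000   # assumption: telephone-band sampling rate
FADE_MS = 5          # assumption: the fade length is not given in the text
PAUSE_MS = 50        # from the text: pauses surrounding slot fillers


def apply_fades(samples: Sequence[float], fade_len: int) -> List[float]:
    """Apply a linear fade-in and fade-out so the unit starts and ends at zero."""
    n = min(fade_len, len(samples) // 2)
    out = list(samples)
    for i in range(n):
        gain = i / n
        out[i] *= gain          # fade-in
        out[-1 - i] *= gain     # fade-out
    return out


def concatenate(units: Sequence[Tuple[Sequence[float], bool]],
                sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Join units given as (samples, is_slot_filler) pairs into one waveform."""
    fade_len = sample_rate * FADE_MS // 1000
    pause = [0.0] * (sample_rate * PAUSE_MS // 1000)
    out: List[float] = []
    for samples, is_slot_filler in units:
        faded = apply_fades(samples, fade_len)
        if is_slot_filler:
            out.extend(pause + faded + pause)   # 50 ms of silence on each side
        else:
            out.extend(faded)
    return out
```

Because every unit begins and ends at zero amplitude, joining two units cannot introduce a step in the waveform, which is what causes audible clicks.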
3.3. Generating speech
To concatenate the proper words and phrases, an algorithm has been created that performs a mapping between the enriched text, i.e. text with accentuation and phrasing markers, that is the output of the language generation module and the phrases that have to be selected. The different prosodic variants are selected on the basis of the prosodic markers. The algorithm recursively looks for the largest phrases to concatenate into sentences. As an example, consider the sentence in Figure 2 (English: 'On which day would you like to travel from Groningen to Paris?'). The sentence consists of five carrier phrases: op welke dag, wilt u, van, naar and reizen?. The two slot-filling station names Groningen and Parijs are both accented, but Groningen is realized with a continuation rise because of the minor phrase boundary following it.
op "welke "dag wilt u van "groningen / naar "parijs reizen? ///
Figure 2: Example sentence with two different prosodic realizations of a station name (groningen, parijs)
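A recursive search for the largest phrases can be sketched as a longest-match-first cover of the enriched text. The code below is an illustration under stated assumptions rather than the algorithm used in OVIS: the phrase inventory PHRASES, its file names, and the function cover are hypothetical, and the prosodic variant of each slot filler is encoded simply by including the markers in the lookup key.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical phrase inventory: marked-up word sequence -> recording.
# The digit in a station file name is the contour number of Figure 1.
PHRASES: Dict[Tuple[str, ...], str] = {
    ("op", '"welke', '"dag'): "op_welke_dag.wav",
    ("wilt", "u"): "wilt_u.wav",
    ("van",): "van.wav",
    ('"groningen', "/"): "groningen_2.wav",
    ("naar",): "naar.wav",
    ('"parijs',): "parijs_1.wav",
    ("reizen?", "///"): "reizen_final.wav",
}


def cover(tokens: Sequence[str],
          phrases: Dict[Tuple[str, ...], str]) -> Optional[List[str]]:
    """Cover the token sequence with stored phrases, trying the longest first.

    Returns the list of recordings to play, or None if no cover exists.
    """
    if not tokens:
        return []
    for length in range(len(tokens), 0, -1):
        head = tuple(tokens[:length])
        if head in phrases:
            rest = cover(tokens[length:], phrases)
            if rest is not None:
                return [phrases[head]] + rest
    return None


sentence = 'op "welke "dag wilt u van "groningen / naar "parijs reizen? ///'
playlist = cover(sentence.split(), PHRASES)
```

Trying the longest candidate first means that a carrier recorded as a whole is preferred over its parts, and the search backtracks to shorter phrases only if the remainder of the sentence cannot be covered.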
4. CONCLUSION
The output quality of the concatenated speech approaches that of natural speech. This is due to the fact that the phrases and words have been recorded taking phrasing and accentuation of the utterances into account. Informal evaluation shows highly satisfactory results, which is supported by the fact that in the latest version of the German train timetable information system (Aust et al. 1995), the essence of our approach has been successfully implemented. This approach is only suitable for applications where the vocabulary is relatively stable, so that recordings only have to be made once. With a different system called GoalGetter that generates spoken reports on soccer matches (Klabbers, Odijk, de Pijper, and Theune 1996), we experienced that it is difficult to keep the system up-to-date, as soccer players come and go every season. At the moment, there is one drawback to our approach. Constructing the recording script and excising all necessary words and phrases manually is quite time-consuming. Automating the segmentation task is possible, but it takes quite some effort to implement. Once automatic segmentation is implemented, quick prototyping will be possible.
References
Aust, H., M. Oerder, F. Seide, and V. Steinbiss (1995). The Philips automatic train timetable information system. Speech Communication 17, 249-262.
Charpentier, F. and E. Moulines (1989). Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. In Proceedings EUROSPEECH'89, Paris, France, Volume 2, pp. 13-19.
de Pijper, J. (1997). High quality message-to-speech generation in a practical application. In J. P. H. van Santen, R. W. Sproat, J. P. Olive, and J. Hirschberg (Eds.), Progress in Speech Synthesis, pp. 575-586. New York: Springer-Verlag.
Klabbers, E., J. Odijk, J. de Pijper, and M. Theune (1996). GoalGetter: From Teletext to Speech. In IPO Annual Progress Report, Volume 31, pp. 66-75.
Rietveld, T., J. Kerkhoff, M. Emons, E. Meijer, A. Sanderman, and A. Sluijter (1997). Evaluation of speech synthesis systems for Dutch in telecommunication applications in GSM and PSTN networks. To appear in EUROSPEECH'97, Rhodes, Greece.
Sanderman, A. (1996). Prosodic Phrasing: production, perception, acceptability and comprehension. Ph.D. thesis, Eindhoven University, Eindhoven.
't Hart, J., R. Collier, and A. Cohen (1990). A perceptual study of intonation: an experimental-phonetic approach to speech melody. Cambridge: Cambridge University Press.
van Deemter, K. (1994). What's new? A semantic perspective on sentence accent. Journal of Semantics 11, 1-31.