The First Spoken Slovak Dialogue System Using Corpus Based TTS

Gregor Rozinaj, Andrej Páleník, Jozef Čepko, Jozef Juhár¹, Anton Čižmár¹, Milan Rusko², Roman Jarina³

Department of Telecommunications, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology in Bratislava, Ilkovičova 3, 812 19 Bratislava 1, Slovak Republic
Phone: (421) 02 602 91 111, Fax: (421) 02 654 20 415, E-mail: [email protected], [email protected], [email protected]

¹ Technical University of Košice, Košice, Slovakia, e-mail: [email protected], [email protected]
² Slovak Academy of Sciences, Bratislava, Slovakia, e-mail: [email protected]
³ University of Žilina, Žilina, Slovakia, e-mail: [email protected]

Keywords: TTS, corpus-based speech synthesis, SLDS, spoken language dialogue system

Abstract - Research and development of the first Slovak spoken language dialogue system, with emphasis on the TTS module, is described in this paper. The first part of the paper describes the architecture of the system, which is based on the DARPA Communicator architecture. It consists of the Galaxy hub and telephony, automatic speech recognition (ASR), text-to-speech (TTS), backend, transport and VoiceXML dialogue management modules, and it enables multi-user interaction in the Slovak language. Its functionality is demonstrated and tested via two pilot applications, "Weather forecast for Slovakia" and "Timetable of Slovak Railways". The required information can be retrieved from Internet resources in multi-user mode through the PSTN, ISDN, GSM and/or VoIP networks. The second part of the paper focuses on the text-to-speech module in more detail, presenting the considerations and experience we gathered during the development of the speech synthesizer and the design of the speech database.

1. INTRODUCTION

Recent progress in the field of human speech recognition, understanding and text-to-speech synthesis makes new kinds of human-machine interaction possible: interaction based on human speech as the carrier of communication. Many possible applications emerge. In this paper we present IRKR, the first Slovak spoken language dialogue system (SLDS). It is the outcome of a triennial project funded by the Ministry of Education of the Slovak Republic. Four partners joined together in this project: the Technical University of Košice, the Slovak University of Technology in Bratislava, the University of Žilina and the Institute of Informatics of the Slovak Academy of Sciences in Bratislava. The main goal of the project was the research and development of a SLDS for information retrieval using voice interaction between human and computer. The developed system offers a multi-user, multi-platform, dialogue-driven voice interface for information retrieval through telecommunication and/or IP networks. The functionality of the system was demonstrated and tested via two pilot applications, "Weather forecast for Slovakia" and "Timetable of Slovak Railways", which retrieve information from Internet resources in multi-user mode through the PSTN, ISDN, GSM and VoIP networks.


The system has a "hub-and-spoke" modular architecture, which is described in more detail in Section 2. The advantage of this architecture is that each module can be developed independently, as long as it keeps the agreed-upon interface for communication between the modules and the hub. The module this paper focuses on is the TTS module, which was partly developed at our department. TTS stands for text-to-speech, the conversion of text into a speech signal. It is a very important part of the system because it is responsible for rendering the communication statements of the machine into a human-understandable form. Whereas one research group solves the problem of recognizing and understanding speech by the computer, our main goal is to make the computer talk so that people understand it.

Speech synthesis has a long and exciting history. The first documented attempt to imitate human speech was conducted by the Bratislava-born J. W. von Kempelen. Since then many other approaches have been used. The first breakthrough, however, was the rise of the computer era in the second half of the 20th century. The second breakthrough came at the beginning of the 1990s, when the speed and capacity of computers enabled the management of large speech databases and the use of computationally expensive algorithms. Nowadays, methods based on the concatenation of speech units prevail. In the presented system, two different approaches to speech synthesis from the "concatenation" domain are implemented: a diphone synthesizer and a corpus-based synthesizer. Each has its advantages, but the corpus-based synthesizer has better prospects for further improvement. This synthesizer is presented in Section 3. Finally, the concluding sections present conclusions and state options for further research, both for the TTS module and for the development of the whole system.

2. IRKR SYSTEM ARCHITECTURE

The architecture of the developed system, IRKR, is based on the DARPA Communicator [14]. DARPA Communicator systems use a "hub-and-spoke" architecture: each module seeks services from, and provides services to, the other modules by communicating with them through a central software router, the Galaxy hub. Java, along with C and C++, is supported in the API to the Galaxy hub. The system consists of the hub and six system modules: the telephony module, the automatic speech recognition (ASR) module, the text-to-speech (TTS) module, the transport module, the back-end module and the dialogue management module. The relationship between the dialogue manager, the Galaxy hub and the other system modules is represented schematically in Fig. 1.

Fig. 1. Architecture of the Galaxy/VoiceXML-based spoken Slovak dialogue system: the telephony server (connected to the telecommunication network and the user's telephone), the ASR server, the TTS server, the VoiceXML dialogue manager and the information server (connected to the Internet) all communicate through the central HUB.
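To make the routing idea concrete, the following is a minimal, hypothetical C++ sketch of a hub-and-spoke dispatcher. The class, service and message names are invented for illustration and do not reproduce the actual Galaxy hub API or its frame format; they only show how modules can ask a central router for services instead of talking to each other directly.

#include <functional>
#include <iostream>
#include <map>
#include <string>

// A "frame" here is a simple key-value message, loosely mimicking the kind of
// structured messages routed by a Communicator-style hub (simplified, hypothetical).
using Frame = std::map<std::string, std::string>;
using Handler = std::function<Frame(const Frame&)>;

class Hub {
public:
    // Each module registers the service it provides under a name.
    void registerService(const std::string& name, Handler h) {
        services_[name] = std::move(h);
    }
    // Any module asks the hub for a service; the hub routes the frame to the
    // provider and returns the reply. Modules never communicate directly.
    Frame request(const std::string& service, const Frame& in) {
        return services_.at(service)(in);
    }
private:
    std::map<std::string, Handler> services_;
};

int main() {
    Hub hub;
    // A stand-in for the TTS server: turns text into the name of a wav file.
    hub.registerService("synthesize", [](const Frame& f) {
        return Frame{{"wav_file", "reply_0001.wav"}, {"text", f.at("text")}};
    });
    // A stand-in for the dialogue manager asking for synthesis.
    Frame reply = hub.request("synthesize", {{"text", "Vlak odchadza o 8:15."}});
    std::cout << "Audio to play: " << reply.at("wav_file") << "\n";
}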

2.1. Telephony module

The telephony (audio) server connects the whole system to the telecommunication network. A direct (broker) connection between the audio server and the ASR or TTS server is established to transmit speech data and thus reduce the network load. The telephony server is written in C++ and supports the telephone hardware (Dialogic D120/41JCTLSEuro).

2.2. Automatic Speech Recognition (ASR)

The development of a reliable and fast speech recognizer is not an easy task. Fortunately, several speech recognizers are available for non-profit research. We have adapted two well-known speech recognizers as the ASR module of our system. The first one is ATK, the on-line version of HTK [5]. The ATK-based ASR module was adapted for our SLDS running on a Windows-only platform as well as on a mixed Windows/Linux platform; in the latter case the ASR module runs on a separate PC with a Linux OS. The second speech recognizer we adapted for our system is Sphinx-4, written in Java [12]. Both ASR modules provide similar results. The SpeechDat-SK [8] and MobilDat-SK [9] databases were used for training the HMMs. Context-dependent (triphone) acoustic models were trained in a training procedure compatible with "refrec" [6]. Dynamic speech recognition grammars and lexicons are used in the speech recognizers.

MobilDat-SK is a speech database containing recordings of 1100 speakers recorded over the mobile (GSM) telephone network. It serves as an extension of the SpeechDat-E Slovak database [8] and was therefore designed to follow the SpeechDat specification [7] as closely as possible. It is balanced according to the age, regional accent and gender of the speakers and the type of background noise. Every speaker recorded 50 files (either prompted or spontaneous) containing numbers, names, dates, money amounts, embedded command words, geographical names, phonetically balanced words, phonetically balanced sentences, yes/no answers and one longer non-mandatory spontaneous utterance.
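As an aside on the context-dependent (triphone) acoustic models mentioned above, the short C++ sketch below illustrates the usual HTK-style expansion of a phone sequence into "left-centre+right" triphone labels. The function and the example transcription are purely illustrative and are not taken from the project's training scripts.

#include <iostream>
#include <string>
#include <vector>

// Expand a phone sequence into HTK-style triphone labels ("l-c+r").
// Boundary phones keep a reduced context, which mirrors the usual convention
// when training context-dependent models.
std::vector<std::string> toTriphones(const std::vector<std::string>& phones) {
    std::vector<std::string> out;
    for (size_t i = 0; i < phones.size(); ++i) {
        std::string label = phones[i];
        if (i > 0)                 label = phones[i - 1] + "-" + label;
        if (i + 1 < phones.size()) label = label + "+" + phones[i + 1];
        out.push_back(label);
    }
    return out;
}

int main() {
    // Hypothetical SAMPA-like transcription of a short Slovak word.
    for (const auto& t : toTriphones({"v", "l", "a", "k"}))
        std::cout << t << " ";   // prints: v+l v-l+a l-a+k a-k
    std::cout << "\n";
}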

2.3. Dialogue manager (DM)

There are many approaches to building a dialogue manager unit and many languages in which to write it. The Voice Extensible Markup Language (VoiceXML) is a markup language for creating voice user interfaces that use automatic speech recognition (ASR) and text-to-speech synthesis (TTS). VoiceXML simplifies speech application development, enables distributed application design and accelerates the development of interactive voice response (IVR) environments. Last but not least, it is widespread in many commercial applications across a diverse set of industries. For these reasons we decided to base the dialogue manager unit on VoiceXML interpretation.

The fundamental components of the DM module are the VoiceXML interpreter, the XML parser and the ECMAScript unit. The VoiceXML interpreter interprets VoiceXML commands. They are not interpreted sequentially from top to bottom; their execution is precisely described by the VoiceXML algorithms: the Document Loop Algorithm, the Form Interpretation Algorithm, the Grammar Activation Algorithm, the Event Handling Algorithm and the Resource Fetching Algorithm. The XML parser creates a hierarchical image of the VoiceXML application from the VoiceXML code and parses XML grammars and XML content from Internet locations. The ECMAScript unit supports the scripting elements of VoiceXML and enables writing ECMAScript code in VoiceXML documents; the unit is based on the SpiderMonkey (JavaScript-C) engine. The dialogue manager is written in C++. The interpreter performs all fundamental algorithms of the VoiceXML language and service functions for all VoiceXML commands. It supports the full range of VoiceXML 1.0 functions; the goal for the future is full support of VoiceXML 2.0.

2.4. Back-end module

After an analysis of various existing approaches to information retrieval from the web, and of the task to be carried out by the information server in our pilot applications, we came to the conclusion that no complicated data retrieval system is needed. On the contrary, a rule-based ad-hoc application searching only several predefined web servers with a relatively well-known page structure does a much better job. As the number of web servers giving a detailed weather forecast for Slovakia, as well as the number of web servers providing information on train and bus connections in Slovakia, is very limited, we checked several selected servers for stability and information reliability for about a month and then chose several candidate servers from which the information is retrieved.

The information server (backend server) retrieves the information contained on the suitable web pages according to the requests of the Dialogue Manager (DM), extracts the needed data, analyzes them and, if they are considered valid, returns the data in XML format to the DM. If the backend server fails to get valid data from one web source, it switches to a second wrapper retrieving the information from a different web server. The web wrapper is responsible for the navigation through the web server, data extraction from the web pages and their mapping onto a structured format (XML) convenient for further processing. In most cases a wrapper is specially designed for one source of data; thus, to combine data from different sources, several wrappers must be designed. Wrappers are designed to be as robust against changes in the web page structure as possible. Nevertheless, in the case of a substantial change of the web page design, adaptation of the wrapper would probably be inevitable. To speed up the system (to eliminate the influence of the long reaction times of the web pages) and to assure dropout resistance while keeping the information as current as possible, automatic periodic download and caching of the web page content were introduced. The server is open to future applications: web wrappers for new services can be created and added to the existing wrappers for the weather forecast and the train timetable. The information on the currently accessible wrappers is stored in the system's configuration file.
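The fallback behaviour described above can be sketched as a small interface: each wrapper answers for one web source and the backend tries the configured wrappers in order. The C++ code below is only an illustrative sketch under that assumption; the type names, the query format and the dummy forecast data are hypothetical and do not come from the project's implementation.

#include <iostream>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical wrapper interface: each wrapper knows one web source and
// returns the extracted data as an XML string, or nothing on failure.
struct WebWrapper {
    virtual ~WebWrapper() = default;
    virtual std::optional<std::string> fetch(const std::string& query) = 0;
};

// The backend tries the configured wrappers in order and returns the first
// valid result, mirroring the fallback to a second source.
class Backend {
public:
    void addWrapper(std::unique_ptr<WebWrapper> w) { wrappers_.push_back(std::move(w)); }
    std::optional<std::string> retrieve(const std::string& query) {
        for (auto& w : wrappers_) {
            if (auto xml = w->fetch(query)) return xml;   // first valid answer wins
        }
        return std::nullopt;                              // all sources failed
    }
private:
    std::vector<std::unique_ptr<WebWrapper>> wrappers_;
};

// A stand-in wrapper that would normally parse a (cached) weather page.
struct WeatherWrapperA : WebWrapper {
    std::optional<std::string> fetch(const std::string& city) override {
        return "<forecast city=\"" + city + "\">sunny, 24 C</forecast>";  // dummy data
    }
};

int main() {
    Backend backend;
    backend.addWrapper(std::make_unique<WeatherWrapperA>());
    if (auto xml = backend.retrieve("Bratislava"))
        std::cout << *xml << "\n";
}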

3. TTS

TTS plays a very important role in a SLDS. The voice that the TTS produces directly addresses the only sense that the SLDS engages: hearing. The overall impression of the SLDS therefore stands or falls on the quality of the synthesized speech. The synthesized speech is not just the output of the TTS module; it represents all the processing performed by all the modules since the user's input. Synthesis of poor quality considerably diminishes the probability that the SLDS will be used again.

The quality of speech synthesis is measured by two factors: intelligibility and naturalness. For the proper and satisfying functioning of a SLDS, a TTS producing intelligible speech is sufficient. However, the trends in speech synthesis over the last two decades head towards suppressing robotic-style speech. The main challenge for TTS within a SLDS is then the trade-off between intelligibility and naturalness. Intelligible speech can be produced with ease by a diphone-like synthesizer; however, this type of synthesizer is not capable of making the speech sound natural. Human-like synthetic speech may be achieved by corpus-based synthesis, which, on the other hand, has problems with intelligibility. Nevertheless, in contrast to the naturalness problems of diphone synthesis, these problems are solvable. One way of solving this matter is to aim the corpus of the synthesizer at the domain in which the SLDS is to be used. Such an approach suggests itself especially for SLDSs, since not just the application but the complete dialogue may be known ahead of time. Unfortunately, that brings up another trade-off: the trade-off between intelligibility and the universality of the TTS. Taken to the extreme, a corpus made up of whole sentences from the intended dialogue results in the best synthesis, but each time a SLDS provider changes the dialogue, recording new sentences is inevitable. The only way to avoid recording is to have all possible sentences from the dialogue domain in the corpus, which is of course utopian.

In our synthesizer we tried to cover all of the possible domain words. After a lot of data mining, conjugating, declining and other grammatical processing of the words, we finally gathered a rather big set of words. Since we did not want to record these words separately but within phrases, which we needed to generate artificially, we had to shrink the set. The only possibility was to step down one level and cover all of the syllables occurring in the set of words. A greedy algorithm was used to choose the best covering words. Finally, these words were manually joined into phrases that were recorded [1]. Still, even with most of the syllables covered this way, the universality of such a TTS was not guaranteed. A smaller basic speech unit had to be established, and the diphone proved to be a good candidate even for this type of synthesis [2].

We implemented two unit selection algorithms that create the final sequences of diphones for the synthesis [3]. Since the module for high-level synthesis is not completed yet, both algorithms use just one criterion: the minimum number of artificial concatenations in the synthetic signal. In other words, combinations of the longest units are preferred. The first algorithm proceeds from the longest units to the shortest ones, first querying the corpus for the whole phrase, then for its sub-phrases and finally for the diphones. The second algorithm starts from the shortest units: it queries the corpus for all of the diphones, puts them into a lattice of candidates and finally uses a Viterbi search to find the path through the lattice with the minimum number of concatenations. Since we plan to work with more parameters than the number of concatenations in the future, we eventually used the second algorithm in the synthesizer: contrary to the first algorithm, it is capable of evaluating multiple parameters in parallel, and when these are weighted properly, an optimal sequence of speech units with regard to the target specification may be achieved.

The examination of the corpus showed that not all types of diphones were present in the database. Since we want the TTS to be able to synthesize any input, not just the intended domain dialogue, the unit selection algorithm was enriched by a method of substituting missing diphones with phonemes. The method introduces five types of basic synthesis units, which it combines in the final target specification for the selection algorithm [4]. The advantage of using the defined types is that some missing diphones may be substituted by expanding the borders of neighboring diphones. This may save a couple of concatenations and so enhance the synthesis quality, particularly at word and phrase boundaries.

To select units from the corpus, all of the phonemes and diphones in the database must be labeled. The correctness of this labeling process heavily influences the quality of the synthetic speech, so it is the crucial part of the corpus design. Since the recorded corpus is one hour long, we abandoned any attempt at manual labeling. Instead we used an automatic procedure within the ATK system, based on triphone HMMs trained on the recorded corpus.

One of the drawbacks of the realized corpus-based synthesis is its long response time when synthesizing long phrases. We were therefore forced to implement a cache for them. The target specification of each long phrase is first looked up in the cache; if it is found there, no synthesis takes place and the samples are simply copied from the cache file. Particularly in a SLDS, where many phrases from a dialogue repeat often, such a synthesis speed-up is more than suitable.
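To illustrate the second unit selection algorithm, the following C++ sketch performs a lattice search that minimizes the number of concatenations; joining two candidates that are adjacent in the same corpus utterance costs nothing, any other join counts as one artificial concatenation. The data structures and candidate positions are invented for illustration, only the minimum cost is computed (backpointers for recovering the chosen units are omitted), and no other target costs are modelled, unlike the real synthesizer.

#include <algorithm>
#include <climits>
#include <iostream>
#include <vector>

// One occurrence of a diphone in the corpus (illustrative structure).
struct Candidate { int utterance; int position; };

// For a sequence of target diphones, find the candidate path with the fewest
// artificial concatenations (Viterbi-style dynamic programming over the lattice).
int minConcatenations(const std::vector<std::vector<Candidate>>& lattice) {
    if (lattice.empty()) return 0;
    std::vector<int> cost(lattice[0].size(), 0);           // best cost per candidate so far
    for (size_t t = 1; t < lattice.size(); ++t) {
        std::vector<int> next(lattice[t].size(), INT_MAX);
        for (size_t j = 0; j < lattice[t].size(); ++j)
            for (size_t i = 0; i < lattice[t - 1].size(); ++i) {
                const Candidate& a = lattice[t - 1][i];
                const Candidate& b = lattice[t][j];
                int join = (a.utterance == b.utterance &&
                            b.position == a.position + 1) ? 0 : 1;
                next[j] = std::min(next[j], cost[i] + join);
            }
        cost = next;
    }
    return *std::min_element(cost.begin(), cost.end());
}

int main() {
    // Three target diphones with invented candidate positions in the corpus.
    std::vector<std::vector<Candidate>> lattice = {
        {{1, 10}, {2, 5}},    // candidates for the first diphone
        {{1, 11}, {3, 7}},    // candidates for the second diphone
        {{3, 8},  {1, 40}}    // candidates for the third diphone
    };
    std::cout << "concatenations: " << minConcatenations(lattice) << "\n";  // prints 1
}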

The flow of the TTS process within our SLDS is shown in Fig. 2. It starts with a request for synthesis coming from the HUB server. The request contains the text to be synthesized. First, the text is processed by the high-level synthesis module, which substitutes all non-standard characters and transcribes the text into phonetic notation. The phonetic transcription is done either by a dictionary or by data-driven rules. The phonetic notation (SAMPA alphabet) is the target synthesis specification for the low-level synthesis module. So far the target specification is simple, but in the near future we will add all of the parameters [4] that the unit selection algorithm and the corpus annotation support. In the current version of the low-level synthesis module, the best speech unit sequence is given by the minimum number of concatenations only; when the numbers of concatenations are equal, the lowest distortion of F0 or CLPC coefficients at the concatenation points decides. Finally, the synthesized signal is stored in an audio file and sent via a broker channel to the audio server, which plays it to the user.
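The phrase cache mentioned above can be pictured as a map from the target specification to a previously synthesized audio file. The C++ sketch below is only an assumed, simplified illustration of that idea; the class, the file-naming scheme and the stand-in synthesis function are hypothetical and do not reproduce the actual module.

#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical phrase cache: the key is the target specification (the SAMPA
// string produced by high-level synthesis), the value is the name of a wav
// file synthesized for it earlier.
class PhraseCache {
public:
    // Return a cached wav file for this target specification, or synthesize
    // the phrase once, remember the result and return it.
    std::string getOrSynthesize(const std::string& sampaSpec) {
        auto it = cache_.find(sampaSpec);
        if (it != cache_.end()) return it->second;          // cache hit: no synthesis
        std::string wav = runLowLevelSynthesis(sampaSpec);  // cache miss: synthesize once
        cache_[sampaSpec] = wav;
        return wav;
    }
private:
    // Stand-in for the unit selection and concatenation step (illustrative only).
    std::string runLowLevelSynthesis(const std::string& spec) {
        return "phrase_" + std::to_string(cache_.size()) + ".wav";
    }
    std::unordered_map<std::string, std::string> cache_;
};

int main() {
    PhraseCache cache;
    std::cout << cache.getOrSynthesize("v l a k   o d x a: dz a") << "\n";  // synthesized
    std::cout << cache.getOrSynthesize("v l a k   o d x a: dz a") << "\n";  // served from cache
}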

Fig. 2. TTS process; the numbers at the arrows denote the order of communication. (The figure shows the HUB, the IRKR-GALAXY wrapper, the high-level synthesis module with processing of numerals and abbreviations and dictionary-based as well as data-driven phonetic transcription, the low-level synthesis module with the corpus/SDB, the produced audio file and the audio server; text, SAMPA parameters, the name of the wav file and the samples from the wav file are passed between them.)

4. ACKNOWLEDGMENT

This work was supported by the Slovak Grant Agency VEGA under grant No. 1/3110/06 and by the Ministry of Education of the Slovak Republic under research project No. 2003 SP 20 028 01 03.


5. CONCLUSION

The goal of this paper was to present the first SLDS realized for the Slovak language, in which we successfully joined state-of-the-art free resources with our own research into a functional system. We also shared our experience of using corpus-based speech synthesis in its TTS module. Although this kind of synthesis for Slovak is still very young, the synthesized speech for a limited domain sounds promising. The possibility of further development of the SLDS with the use of a corpus synthesizer is tested via the two pilot applications, "Weather forecast for Slovakia" and "Timetable of Slovak Railways".


REFERENCES

[1] Páleník, A., Čepko, J., Rozinaj, G., "Design of Slovak Speech Database for Limited Domain TTS within an IVR System", in Proc. 48th International Symposium ELMAR-2006 focused on Multimedia Signal Processing and Communications, Zadar, Croatia, 7-9 June 2006.
[2] Conkie, A. D., "Robust Unit Selection System for Speech Synthesis", Joint Meeting of ASA, EAA, and DAGA, paper 1PSCB_10, Berlin, Germany, 15-19 March 1999.
[3] Čepko, J., Rozinaj, G., "Corpus synthesis of Slovak speech", in Proc. 46th International Symposium Electronics in Marine (ELMAR 2004), Zadar, Croatia, 2004, pp. 313-318.
[4] Čepko, J., Turi Nagy, M., Rozinaj, G., "Low-Level Synthesis of Slovak Speech in S2 Synthesizer", 5th EURASIP Conference focused on Speech and Image Processing, Multimedia Communications and Services, Smolenice, Slovak Republic, 2005.
[5] Young, S., "ATK: An Application Toolkit for HTK, version 1.3", Cambridge University, January 2004.
[6] Lihan, S., Juhár, J., Čižmár, A., "Crosslingual and Bilingual Speech Recognition with Slovak and Czech SpeechDat-E Databases", in Proc. Interspeech 2005, Lisbon, Portugal, September 2005, pp. 225-228.
[7] Winski, R., "Definition of corpus, scripts and standards for fixed networks", Technical report, SpeechDat-II, Deliverable SD1.1.1, Workpackage WP1, January 1997.
[8] Pollak, P., Cernocky, J., Boudy, J., Choukri, K., Rusko, M., Trnka, M., et al., "SpeechDat(E) – Eastern European Telephone Speech Databases", in Proc. LREC 2000 Satellite Workshop XLDB – Very Large Telephone Speech Databases, Athens, Greece, May 2000, pp. 20-25.
[9] Rusko, M., Trnka, M., Darjaa, S., "MobilDat-SK – A Mobile Telephone Extension to the SpeechDat-E SK Telephone Speech Database in Slovak", in press: SPECOM 2006, St. Petersburg, Russia, July 2006.
[10] Darjaa, S., Rusko, M., Trnka, M., "Three Generations of Speech Synthesis Systems in Slovakia", in press: SPECOM 2006, St. Petersburg, Russia, July 2006.
[11] Gladišová, I., Doboš, Ľ., Juhár, J., Ondáš, S., "Dialog Design for Telephone Based Meteorological Information System", in Proc. DSP-MCOM 2005, Košice, Slovakia, September 2005, pp. 151-154.
[12] Mirilovič, M., Lihan, S., Juhár, J., Čižmár, A., "Slovak speech recognition based on Sphinx-4 and SpeechDat-SK", in Proc. DSP-MCOM 2005, Košice, Slovakia, September 2005, pp. 76-79.
[13] Ondáš, S., Juhár, J., "Dialogue manager based on the VoiceXML interpreter", in Proc. DSP-MCOM 2005, Košice, Slovakia, September 2005, pp. 80-83.
[14] http://www.sls.csail.mit.edu/sls/technologies/galaxy.shtml