Using TTS for Fast Prototyping of Cross-Lingual ASR Applications

Jan Nouza and Marek Boháč
Institute of Information Technology and Electronics, Technical University of Liberec
Studentská 2, 461 17 Liberec, Czech Republic
{jan.nouza,marek.bohac}@tul.cz

Abstract. In this paper we propose a method that simplifies the initial stages in the development of speech recognition applications that are to be ported to other languages. The method is based on cross-lingual adaptation of the acoustic model. In the search for an optimal mapping between the target and original phonetic inventories, we utilize data generated in the target language by a high-quality TTS system. The data is analyzed by an ASR module that serves as a partly restricted phoneme recognizer. We demonstrate the method on the Czech-to-Polish adaptation of two prototype systems, one aimed at handicapped persons and the other prepared for fluent dictation with a large vocabulary.

Keywords: Speech recognition, speech synthesis, cross-lingual adaptation.

1 Introduction

As the number and variety of voice technology applications increases, the demand to port them to other languages becomes acute. One of the crucial issues in localizing already developed products for another language is the cost of the transfer. In ASR (Automatic Speech Recognition) systems, the major costs are related to the adaptation of their two language-dependent layers: the acoustic-phonetic one and the linguistic one. Usually, the latter task is easier to automate because it is based on statistical processing of texts, which are now widely available in digital form (e.g. on the internet [1]). The former task takes significantly more human work, since it requires a large amount of annotated speech recordings and some deeper phonetic knowledge. These costs may be prohibitive if we aim at porting applications for special groups of clients, such as handicapped persons, where the number of potential users is small and the price of the products should be kept low.

The research described in this paper has had three major goals. First, we were asked to transfer the voice tools developed for Czech handicapped persons to similar target groups in countries where these tools are not available. Second, we wanted to find a methodology that would make the transfer as rapid and cheap as possible. And third, we wished to explore the limits of the proposed approach to see whether it is also applicable to more challenging tasks.

Our initial aim was to enable porting the MyVoice and MyDictate tools to other (mainly Slavic) languages. The two programs were developed in our lab between 2004 and 2006.


They enabled Czech motor-handicapped people to work with a PC in a hands-free manner, with a large degree of flexibility and customization [2]. Very soon, a demand to port MyVoice to Slovak occurred, and a few years later the software was also transferred to Spanish. The adaptation of the acoustic-phonetic layer of the MyVoice engine was done in a simple and straightforward way, by mapping the phonemes of the target language to the original Czech ones [3]. In both cases, the mapping was conducted by experts who knew the phonetics of the target and original languages. As the demand for porting the voice tools to several other languages increases, we are searching for an alternative approach in which the expert can be (at least partly) replaced by a machine.

In this paper, we investigate a method in which a TTS system together with an ASR-based tool tries to play the role of a 'skilled phonetician' whose aim is to find the optimal acoustic-phonetic mapping. The approach has been proposed and successfully tested on Polish. Our experiments show that the scheme yields promising results not only for small-vocabulary applications but also for a more challenging task, such as fluent dictation of professional texts.

In the following sections, we briefly introduce the ASR systems developed for Czech. Then we focus on the issues related to their transfer to Polish. We describe the main differences between the two languages on the phonetic level and propose a method that utilizes the output of a Polish TTS system to create an objective mapping of Polish orthography onto the Czech phonetic inventory. The proposed solution is simple and cheap because it requires neither human-made recordings nor an expert in phonetics, and yet it appears applicable in the desired area.

2 ASR Systems Developed for the Czech Language

During the last decade we have developed two types of ASR engines, one for voice-command input and discrete-speech dictation and another for fluent speech recognition with very large vocabularies.

The former proved to be useful mainly in applications where robust hands-free performance is the highest priority. This is the case, for example, of voice-controlled aids developed for motor-handicapped people. Voice commands and voice typing can help them very much if the system is reliable, flexible, customizable and does not require high-cost computing power. The speed of typing, on the other hand, is lower, but this is not the crucial factor. The engine we have developed can operate with small vocabularies (tens or hundreds of commands) as well as with very large lexicons (up to 1 million words). It has recently been used in the MyVoice tool and in the MyDictate program. Both programs can run not only on low-cost PCs but also on mobile devices [4].

The latter engine is a large-vocabulary continuous-speech recognition (LVCSR) decoder. It has been developed for voice dictation and speech transcription tasks with regard to the specific needs of highly inflected languages, like Czech and other Slavic languages [5]. The recent version operates in real time with lexicons of up to 500 thousand words.

Both engines use the same signal processing and acoustic modeling core. A speech signal is sampled at 16 kHz and parameterized every 10 ms into 39 MFCC features per frame. The acoustic model (AM) employs continuous-density HMMs (CDHMMs) that are based either on context-independent phonetic units (monophones) or on context-dependent triphones.


The latter yield slightly better performance, though the former are more compact, require less memory and are more robust against pronunciation deviations. The last aspect is especially important if we consider using the AM for speech recognition in another language.

The linguistic part of the systems consists of the lexicon (which can be general or application-oriented) and the corresponding language model (LM). In the simpler systems the LM has the form of a fixed grammar; in the dictation and transcription systems it is based on bigrams.

The final applications (e.g. the programs MyVoice, MyDictate and FluentDictate) have been developed for Czech. Yet, the engines themselves are language-independent. If the above programs are to be used in another language, we need to provide them with a new lexicon, a corresponding LM and an AM that fits the coding used for the pronunciation part of the lexicon.
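To make the front-end concrete, the following is a minimal Python sketch of a 39-dimensional MFCC parameterization (16 kHz sampling, 10 ms frame shift) written with the librosa library. The window length and the 13 static + delta + delta-delta layout are our assumptions for illustration, since the exact configuration of the engines is not detailed above.

# Sketch of a 39-dimensional MFCC front-end: 13 static coefficients plus delta and
# delta-delta, computed every 10 ms from 16 kHz audio. Window length (25 ms) and
# other settings are assumptions, not the engines' documented configuration.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    signal, _ = librosa.load(wav_path, sr=sr)                 # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                hop_length=int(0.010 * sr),   # 10 ms frame shift
                                n_fft=int(0.025 * sr))        # 25 ms window (assumed)
    delta = librosa.feature.delta(mfcc)                       # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)             # second derivatives
    return np.vstack([mfcc, delta, delta2]).T                 # shape: (frames, 39)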

3 Case Study: Adapting an ASR System to the Polish Language

In the following part, we present a method that allows us to adapt the acoustic model of an ASR system to a new language. Its main benefit is that only a minimal amount of speech data needs to be recorded and annotated for the target language. Instead of recording human-produced speech, we employ data generated by a high-quality TTS system. Then, we analyze it with an ASR system in order to find the optimal mapping between the phonemes of the target language and the phonetic inventory of the original acoustic model. Moreover, the ASR system serves as an automatic transducer that proposes and evaluates the rules for transcribing the orthographic form of words in the target language into pronunciation forms based on the phonemes of the original language.

It is evident that an AM created for the new language with the above approach cannot perform as well as an AM trained directly on target-language data. Therefore, we need to evaluate how good the adapted AM is. For this purpose we utilize the TTS again. This time we employ it as a generator of test data, perform speech recognition tests and compare the results with those achieved for the same utterances produced by human speakers. In the next sections, we illustrate the method on a case study in which two Czech ASR systems have been adapted to Polish.

3.1 Czech vs. Polish Phonology

Czech and Polish belong to the same West Slavic branch; however, they differ significantly on the lexical as well as the phonetic level. The phonetic inventory of Czech consists of 10 vowels (5 short and 5 long ones, plus a very rare schwa) and 30 consonants. All are listed in Table 1, where each phoneme is represented by its SAMPA symbol [6]. (In this text, we prefer the SAMPA notation to IPA because it is easier to type and read.) Polish phonology [8] recognizes 8 vowels and 29 consonants. Their list, with SAMPA symbols [9], is given in Table 2. By comparing the two tables we can see that there are 3 vowels (I, e~, o~) and 5 consonants (ts', dz', s', z', w) that are specific to Polish. All the other phonemes have their counterparts in Czech. (Note that the symbol n' used in Polish SAMPA is equivalent to J in Czech SAMPA.)


Table 1. Czech phonetic inventory

Groups            SAMPA symbols
Vowels (11)       a, e, i, o, u, a:, e:, i:, o:, u:, @ (schwa)
Consonants (30)   p, b, t, d, c, J\, k, g, ts, dz, tS, dZ, f, v, s, z, S, Z, X, h\, Q\, P\, j, r, l, m, n, N, J, F

Table 2. Polish phonetic inventory

Groups            SAMPA symbols
Vowels (8)        a, e, i, o, u, I, e~, o~
Consonants (29)   p, b, t, d, k, g, ts, dz, tS, dZ, f, v, s, z, S, Z, X, ts', dz', s', z', w, j, r, l, m, n, N, n' (equivalent to Czech J)
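The comparison of the two inventories can be reproduced with a few lines of code. The Python sketch below is our illustration: it lists the SAMPA sets from Tables 1 and 2 and computes the Polish-specific phonemes, with the Polish n' written directly as its Czech equivalent J.

# SAMPA inventories copied from Tables 1 and 2; the set difference yields the
# 8 Polish-specific phonemes: I, e~, o~, ts', dz', s', z', w.
czech = {"a", "e", "i", "o", "u", "a:", "e:", "i:", "o:", "u:", "@",
         "p", "b", "t", "d", "c", "J\\", "k", "g", "ts", "dz", "tS", "dZ",
         "f", "v", "s", "z", "S", "Z", "X", "h\\", "Q\\", "P\\",
         "j", "r", "l", "m", "n", "N", "J", "F"}

polish = {"a", "e", "i", "o", "u", "I", "e~", "o~",
          "p", "b", "t", "d", "k", "g", "ts", "dz", "tS", "dZ",
          "f", "v", "s", "z", "S", "Z", "X",
          "ts'", "dz'", "s'", "z'", "w",
          "j", "r", "l", "m", "n", "N", "J"}   # n' written as Czech J

print(sorted(polish - czech))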

3.2 How to Map Polish Phonemes to the Czech Phoneme Inventory?

In our previous research on cross-lingual adaptation, we transferred a Czech ASR system to Slovak [10] and to Spanish [3]. In both cases, the Czech acoustic model was used and the language-specific phonemes were mapped to the Czech ones. The mapping was designed by experts who knew the original and the target language. An alternative to this expert-driven method is a data-driven approach, e.g. the one described in [11], where the similarity between phonemes in two languages is measured by the Bhattacharyya distance. However, that method requires quite a lot of recorded and annotated data in both languages. In this paper, we propose an approach where the data for the target language is generated by a TTS system and the mapping is controlled by an ASR system. The main advantage is that the data can be produced automatically, on demand and in the amount and structure that is needed.

3.3 Phonetic Mapping Based on TTS Output Analyzed by an ASR System

The key component is a high-quality TTS system. For Polish, we have chosen the IVONA software [12]. It employs an algorithm that produces almost natural speech by concatenating properly selected units from a large database of recordings. The software has won several awards in TTS competitions [13, 14]. It currently offers 4 different voices (2 male and 2 female), which, for our purpose, introduces an additional degree of voice variety. The software can be tested via its web pages [12]. Any text typed into its input box is immediately converted into an utterance.

The second component is an ASR system operating with the given acoustic model (the Czech one in this case). It is arranged so that it works as a partly restricted phoneme recognizer. The ASR module takes a recording, transforms it into a series of feature vectors X = x(1), ..., x(t), ..., x(T) and outputs the most probable sequence of phonemes p1, p2, ..., pN. The output includes the phonemes, their times and their likelihoods. The module is called with several parameters, as shown in the example below:


Recording_name:      maslo-Ewa.wav
Recorded_utterance:  masło
Pronunciation:       mas?o
Variants:            ?= u  l  uv

In the above example, the recognizer takes the given sound file, processes it and evaluates which of the proposed pronunciations fits best. The output looks like this:

1. masuo   -  avg. likelihood = -77.417
2. masuvo  -  avg. likelihood = -77.956
3. maslo   -  avg. likelihood = -78.213

We can see that for the given recording and the given AM, it is the Czech phoneme 'u' that fits best to the Polish letter 'ł' (and the corresponding phoneme 'w'). The module also provides rich information from the phonetic decoding process (phoneme boundaries, likelihoods in frames, etc.), which can be used for detailed study, as shown in Fig. 1.

Fig. 1. Diagrams showing log likelihoods in frames of speech generated by the TTS system (voices Ewa, Jacek, Jan, Maja). Different pronunciation variants of the Polish letter 'ł' in the word 'masło' can be compared.
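The variant selection illustrated above and in Fig. 1 is easy to script. The sketch below is our illustration, not the actual tool: it assumes a hypothetical function score_pronunciation(wav_path, phones) that wraps the restricted phoneme recognizer and returns the average log-likelihood of a proposed pronunciation, and it tallies the winning variant across the four TTS voices, i.e. the procedure applied to the word set described next.

# Pick the best-scoring pronunciation variant per recording, then vote across voices.
# score_pronunciation() is a hypothetical wrapper around the recognizer shown above.
from collections import Counter

def pick_variant(wav_path, template, candidates, score_pronunciation):
    # template contains the wildcard '?', e.g. "mas?o"; candidates e.g. ["u", "l", "uv"]
    scores = {c: score_pronunciation(wav_path, template.replace("?", c))
              for c in candidates}
    return max(scores, key=scores.get)

def vote_across_voices(word, template, candidates, voices, score_pronunciation):
    votes = Counter(
        pick_variant(f"{word}-{voice}.wav", template, candidates, score_pronunciation)
        for voice in voices)
    return votes.most_common(1)[0][0]

# Usage with the hypothetical file naming from the example above:
# best = vote_across_voices("maslo", "mas?o", ["u", "l", "uv"],
#                           ["Ewa", "Jacek", "Jan", "Maja"], score_pronunciation)
# -> 'u', i.e. Polish 'ł' is mapped to the Czech phoneme 'u'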

Using the TTS software we recorded more than 50 Polish words, each spoken by the four available voices.


The words were selected so that all the Polish-specific phonemes occurred in various positions and contexts (e.g. at the start, in the middle and at the end of words, in specific phonetic clusters, etc.). For each word, we offered the phoneme recognizer several pronunciation alternatives to choose from. In most cases, the output of the recognizer was consistent in the sense that the same best phonemes were assigned to the Polish ones across the various words and the four TTS voices. In some cases, however, the mapping turned out to be context-dependent; e.g. Polish 'rz' was mapped either to Czech phoneme 'Z', 'Q\' or 'P\'. The results are summarized in Table 3. We can see that the resulting map covers not only the phoneme-to-phoneme relations but also the grapheme-to-phoneme conversion.

It is also interesting to compare these objectively derived mappings with subjective perception. Since Poland and the Czech Republic are neighboring countries, Czech people have many opportunities to hear spoken Polish and to use some Polish words, such as personal and geographical names. The subjective perception of some Polish-specific phonemes seems to differ from what was found by the objective investigation. For example, Czech people tend to perceive Polish 'I' as 'i' (the reason being that the letters 'i' and 'y' are pronounced in the same way in Czech, namely as 'i'). Also, the Polish letter pair 'rz' is usually considered equivalent to Czech 'ř', which is not always true. The method described above shows that the ASR machine (equipped with the given acoustic model) perceives these sounds differently. In any case, this perception is the relevant one, because it is the ASR system that has to perform the recognition task.

Table 3. Polish orthography and phonemes mapped to the Czech phonetic inventory

Letter(s) in Polish orthography   Polish phoneme(s) (SAMPA)   Mapping to Czech phoneme(s) (SAMPA)
y                                 I                           e, (schwa)
ó                                 u                           u
ę                                 e~                          e+n, (e+N)
ą                                 o~                          o+n, (o+N)
dz                                dz                          dz
ź / z(i)                          z'                          Z
ś / s(i)                          s'                          S
dź / dz(i)                        dz'                         dZ
ć / c(i)                          ts'                         tS
ż                                 Z                           Z
rz                                Z                           Z (Q\ or P\ in clusters trz, drz)
sz                                S                           S
dż                                dZ                          dZ
cz                                tS                          tS
ń / n(i)                          n'                          J
h, ch                             X                           X
ł                                 w                           u
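To show how the table translates into a pronunciation generator, the following Python sketch applies a representative subset of the rules as greedy, longest-match-first string rewriting. It is our simplification: the context-dependent clusters trz/drz, the regular Polish letter-to-phoneme rules not listed in Table 3 (e.g. w pronounced as v, c as ts) and proper exception handling are left out.

# Simplified grapheme-to-phoneme conversion for Polish based on a subset of the
# Table 3 rules (target symbols are Czech SAMPA). Rules are ordered longest-first
# and applied left to right; letters without a rule are kept unchanged, which is
# a simplification (e.g. Polish 'w' and 'c' would need their own rules).
RULES = [
    ("dzi", "dZ"), ("ch", "X"), ("cz", "tS"), ("sz", "S"), ("rz", "Z"),
    ("dż", "dZ"), ("dź", "dZ"), ("dz", "dz"),
    ("ci", "tS"), ("si", "S"), ("zi", "Z"), ("ni", "J"),
    ("ć", "tS"), ("ś", "S"), ("ź", "Z"), ("ż", "Z"), ("ń", "J"),
    ("ę", "e n"), ("ą", "o n"), ("ó", "u"), ("y", "e"), ("ł", "u"), ("h", "X"),
]

def g2p(word):
    phones, i = [], 0
    while i < len(word):
        for grapheme, phoneme in RULES:
            if word.startswith(grapheme, i):
                phones.append(phoneme)
                i += len(grapheme)
                break
        else:                          # no rule matched: keep the letter unchanged
            phones.append(word[i])
            i += 1
    return " ".join(phones)

print(g2p("masło"))    # -> m a s u o
print(g2p("rzeka"))    # -> Z e k a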


3.4 Evaluation on a Small-Vocabulary Task

The first task in which we tested the proposed method and evaluated the resulting mapping was Polish voice-command control, the same as in the MyVoice tool. The basic lexicon in this application consists of 256 commands, such as names of letters, digits, keys on the PC keyboard, mouse actions, names of computer programs, etc. These commands were translated into Polish, their pronunciations were created automatically using the rules in Table 3, and after that they were recorded by the IVONA TTS (all four voices) and by two Polish speakers. All the recordings were passed to the MyVoice ASR module operating with the original Czech AM. The experiment was meant to show how well this cross-lingual application performs and whether there is a significant difference between the recognition of synthetic and human speech. It also allowed us to compare the objectively derived mapping with the subjective phoneme conversion mentioned in Section 3.3. The results are included in Table 4. We can observe that the performance measured by the Word Recognition Rate (WRR) is high both for the TTS data and for the human speakers. The results are comparable to those achieved for Czech, Slovak and Spanish [3].

3.5 Evaluation on Fluent-Speech Dictation with a Large Lexicon

The second task was to build a very preliminary version of a Polish voice dictation system for radiology. In this case, we used data (articles, annotations, medical reports) available online at [15]. We collected a small corpus (approx. 2 MB) of radiology texts and created a lexicon of the 23,060 most frequent words. Their pronunciations were derived automatically using the rules in Table 3. The bigram language model was computed on the same corpus. To test the prototype system, we selected three medical reports not included in the training corpus. They were recorded again by the IVONA software (four times, with the four different voices) and by two native speakers. The results from this experiment are also part of Table 4. The WRR values are about 8 to 10 % lower than those of the Czech dictation system for radiology, but it should be noted that our main aim was to test the proposed fast-prototyping technique. The complete design of this demo system took just one week. It is also interesting to compare the results achieved with the TTS data to the human-produced ones. We can see that the TTS speech yielded slightly better recognition rates. This is not surprising, as we had already observed it in our previous investigations [16]. In any case, the TTS utterances can be used during the development process as a cheap source of benchmarking data.

Table 4. Results from speech recognition experiments in Polish

Task                                           Lexicon size   WRR [%]
Voice commands – TTS data                      256            97.8
Voice commands – human speech                  256            96.6
Fluent dictation (radiology) – TTS data        23,060         86.4
Fluent dictation (radiology) – human speech    23,060         83.7
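For completeness, the preparation of the lexicon and the raw bigram statistics for the radiology prototype (Section 3.5) can be scripted along the following lines. This is a minimal sketch under our own assumptions: simple regex tokenization, a hard frequency cut-off at 23,060 words and no language-model smoothing, none of which is specified above.

# Sketch: frequency-ranked lexicon and raw bigram counts from a plain-text corpus
# (here the approx. 2 MB radiology corpus). Tokenization, cut-off handling and
# smoothing are simplified assumptions.
import re
from collections import Counter

def build_lexicon_and_bigrams(corpus_path, lexicon_size=23060):
    with open(corpus_path, encoding="utf-8") as f:
        tokens = re.findall(r"[\w']+", f.read().lower())

    unigrams = Counter(tokens)
    lexicon = [w for w, _ in unigrams.most_common(lexicon_size)]

    in_lex = set(lexicon)
    bigrams = Counter((w1, w2) for w1, w2 in zip(tokens, tokens[1:])
                      if w1 in in_lex and w2 in in_lex)
    return lexicon, bigrams

# lexicon, bigrams = build_lexicon_and_bigrams("radiology_corpus.txt")  # hypothetical file name
# Pronunciations for the lexicon entries are then generated with the Table 3 rules (cf. g2p above).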


4 Discussion and Conclusions

The results of the two experiments show that the proposed combination of TTS data and ASR-driven mapping is applicable to the rapid prototyping of programs that are to be transferred to other languages. The TTS system for the target language should, of course, be of high quality, and it is an advantage if it offers multiple voices. If this is the case, we can obtain not only the required L2-to-L1 phonetic mapping but also a grapheme-to-phoneme conversion table that helps us generate pronunciations for the lexicon of the target application. Moreover, the TTS system can serve as a cheap source of test data needed for preliminary evaluations.

The results we obtained in the first experiment show that the created lexicon (with its automatically derived pronunciations) could be immediately used in the Polish version of the MyVoice software. Even though the internal acoustic model is Czech, we can expect the overall system performance to be at a level similar to that for Czech users. Most importantly, no Polish data needed to be recorded and annotated during the prototype development, so the whole process could be fast and cheap. Furthermore, we showed that the phonetic mapping generated via the combination of TTS and ASR systems leads to more objective and better results than those based on subjective perception.

In the second experiment we demonstrated that the same automated approach can also be utilized in a more challenging task, namely the initial phase of the development of a dictation system. Within a very short time we were able to create a Polish version of the program that can be used for demonstration purposes, for getting potential partners interested and for at least initial testing with future users.

Acknowledgments. The research was supported by the Grant Agency of the Czech Republic (grant no. 102/08/0707).

References

1. Vu, N.T., Schlippe, T., Kraus, F., Schultz, T.: Rapid Bootstrapping of Five Eastern European Languages Using the Rapid Language Adaptation Toolkit. In: Proc. of Interspeech 2010, Makuhari, Japan, pp. 865–868 (2010)
2. Cerva, P., Nouza, J.: Design and Development of Voice Controlled Aids for Motor-Handicapped Persons. In: Proc. of Interspeech 2007, Antwerp, pp. 2521–2524 (2007)
3. Callejas, Z., Nouza, J., Cerva, P., López-Cózar, R.: Cost-Efficient Cross-Lingual Adaptation of a Speech Recognition System. In: Advances in Intelligent and Soft Computing, vol. 57, pp. 331–338. Springer, Heidelberg (2009)
4. Nouza, J., Cerva, P., Zdansky, J.: Very Large Vocabulary Voice Dictation for Mobile Devices. In: Proc. of Interspeech 2009, Brighton, UK, pp. 995–998 (2009)
5. Nouza, J., Zdansky, J., Cerva, P., Silovsky, J.: Challenges in Speech Processing of Slavic Languages (Case Studies in Speech Recognition of Czech and Slovak). In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Development of Multimodal Interfaces, COST Seminar 2009. LNCS, vol. 5967, pp. 225–241. Springer, Heidelberg (2010)
6. Czech SAMPA, http://noel.feld.cvut.cz/sampa/
7. Nouza, J., Psutka, J., Uhlir, J.: Phonetic Alphabet for Speech Recognition of Czech. Radioengineering 6(4), 16–20 (1997)


8. Gussman, E.: The Phonology of Polish. Oxford University Press, Oxford (2007)
9. Polish SAMPA, http://www.phon.ucl.ac.uk/home/sampa/polish.htm
10. Nouza, J., Silovsky, J., Zdansky, J., Cerva, P., Kroul, M., Chaloupka, J.: Czech-to-Slovak Adapted Broadcast News Transcription System. In: Proc. of Interspeech 2008, Brisbane, Australia, pp. 2683–2686 (2008)
11. Kumar, S.C., Mohandas, V.P., Li, H.: Multilingual Speech Recognition: A Unified Approach. In: Proc. of Interspeech 2005, Lisboa, Portugal, pp. 3357–3360 (2005)
12. IVONA TTS system, http://www.ivona.com/
13. Kaszczuk, M., Osowski, L.: Evaluating Ivona Speech Synthesis System for Blizzard Challenge 2006. In: Blizzard Workshop, Pittsburgh (2006)
14. Kaszczuk, M., Osowski, L.: The IVO Software Blizzard 2007 Entry: Improving Ivona Speech Synthesis System. In: Sixth ISCA Workshop on Speech Synthesis, Bonn (2007)
15. http://www.openmedica.pl/
16. Vich, R., Nouza, J., Vondra, M.: Automatic Speech Recognition Used for Intelligibility Assessment of Text-to-Speech Systems. In: Esposito, A., Bourbakis, N.G., Avouris, N., Hatzilygeroudis, I. (eds.) HH and HM Interaction. LNCS (LNAI), vol. 5042, pp. 136–148. Springer, Heidelberg (2008)
