A Phonetic Lexicon for Adaptation in ASR for Austrian German

Micha Baum (1), Rudolf Muhr (2), Gernot Kubin (3)
(1) ftw (Vienna Telecommunications Research Center), Austria
(2) University of Graz, Austria
(3) University of Technology, Graz, Austria

Abstract
We present a phonetic lexicon for Austrian German, which was generated automatically from the canonic version of a German pronunciation dictionary. The lexicon is based on narrow transcription in SAMPA. Both the speech files and the canonic dictionary are taken from the SpeechDat-AT database. Since the recorded items are mainly read speech, the differences between the canonic form and the real pronunciations do not reflect local dialects but are rather influenced by a regional or national standard. The differences are due partly to the extension of the phoneme set and partly to the coexistence of multiple variants.

1. Introduction
One big challenge in Automatic Speech Recognition (ASR) is handling speaker variation in spoken language. These variations can be due to age, gender, educational level, mood and, of particular interest here, the national and regional variants of a pluricentric language. An evaluation study carried out by Philips Speech Processing showed that the word error rate for Austrian speakers could be reduced from 23% to 16% when Austrian acoustic models (AMOs) were used instead of acoustic models trained with German speakers. Another study, carried out by BAS (Bayerisches Archiv für Sprachsignale), showed in a perception experiment that Austrian speakers could be distinguished well from German or Swiss speakers [1]. The distinctness of Austrian speakers in terms of perception as well as acoustics indicates a need for work in this research area.

There have been studies of pronunciation in Austria before [2] [3] [4] [5]. But all these studies either treat the "highest level" of spoken language, i.e. a standard language, and/or deal with very specific phenomena (e.g. accent in suffixes). Besides, they are not based on a corpus big enough to represent the demographic distribution of Austria.

In practice the definition and handling of accent regions is an unsolved problem. A previous experiment showed that simply extending the phonetic lexicon to multiple pronunciations did not improve ASR [6]. In another experiment, which failed too, a set of dictionaries containing regionally clustered variants was built [7]. According to these two experiments, simply adapting the phonetic lexicon to a specific accent region is not enough to improve ASR. This might have been due to the low number of speakers, which could have led to statistically unreliable results. It must also be considered that the experiments were performed with regional variants within Germany but not with national variants of German. It is well known that national variants in particular yield big phonetic variations [8]. Besides, the adaptation on the phonetic level should be combined with other approaches such as the generation of AMOs for each defined accent region or the consideration of prosodic features.

This paper is about a phonetic lexicon which was generated on the basis of a corpus representing the Austrian population and which considers all phonetic environments realised in the available data.

2. The Data used for the Transcriptions
The SpeechDat-AT database [9] contains 57 items, of which 22 were chosen for the phonetic transcriptions. These were isolated digits, numbers, single read words and read sentences. Items of spontaneous speech were also included in the corpus, in which questions were answered such as "From which city do you call?" or "Please say your first name.". This resulted in a total of 8474 utterances by 1511 different speakers, which corresponds to a duration of 9 hours and 52 minutes. The speakers pronounced 36607 words, of which 7120 were distinct. We selected only speech files which did not contain any non-speech symbols, such as those for background noise, speaker noise or bad signal quality. The demographic distribution of the subset fulfills the SpeechDat criteria.

3. The Transcription
3.1. The Transcribers
The transcribers were a group of six advanced students of linguistics. Before the actual transcription they had to take a course in phonetics, not only to learn about SAMPA transcription (http://www.phon.ucl.ac.uk/home/sampa/german.htm), but also to guarantee consistency of the results.

3.2. The Transcription Tool
We used WWWTranscribe, which is a transcription tool based on the WWW [10]. We modified the source code in such a way that there were two windows on the screen. One of them shows the orthographic transcription, which is taken from the label files of the SpeechDat-AT database. The second window underneath shows a proposal for the phonetic transcription. This proposal is based on the dictionary from the SpeechDat-AT database, i.e. for every word of the orthographic transcription in the upper window, the corresponding entry of the dictionary is written in the lower window. All word boundaries are marked by a hash "#". Hence this window represents the canonic form. A button with a loudspeaker allows the transcriber to listen to the corresponding speech file. The transcribers could modify the suggested phonetic transcription by erasing and adding characters in the second window. Finally the finished transcription was saved. Figure 1 shows a screenshot of WWWTranscribe.

Figure 1: Screenshot of WWWTranscribe, showing the two windows with the orthographic and the phonetic transcription, a loudspeaker button on the right, and two radio buttons to start and to save the transcription.

3.3. The Phoneme Set and Transcription Procedures
The phoneme set of the so-called canonic form, laid down in the dictionary of the SpeechDat-AT database, was derived from the Aussprache-Duden [11], which is one of the major authoritative reference works on the standard pronunciation of German. Thus the phoneme sets of the German and the Austrian SpeechDat databases are the same. The canonic form encoded the utterances of the Austrian SpeechDat speakers according to the standard pronunciation rules of German German and was meant to serve as an intermediate reference point between the two national varieties. The fact that a phoneme set based on standard pronunciation was used to transcribe potentially non-standard or only near-standard pronunciations of Austrian speakers can be seen as a systematic bias in the analysis of the German German and the SpeechDat-AT lexicon. To overcome these obvious shortcomings, a selection of lexical items showing a wide range of phonetic environments was chosen and transcribed as narrowly as possible, close to the phonetic substance. In this process the core transcription based on the canonic form was corrected and a narrow phonemic transcription achieved (e.g. no diacritics were used), which yielded a considerable number of changes in the canonic forms. A further step in the correction process was the modification of the phoneme set of the canonic form to better fit the needs of the envisaged narrow transcription.

Six new signs were introduced which stand for phonemes or speech phenomena not covered by the signs of the original set of the "canonic form". AG stands for Austrian German:
1) [:] Vowel length
2) ["] Voice (used to mark voicing of the plosives [g], [b], [d], which in AG are usually voiceless)
3) ['] Lenis (used with the plosives [p], [t], [k], which in AG are usually voiceless fortes, thus marking voicing)
4) [6] Vocalisation (of post-vocalic R), thus allowing a new set of diphthongs: [I6/i:6], [Y6/y:6], [E6/e:6], [U6/u:6], [O6/o:6], [96/2:6 = /:], which are typical for AG speakers
5) [-] Nasalization
6) [k=#=k] Coarticulation (used where homorganic sounds meet at morpheme or syllable boundaries, leading to the elision of one of them, e.g. <achttausend> pronounced as [ax=#=tausend])

The result of the adaptations and corrections is a clearly better mapping of the phonetic substance of the AG pronunciations. In a further step, a statistical evaluation is to show the frequencies of the different pronunciation forms and thereby indicate which of the forms can be considered canonic for Austrian German and thus be relevant for ASR of Austrian German speakers.

3.4. Quality Assurance
We took measures to ensure error-free and consistent transcription. Apart from the clear guidelines defined in the course which was held before the actual transcriptions (e.g. the definition of the phoneme set), there were weekly meetings to discuss potential problems. Besides, there was a continuous process in which the group listened to recordings and amended the rule set. Finally a list of all transcribed phonemes was generated. In this list symbols were found which were obviously typing errors, since they were not in the list of phonemes defined in the course. These "non-phonemes" were replaced by the correct symbols.

4. Derivation of the Pronunciation Rules

The transcribed data are read speech and do not represent the common colloquial language in Austria. It was not the goal of this project to analyse colloquial Austrian German but to contribute to the improvement of ASR used in dialogue systems for Austrian speakers. It is also important to note that we did not consider regional differences within Austria. The results laid down in this paper represent global phenomena that are relevant for a wide set of speakers of an Austrian national variant of German.

4.1. Types of Differences
When we talk about "differences in pronunciation" we actually mean "differences in transcription". But every transcription (either canonic or a real transcription of one word uttered by a given speaker) is made on the basis of a pronunciation. Transcriptions and acoustic data are used together for the training of Hidden Markov Models. Thus recorded speech and phonetic transcription are related to each other, and the transcriptions should be as close as possible to their corresponding speech in order to avoid ambiguities. Such ambiguities are two or more different realizations of one phoneme (e.g. the e-schwa in prefixes or in suffixes, where it is frequently realised as syllabic) or, vice versa, two or more symbols for the same realization of one phoneme (e.g. voiced or unvoiced fricative /s/). As discussed in Section 3.3, the phoneme set was extended in order to facilitate narrow transcription. This results in two basic types of differences:

• Differences caused by the extension of the phoneme set. The canonic form can cover several variations of one pronunciation. For instance /E r/ can mean that the vowel /E/ is followed by the vibrant /r/ or, e.g. in prefixes, that the /r/ is vocalised. In order to obtain a more precise representation of this phenomenon we introduced the pronunciation symbol /E6/.
• Transcriptions which are not covered by the canonic form.

4.2. Examples for Differences and their Phonetic Environments
This section gives some examples of how the pronunciation can vary in some specific environments and how this influences the entries in the phonetic lexicon. The source file with all transcriptions consisted of three columns separated by tabulators: column 1 contains the orthographic transcription, column 2 the canonic phonetic transcription and column 3 the new transcription for each word (see footnote 1).

4.2.1. The prefix "ver-"
The prefix "ver-" is very frequent in German and occurs in words like verändern = to change or Versteck = hiding place. A lexicon entry was considered to contain the prefix "ver-" when the following regular expression (implemented in Perl) was found in the source file:

    ^(v|V)er.*\tf E r.*\t    (1)

With this approach we found 412 entries. Their variants and their occurrences are given in Table 1. Variant 1 is covered by the canonic form and represents a vocalization of an /r/ to the a-schwa /6/. Variant 2 is the most frequent in this phonetic environment. Apparently it is a typical phenomenon that in a prefix the letters "er" are realized as a monophone, which is the a-schwa /6/.
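The selection step for the prefix "ver-" can be sketched as follows. The paper's own implementation was in Perl; this is only a minimal Python illustration of the same regular expression, assuming the three-column, tab-separated layout described above. The function name `find_ver_entries` and all sample rows are hypothetical, and except for the canonic forms of Vergebung and Vers, which the paper quotes, the transcriptions are invented for illustration.

```python
import re

# Regular expression (1): orthography starts with "ver"/"Ver" and the
# canonic transcription (column 2) starts with "f E r".
VER_PREFIX = re.compile(r"^(v|V)er.*\tf E r.*\t")

def find_ver_entries(lines):
    """Return the source-file lines matched by expression (1).

    Hypothetical helper: each line is expected to hold orthography,
    canonic form and new transcription, separated by tabs.
    """
    return [line for line in lines if VER_PREFIX.search(line)]

# Hypothetical sample rows (not taken from the SpeechDat-AT lexicon).
SAMPLE = [
    "Vergebung\tf E r g e: b U N\tf 6 g e: b U N",
    "Strassen\tS t r a s @ n\tS t r a s @n",
    "Vers\tf E r s\tf E r s",
]

matches = find_ver_entries(SAMPLE)
for line in matches:
    print(line.split("\t")[0])
```

Note that the purely orthographic pattern also selects Vers, where "ver" is not a prefix, so the count obtained this way is an upper bound on true prefix entries.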

Table 1: Variations of prefix "ver-"
                transcription   occurrence
canonic form    f E r           4.6%
variant 1       f E6            32.5%
variant 2       f 6             47.3%

Footnote 1: In the regular expressions used here and in the following, a newline is denoted as \n, a tabulator as \t, the beginning of a line as ^, an OR as |, and .* stands for a sequence of arbitrary characters.

4.2.2. The voiced fricative "s"
The canonic form suggests two realizations of the orthographic "s": a voiced version /z/ and an unvoiced version /s/. The entries with a /z/ in the canonic form were extracted with the following regular expression:

    ^.*s.*\t.*z.*\t    (2)

We investigated how often a /z/ in the canonic form is realized this way. The total number of occurrences of the phoneme /z/ in the canonic form is 2939. Table 2 shows the distribution.

Table 2: Variations of the voiced "s"
                transcription   occurrence
canonic form    z               7.1%
variant 1       s               92.9%

This example shows that there can be two symbols where in fact only one of them is realized (the voiced version can be considered to be of only marginal importance).

4.2.3. The suffix "-en"
The suffix "-en" occurs in words like the infinitive of verbs (lachen = to laugh) as well as in plural forms of nouns (Strassen = roads) and is therefore one of the most frequently pronounced suffixes in the German language. Again a regular expression was used to pick out the corresponding entries:

    ^.*en\t.*E n\t    (3)

We introduced a new symbol for the syllabic "n", which is /@n/ (see footnote 2). The following examples show that there are strong variations depending on the environment (Table 3 to Table 6). For every table there is a brief overview of the number of occurrences in column 2 of the source file.

• Suffix "-en" after nasals or approximants (Table 3). The number of lexicon entries of the form /(n|m|N|j) @ n/ amounts to 497.

Table 3: Variations of suffix "-en" after nasals or approximants
                transcription   occurrence
canonic form    @ n             80.1%
variant 1       En              14.7%

• Suffix "-en" after bilabial plosives (Table 4). The number of lexicon entries of the form /(b|p) @ n/ amounts to 238.

Table 4: Variations of suffix "-en" after bilabial plosives
                transcription   occurrence
canonic form    @ n             28.2%
variant 1       @m or m         42.9%
variant 2       @n              24.4%

• Suffix "-en" after vibrants (Table 5). The number of lexicon entries of the form /r @ n/ amounts to 192.

Table 5: Variations of suffix "-en" after vibrants
                transcription   occurrence
canonic form    @ n             37.0%
variant 1       6n              22.4%
variant 2       @n              19.8%
variant 3       En              9.8%

• Else (Table 6). The total number of lexicon entries with all other phonemes followed by /@ n/ amounts to 2236.
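How per-environment percentages of this kind could be tallied is sketched below in Python. This is not the authors' procedure: the class names, the helper functions and the sample rows are all hypothetical, and the sketch makes the strong simplifying assumption that the observed transcription differs from the canonic one only in the suffix, so that the realisation can be read off by position.

```python
from collections import Counter

# Left-context classes named in the text, keyed by the phoneme before "@ n".
NASALS_APPROX = {"n", "m", "N", "j"}
BILABIAL_PLOSIVES = {"b", "p"}
VIBRANTS = {"r"}

def context_of(canonic):
    """Classify a canonic form ending in '@ n' by the preceding phoneme.

    Returns None if the form does not end in '@ n'.
    """
    phones = canonic.split()
    if len(phones) < 3 or phones[-2:] != ["@", "n"]:
        return None
    left = phones[-3]
    if left in NASALS_APPROX:
        return "nasal_or_approximant"
    if left in BILABIAL_PLOSIVES:
        return "bilabial_plosive"
    if left in VIBRANTS:
        return "vibrant"
    return "else"

def suffix_realisation(canonic, observed):
    """Return what stands in place of '@ n' in the observed form.

    Assumption (hypothetical): the stem is transcribed identically in
    both forms, so the suffix starts at the same phoneme index.
    """
    stem_length = len(canonic.split()) - 2
    return " ".join(observed.split()[stem_length:])

def variant_shares(rows):
    """Map each context to {realisation: percentage of entries}."""
    counts = {}
    for _, canonic, observed in rows:
        ctx = context_of(canonic)
        if ctx is None:
            continue
        counts.setdefault(ctx, Counter())[suffix_realisation(canonic, observed)] += 1
    return {
        ctx: {v: 100.0 * n / sum(c.values()) for v, n in c.items()}
        for ctx, c in counts.items()
    }

# Hypothetical rows: orthography, canonic form, observed transcription.
ROWS = [
    ("haben", "h a: b @ n", "h a: b m"),
    ("leben", "l e: b @ n", "l e: b @n"),
    ("fahren", "f a: r @ n", "f a: r 6n"),
    ("lachen", "l a x @ n", "l a x @n"),
]

shares = variant_shares(ROWS)
print(shares)
```

On real data the positional assumption would fail whenever the stem itself is transcribed differently, so an alignment of the two transcriptions would be needed first.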

Table 6: Variations of suffix "-en" else
                transcription   occurrence
canonic form    @ n             24.2%
variant 1       @n              60.4%
variant 2       En              6.4%

Footnote 2: Note that there is no blank between the two symbols in /@n/, i.e. it is considered a monophone rather than two separate phonemes.

5. Discussion
The variations shown in the examples above occur in many more phonetic environments. Pronunciation rules could now be defined, which were implemented in Perl using substitution commands such as:

    s/^(v|V)(er.*\t)f E r(.*)\n/$1$2f 6$3\tf E6$3\n/g;    (4)

This command would replace the line in the canonic lexicon

    Vergebung    f E r g e: b U N

by

    Vergebung    f 6 g e: b U N    f E6 g e: b U N

Unfortunately this approach leads to ambiguities. For example the word Vers = verse would now be transcribed as /f E6 s/ and /f 6 s/, which is definitely wrong. This problem was solved by including morphological and syllabic information in the canonic lexicon, which affects the previous example in the following way:

    Vergebung    f E r + g e: b U N

will be replaced by

    Vergebung    f 6 g e: b U N    f E6 g e: b U N

where the "+" stands for a morph boundary. Thus the regular expression was modified in the following way:

    s/^(v|V)(er.*\t)f E r \+(.*)\n/$1$2f 6$3\tf E6$3\n/g;    (5)

The entry of the word "Vers" would not be touched since it contains no "+" for the morph boundary:

    Vers    f E r s

Stress markers have not been included in the pronunciation rules yet even though they were transcribed. This will be done later. We also added information on whether a word is native to German, because the origin of a word provides information about stress and vowel lengths. Native words were marked with "0", non-native words with "1" and proper names with "2".

6. Future Research
The next step is to check whether the recognition rate improves with the new lexicon. This will be done by performing two separate trainings with the two lexicons. After that we will test the two AMOs with the same test set and compare the results. Ongoing research will be concerned with the inclusion of German and Swiss speakers. New accent regions within the complete German-speaking area have to be defined in terms of acoustic phonetics, and new AMOs have to be trained for these accent regions. The final goal of our research, however, is to find an approach for regional or national accent recognition. The speech recogniser could then (after the detection of the accent region) switch to the corresponding AMO and transcriptions. One such approach could be based on phonetic features, which was investigated in a diploma thesis [12]. The approach, however, did not yield results good enough for practical use in a speech recogniser. One reason could have been the relatively small set of data. Another reason might have been the selection of the speakers. They were taken from the SpeechDat-AT database and stated that they had spent their school time predominantly in Germany. Nonetheless many of them live in Austria now, and this naturally influences their pronunciation. Another way of detecting a regional variant could be parallel phoneme recognition [13].

7. Conclusions
A phonetic lexicon for Austrian German was generated. It is based on narrow transcriptions of a big corpus and is thus the first of its kind. It will be used to improve automatic speech recognition of national and regional variants of German. Apart from that, it also yields new knowledge about the phonetics of Austrian German. Nevertheless it has to be taken into account that we did not deal with fluent speech but only with speech which occurs in applications of ASR, for instance in dialogue systems.

8. References
[1] Burger, S. and Draxler, C., "Identifying Dialects of German from Digit Strings", Proceedings of Eurospeech 97, Rhodes, Greece.
[2] Vollmann, R. and Moosmüller, S., "The change of diphthongs in Standard Viennese German. The diphthong / ", Proceedings of the ICSLP 1999, San Francisco, USA.
[3] Frenckenberger, S., "Die Datenstruktur LIFT und der Regelinterpreter GRIP angewendet in der Sprachsynthese", Ph.D. Thesis, University of Technology Vienna, Austria, 1993.
[4] Bürkle, M., "Zur Aussprache des Österreichischen Standarddeutschen. Die unbetonten Nebensilbe.", Peter Lang Verlag, Frankfurt/Main, Germany, 1995.
[5] Takahashi, H., "Die richtige Aussprache des Deutschen in Deutschland, Österreich und der Schweiz nach Maßgabe der kodifizierten Normen", Peter Lang Verlag, Frankfurt/Main, Germany, 1996.
[6] Schiel, F. and Kipp, A., "Statistical Modelling of Pronunciation: It's not the Model, it's the data", Proceedings of the ICSLP 1998, Sydney, Australia.
[7] Beringer, N. and Schiel, F. and Brietzmann, P., "German Regional Variants - A Problem for Automatic Speech Recognition?", Proceedings of the ESCA Tutorial and Research Workshop on 'Modelling Pronunciation Variation for Automatic Recognition' 1997, Kerkrade, Netherlands.
[8] Burger, S. and Oppermann, D., "Regional Variants of German: Categories of Pronunciation from Standard German", Proceedings ICPhS 1999, San Francisco, USA.
[9] Baum, M. and Erbach, G. and Kubin, G., "A telephone speech database for Austrian German", Proceedings of LREC 2000 workshop Very Large Telephone Speech Databases (XL-DB).
[10] Draxler, C., "WWWTranscribe - A Modular Transcription System based on the World Wide Web", Proceedings Eurospeech 99, Budapest, Hungary.
[11] Drosdowski, G. and Müller, W. and Scholze-Stubenrecht, W. and Wermke, M., "Das Aussprachewörterbuch", Dudenverlag Mannheim (1990).
[12] Hagmüller, M., "Recognition of Regional Variants of German using Prosodic Features", Diploma Thesis, University of Technology Graz, Austria, 2001.
[13] Kumpf, K. and King, R., "Automatic Accent Classification for Foreign accented Australian English Speech", Proceedings of the ICSLP 1996, Philadelphia, USA.