
Grapheme-to-Phoneme Conversion for Croatian Speech Synthesis

Lucia Načinović, Miran Pobar, Ivo Ipšić and Sanda Martinčić-Ipšić
Department of Informatics, University of Rijeka
Omladinska 14, 51000 Rijeka, Croatia
Phone: (385) 51-345 046 Fax: (385) 51-345 207

Abstract - Grapheme-to-phoneme conversion is one component of a text-to-speech system: to turn written words into speech, orthographic symbols must first be converted into phonetic symbols. In this paper we present rules and patterns of the Croatian language that are important for grapheme-to-phoneme conversion. Suggestions for programming those rules in Perl are also given.

I. INTRODUCTION

Through technological mediation such as telephony, the Internet, movies, radio and television, the spoken word has been extended and is no longer reserved for humans only. In the field of human-machine interaction, spoken language technology is being incorporated into applications in the home, mobile and office segments [5].

As a part of spoken language technology, speech synthesis plays a major role in human-machine interaction. Speech synthesis, or text-to-speech (TTS), systems convert written words into speech. Although it may seem trivial, building a system that can generate human-like speech from any text input is very demanding. First, orthographic symbols have to be converted into phonetic symbols. To sound natural, the intonation of the sentences must be generated appropriately. Moreover, human speech draws on knowledge of the world and a good knowledge of the language itself, so the naturalness of human speech is very difficult to achieve.

The basic components of a TTS system are shown in Figure 1. In text analysis, the text is normalized so that it becomes speakable (numbers, acronyms etc. are expanded into full words). The phonetic analysis component converts the processed text into the corresponding phonetic sequence, and in prosodic analysis the appropriate pitch and duration information is attached to the phonetic sequence. Finally, the speech synthesis component generates the corresponding speech waveform [5].

In this paper we concentrate on phonetic analysis for the Croatian language and point out rules and patterns of Croatian that are important for grapheme-to-phoneme conversion. Although in Croatian the orthographic sequence of symbols is much closer to the corresponding phonetic sequence than in some other languages, such as English, converting graphemes into phonemes is still a considerable problem.

[Figure 1: block diagram of a TTS system. Text Input -> Text Analysis (document structure detection, text normalization) -> Phonetic Analysis (grapheme-to-phoneme conversion) -> Prosodic Analysis (pitch and duration attachment) -> Speech Synthesis (voice rendering) -> Speech Signal]

Figure 1. Basic system architecture of a TTS system.
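The staged architecture of Figure 1 can be pictured as a chain of functions, each consuming the output of the previous one. The following is only a minimal sketch, written in Python rather than the Perl and Matlab used in this work. All names (`Phone`, `NUMBER_WORDS`, `text_analysis`, `phonetic_analysis`, `prosodic_analysis`, `tts_front_end`) are hypothetical, and every stage is a placeholder: the lexicon has two entries, the phonetic step emits one symbol per letter, and the prosody values are constants. The final waveform-generation stage is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phone:
    symbol: str        # phonetic symbol produced by phonetic analysis
    duration_ms: int   # filled in by prosodic analysis
    pitch_hz: float    # filled in by prosodic analysis


# Tiny illustrative lexicon for text normalization (an assumption).
NUMBER_WORDS = {"4": "četiri", "15": "petnaest"}


def text_analysis(text: str) -> str:
    """Normalize text so it is speakable: expand known numbers into words."""
    return " ".join(NUMBER_WORDS.get(tok, tok) for tok in text.lower().split())


def phonetic_analysis(text: str) -> List[str]:
    """Placeholder grapheme-to-phoneme step: one symbol per letter."""
    return [ch for ch in text if not ch.isspace()]


def prosodic_analysis(symbols: List[str]) -> List[Phone]:
    """Attach constant placeholder duration and pitch values."""
    return [Phone(s, duration_ms=80, pitch_hz=120.0) for s in symbols]


def tts_front_end(text: str) -> List[Phone]:
    """Run the three analysis stages of Figure 1 in order."""
    return prosodic_analysis(phonetic_analysis(text_analysis(text)))
```

In such a layout the grapheme-to-phoneme rules discussed in this paper would replace the body of `phonetic_analysis` without touching the other stages.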

TTS systems already exist for some other Slavic languages that are similar to Croatian. For example, there are TD-PSOLA-based [11] TTS systems for Slovenian [13] and Serbian [12]. For Croatian, a diphone database for the MBROLA synthesizer was developed [15]. Festival [14] is a multilingual TTS system with support for English, Spanish and Welsh. It can be extended to other languages and can use the MBROLA synthesizer [17].

The paper is organized as follows. Section II points out the rules of the Croatian language that can be incorporated into the Croatian speech synthesis system in order to improve the intelligibility and naturalness of the synthesized speech. Section III describes how those rules were implemented in the Croatian speech synthesis system. Section IV gives the results of the evaluation of the synthesized speech after the rules were applied. Section V gives an overview of the paper and some suggestions for future work.

II. RULES AND PATTERNS IN THE CROATIAN LANGUAGE IMPORTANT FOR THE GRAPHEME-TO-PHONEME CONVERSION

There are thirty graphemes in the Croatian language, and they mostly correspond to phonemes, but careful listening to spoken Croatian reveals other sounds too. For example, the phoneme /n/ is pronounced quite differently in the words rastanak and banka. The /n/ in rastanak is the sound we associate with the letter n: if asked to pronounce the letter n, we produce the /n/ heard in rastanak. If asked whether the sound pronounced in banka exists, we would probably say that it does not, although we pronounce exactly that sound every time we say banka, or any other word in which /n/ is followed by /k/ or /g/. Without paying special attention, we are usually not conscious of it. Besides the rule that /n/ is pronounced differently when followed by /k/ or /g/, Croatian has many other similar rules, which are pointed out in the following subsections.
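The banka example above is a context-dependent substitution, which maps naturally onto a regular expression with a lookahead. The sketch below is written in Python rather than the Perl used in this work, and the function name `velarize_n` is hypothetical. It assumes lowercase input and marks the variant with the symbol ŋ.

```python
import re


def velarize_n(word: str) -> str:
    """Replace n with ŋ when it is immediately followed by k or g."""
    # (?=[kg]) is a lookahead: it checks the next letter without consuming it,
    # so the k or g itself is left unchanged.
    return re.sub(r"n(?=[kg])", "ŋ", word)
```

The digraph nj is not affected, because there the n is followed by j rather than by k or g.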

A. Graphemes and their corresponding phonemes in the Croatian language

Table 1 shows the graphemes of the Croatian language and their corresponding phonemes. There are thirty graphemes and thirty basic corresponding phonemes, and additionally the syllabic r and the diphthong ije, which are acknowledged by all Croatian grammar books [2], [3], [4]. Grapheme-to-phoneme rules can be grouped according to the division of the syllables and their combination by manner of articulation (sections C and G), place of articulation (section E), similarity of articulation (section F) and dependence on their surroundings (sections D and H).

Table 1. Graphemes and their corresponding phonemes in the Croatian language

GRAPHEME   PHONEME     GRAPHEME   PHONEME
a          /a/         lj         /ļ/
b          /b/         m          /m/
c          /c/         n          /n/
č          /č/         nj         /ņ/
ć          /ć/         o          /o/
d          /d/         p          /p/
dž         /Ǯ/         r          /r/ /ṛ/
đ          /Ʒ́/         s          /s/
e          /e/         š          /š/
f          /f/         t          /t/
g          /g/         u          /u/
h          /h/         v          /v/
i          /i/         z          /z/
j          /j/         ž          /ž/
k          /k/         ije        /ḭe/ /ije/
l          /l/

B. Relationship between phonemes and their pronunciation

A phoneme can be pronounced in different ways. There are four main possible relationships between a phoneme and its pronunciation [2], [3], [4]:
1) the phoneme is pronounced in its typical way, i.e. in the way that occurs most often in words; for example, /n/ as pronounced in [naprijed]
2) the phoneme is pronounced differently from its typical way depending on its surroundings; for example, /n/ and its variant [ŋ] in the words [staŋka], [taŋgo] etc.
3) the phoneme is pronounced as the typical pronunciation of another phoneme; for example:
/z/ as [s]: bez tebe [bestebe]
/t/ as [d]: brat bi [bradbi]
/d/ as [t]: kod kuće [kotkuće]
/s/ as [z]: s bratom [zbratom]
/n/ as [m]: on bi [ombi]
/z/ as [š]: bez čaše [beščaše]
4) the phoneme is omitted in speech; for example: past ću [pašću], učit ću [učiću], kod tebe [kotebe], bez šuma [bešuma].

C. Voiced and voiceless consonants

Table 2. Voiced and voiceless consonants

Voiceless consonants   p   t   k   č   ć   š   s   f   c   h
Voiced consonants      b   d   g   dž  đ   ž   z   -   -   -

Table 2 lists the voiced and voiceless consonants that occur in the Croatian language. There are certain rules about combining those consonants:
1) before a voiced consonant only a voiced consonant can occur; for example:
svat - svadba (not svatba)
glas - glazba (not glasba)
- exceptions:
- in some compound words: ivanićgradski, Josipdol, tisućgodišnji - [ivaničgracki], [josibdol], [tisučgodišnji]
- in some foreign words: jurisdikcija [jurizdikcia]
- in foreign proper names: Tbilisi [dbilisi]
2) before a voiceless consonant only a voiceless consonant can occur; for example:
vrabac - vrapca (not vrabca)
gladak - glatka (not gladka)
- exceptions:
- in some foreign words and in foreign proper names: gangster [gankster], habsburški [habzburški], vašingtonski [vašinktonski], Habsburg [habzburg], Redford [retford]
3) the sequences ds, dš (/ts, tš/) are pronounced in speech as [c, č]; for example: gradski [gracki], odšetati [očetati], podstanar [poctanar], podšišati [počišati], podšivati [počivati].

D. Allophones

In the pronunciation of every phoneme there are many slightly different variants, depending on:
1) the surroundings of the phoneme in a word
2) the speaker - his/her age, mood, physiological condition etc.

Table 3 lists the allophones that occur in Croatian, with an example of each.

Table 3. Allophones

phoneme /n/ before /k/ and /g/ - [ŋ]: banka [baŋka], tango [taŋgo]
phoneme /n/ before /č/ - [ṇ]: rastanče [rastaṇče]
phoneme /n/ before /ć/ and /Ʒ́/ - [n’]: inćun [in’ćun], anđeo [an’Ʒ́eo]
phoneme /c/ before a voiced consonant - [ʒ]: stric bi došao [striʒbi došao], zec ga gleda [zeʒga gleda]
phoneme /h/ before a voiced consonant - [γ]: strah ga je [straγgaje], rekoh da [rekoγda]
phoneme /f/ before a voiced consonant - [F]: grof ga gleda [groFga gleda], šef ga pita [šeFga pita]
phoneme /m/ before /b/ - [ɱ]: bomba [boɱba]
phoneme /m/ before /v/ - [m̢]: tramvaj [tram̢vaḭ]
according to [2], phonemes /n/ and /m/ before phonemes /f/ and /v/ - [ɱ]: invalid [iɱvalid], informacija [iɱformaciḭa], tramvaj [traɱvaḭ], komfor [koɱfor]
phoneme /š/ before /ć/ - [ś]: lišće [liśće]
phoneme /ž/ before /Ʒ́/ - [ź]: grožđe [groźƷ́e]

E. Division of the consonants according to the place of articulation

Because consonants are made by restricting the airflow in some way, they can be distinguished by the place where this restriction is made. Table 4 gives the division of the consonants according to the place of articulation.

Table 4. The division of the consonants according to the place of articulation

nonpalatal   s   z   h   -   -   -   -   -
palatal      š   ž   č   ć   Ʒ́   Ǯ   ļ   ń
dental       n   -   -   -   -   -   -   -
labial       p   b   m   -   -   -   -   -

Rules about combining those consonants in the Croatian language:
1) before the palatal consonants /š, ž, č, ć, Ʒ́, Ǯ, ļ, ń/, the palatal consonants /š/ and /ž/ occur, but not the non-palatal consonants /s/, /z/, /h/; for example:
orah - oraščić (not orahčić)
trbuh - trbuščić (not trbuhčić)
paziti - pažnja (not paznja)
2) before the labial consonants /p/ and /b/, the labial consonant /m/ occurs, but not the dental consonant /n/; for example:
hiniti - himba (not hinba)
stan - stambeni (not stanbeni)
jedan - jedanput [jedamput]
izvan - izvanbrodski [izvambrodski]
In writing, this rule applies when a suffix begins with /p/ or /b/ but not in other cases (izvanbrodski, jedanput). In speech, however, the rule applies in all cases: [jedamput], [izvambrodski].

F. Omission of the consonants

Depending on the sameness or similarity of the articulation of consonants, there are certain rules about combining them. One of them is important because it has exceptions in which the rule is applied in the spoken but not in the written form of the word:
- between the consonants /s/ and /z/ and another consonant other than /r/ and /v/, the consonants /t/ and /d/ cannot be realized; for example:
gost - gozba (not gostba)
nužda - nužni (not nuždni)
- exceptions:
1) the consonant sequence stn in adjectives derived from foreign words: aoristni [aorisni], azbestni [azbesni], protestni [protesni]
2) the consonant sequence stk in feminine nouns derived from masculine nouns ending in -ist: feministkinja [feminiskińa], idealistkinja [idealiskińa].

G. Syllabic r, l and n

Syllabic r, l and n differ from the "ordinary" consonants r, l and n in sonority and duration: they are usually more sonorous and last longer. Syllabic r occurs at the beginning of a word when followed by a consonant ([ṛzati]), between two consonants of which the first is not /j, r, l, ļ, n, ń, ć, Ʒ́, Ǯ/ ([vṛt]), and at the end of foreign words after a consonant ([masakṛ], [žanṛ]). Syllabic l and n occur in foreign words between two consonants ([Ʒ́entḷmen], [rehṇšiber]) or at the end of the word after a consonant ([bicikḷ], [fascikḷ], [ńutṇ], [šmirgḷ]).
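Looking back at the voicing rules of section C, the pairs of Table 2 are enough to sketch regressive voicing assimilation: a consonant takes on the voicing of the consonant that follows it. The code is in Python rather than the Perl used in this work, and the names (`tokenize`, `assimilate_voicing`) are hypothetical. It assumes that every written dž, lj and nj is a digraph, and that f, c and h only trigger devoicing and are never changed themselves, so the allophones of Table 3 are ignored.

```python
# Voiceless/voiced pairs taken from Table 2.
VOICELESS = ["p", "t", "k", "č", "ć", "š", "s"]
VOICED = ["b", "d", "g", "dž", "đ", "ž", "z"]
VOICED_OF = dict(zip(VOICELESS, VOICED))      # e.g. "t" -> "d"
VOICELESS_OF = dict(zip(VOICED, VOICELESS))   # e.g. "d" -> "t"
UNPAIRED_VOICELESS = {"f", "c", "h"}          # no voiced partner in Table 2


def tokenize(word: str) -> list:
    """Split a word into graphemes, treating dž, lj and nj as single units."""
    tokens, i = [], 0
    while i < len(word):
        if word[i:i + 2] in ("dž", "lj", "nj"):
            tokens.append(word[i:i + 2])
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens


def assimilate_voicing(word: str) -> str:
    """Make each paired consonant agree in voicing with the consonant after it."""
    tokens = tokenize(word.lower())
    # Walk right to left so a change can propagate through a longer cluster.
    for i in range(len(tokens) - 2, -1, -1):
        cur, nxt = tokens[i], tokens[i + 1]
        if nxt in VOICELESS_OF and cur in VOICED_OF:
            tokens[i] = VOICED_OF[cur]        # voiceless before voiced
        elif (nxt in VOICED_OF or nxt in UNPAIRED_VOICELESS) and cur in VOICELESS_OF:
            tokens[i] = VOICELESS_OF[cur]     # voiced before voiceless
    return "".join(tokens)
```

Since Croatian spelling already reflects most of these changes, a function like this matters mainly for the listed exceptions (Josipdol, jurisdikcija, habsburški) and across word boundaries, where the words would first be joined, as in s bratom -> sbratom.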

H. Rules about the occurrence of the phonemes in a word

In the Croatian language all of the phonemes can occur at any place in a word: at the beginning, in the middle or at the end. Some long, doubled consonants are an exception:
1) the consonants [c:, č:, ć:, Ǯ:, Ʒ́:], which occur in speech instead of:
a) the sequences dc, dč, dć, ddž, dđ (/tc, tč, tć, dǮ, dƷ́/) in which the first consonant is part of the prefix and the second is part of the base; for example: nadcestar [nac:estar], podčiniti [poč:initi], odćarlijati [oć:arliati], naddžepak [naǮ:epak]
b) the sequences tc, tč, dc, dč (/tc, tč/) when they occur between the base and the suffix; for example: N bitka - DL bici and bitci [bic:i]; N mladac - V mlače and mladče [mlač:e]
c) the sequences tc, tč, dc, dč, tdž, ddž (/tc, tč, dǮ/) in which the first consonant is part of the base and the second is part of the suffix; for example: korito - korice and koritce [koric:e]; medvjed - medvječe and medvjedče [medveč:e]
2) the consonants [b:, p:, t:, s:, z:, š:], which occur in speech instead of the sequences bb, bp, dt, zs, zz, zš, dd (/bb, pp, tt, ss, zz, šš, dd/) in compound words with prefixes; for example: subbioceanski [sub:ioceanski], subpolaran [sup:olaran], poddinarski [pod:inarski]
3) doubled consonants also occur between two words; for example:
tć: radit ću [radić:u]
tč: brat čeka [brač:eka]
dc: pod cestom [poc:estom].

I. Some additional rules

Besides the listed rules, there are some additional rules in the Croatian language:
1) in words containing the sequence četiri, the first phoneme /i/ is dropped in speech: četiri [četri]
2) the sequence ts is phonetically transformed into c: Hrvatska [hrvacka]
3) the sequence ije becomes je in phonetic transcription, although there are exceptions (dvije, prije, pijem, grijem...)
4) the sequence ae in numbers is pronounced as [a]: jedanaest [jedanast].

III. IMPLEMENTATION OF THE RULES IN THE CROATIAN SPEECH SYNTHESIS SYSTEM

The concatenation-based TTS system for Croatian uses a method based on the PSOLA (Pitch Synchronous Overlap and Add) algorithm [11]. The input to the synthesizer is a string of phones and, optionally, the between-word pause durations in samples [18]. The output signal is calculated by overlapping and adding the diphones according to the input phone sequence. First the signals are overlapped and aligned to the pitch marks. Then the overlapping parts of the signals are multiplied by a Hanning window and added together, and the non-overlapping parts are copied directly into the output signal.

The basic functionality of the phonetic analysis module is provided by a Matlab function that takes normalized text as input and outputs the corresponding phone string. The additional rules given in Section II were programmed in Perl [16]. Since the existing diphone database for the Croatian speech synthesis system does not include all the syllables and allophones given in this paper, some of the rules had to be omitted. To apply all of the rules, the database should be extended with the additional syllables and allophones. Table 5 shows the difference between the input text and the output text after applying the Perl script [16] that implements the rules that can be applied to the existing database.

We used Perl because it handles text very easily and simplifies common text-processing tasks such as search and replace, saving and loading files, and making lists [9]. One of Perl's most important features is its powerful handling of regular expressions, which describe patterns to search for in a text. They are composed of literal characters, that is, ordinary text characters like abc, and of metacharacters like * that have a special meaning. The simplest form of a regular expression is a sequence of literal characters: letters, numbers, spaces or punctuation signs. For example, the following substitution replaces every occurrence of rijeka in the text with Rijeka, regardless of letter case: $recenica =~ s/rijeka/Rijeka/gi;
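A handful of the rules from Section II can be chained as an ordered list of such substitutions. The sketch below is in Python rather than Perl and is not the script from [16]: the rule selection, the rule order and the names `RULES` and `to_phonetic` are assumptions made for illustration. Python's `re.sub` replaces every match by default, like the g modifier in the Perl example, and lowercasing the input takes the place of the i modifier. The ije rule (I.3) is left out because its exceptions would need a word list.

```python
import re

# (pattern, replacement) pairs applied in order. Illustrative selection only.
RULES = [
    (r"četiri", "četri"),    # I.1: the first i of četiri is dropped
    (r"aest\b", "ast"),      # I.4: ae in numerals such as jedanaest
    (r"[td]s", "c"),         # I.2 and C.3: ts, ds -> c
    (r"[td]š", "č"),         # C.3: dš (/tš/) -> č
    (r"n(?=[pb])", "m"),     # E.2: n -> m before p or b
]


def to_phonetic(text: str) -> str:
    """Lowercase the text and apply every substitution rule in sequence."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text
```

Because the rules run in sequence, a later rule sees the output of the earlier ones, so the order of the list is part of the design and would need care in a larger rule set.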

Table 5. Difference between the input and the output text after grapheme-to-phoneme transcription using the Perl script for the existing diphone database for the Croatian language

Input text: Nino je kod kuće s bratom. Oni žive u Josipdolu. Josipdol je habsburški grad. Nino ima petnaest godina, a njegov brat četiri. Predsjednik republike Hrvatske jedanput je izjavio da se gradski proračun mora određivati na lokalnoj razini. Lokalna se jurisdikcija ne smije podcijenjivati.

Output text: Nino je kotkucce zbratom. Oni Zive u Josibdolu. Josibdol je habzburSki grad. Nino ima petnajst godina, a Negov braC:etri. Precjednik republike Hrvacke jedamput je izjavio da se gracki proraCun mora odredzivati na lokalnoj razini. Lokalna se jurizdikcija ne smje pocjeNivati.

IV. EVALUATION OF THE SYNTHESIZED SPEECH AFTER THE IMPLEMENTATION OF THE NEW RULES

To see whether the implementation of the new rules had an impact on the intelligibility and quality of the synthesized speech, we conducted a survey among 18 evaluators. They first listened to a fragment of speech synthesized without the new rules and then to a fragment synthesized with the new rules. The input text for both fragments was the same (it is given in Table 5). The evaluators then answered the following questions:
1. Do you notice the difference in the intelligibility and quality between the two fragments?
2. If you do, which fragment is more intelligible and of better quality?
The evaluators were then asked to listen to both fragments again, paying attention only to specific words (the underlined words in Table 5) on which the new rules have an impact. Then they answered the following questions:
1. Do you notice the improvement in the intelligibility and quality of the specific words in the second fragment?
2. If yes, what is the difference in the intelligibility and quality of the specific words? Possible answers were: almost no difference, little difference and considerable difference.

A. Survey Results

On the first set of questions, 56% of the evaluators noticed the difference between the two fragments, and all of them thought that the fragment synthesized with the new rules was better than the fragment synthesized without them.

[Figure 2: bar chart of the Yes/No answers to the question "Do you notice the difference in the intelligibility and quality between the two fragments?"]

Figure 2. The evaluation of the difference in the intelligibility and quality between the two fragments.

On the second set of questions, 89% of the evaluators noticed the difference between the two fragments after concentrating on the specific words.

[Figure 3: bar chart of the Yes/No answers to the question "Do you notice the improvement in the intelligibility and quality of the specific words in the second fragment?"]

Figure 3. The evaluation of the improvement in the intelligibility and quality of the specific words in the second fragment.

For the second question, regarding the level of the difference in the intelligibility and quality of the words in the two fragments, the answers were as follows:
almost no difference: 28%
little difference: 67%
considerable difference: 5%

[Figure 4: bar chart of the answers (almost no difference, little difference, considerable difference) to the question "What is the difference in the intelligibility and quality of the specific words?"]

Figure 4. The evaluation of the degree of improvement in the intelligibility and quality of the specific words in the second fragment.

V. CONCLUSION

In this paper we pointed out rules and patterns of the Croatian language that are important for grapheme-to-phoneme conversion. Phonemes are pronounced slightly differently in different surroundings. When those differences occur regularly, they can be formulated as rules and applied in TTS systems so that the synthesized speech sounds more natural. The rules were programmed in Perl, and an example of the difference between the input and the output text after grapheme-to-phoneme transcription with the Perl script for the existing Croatian diphone database was given. A survey among 18 evaluators, conducted to determine whether the rules make a perceivable difference to the quality of the synthesized speech, showed limited improvement. The existing diphone database for Croatian does not currently support all of the syllables and allophones used by the rules in this paper; to apply them all, the database should be extended with the missing allophones. The perceived quality improvement may also be limited by other, currently more pronounced issues, such as general prosody or noise in the synthesized speech.

REFERENCES

[1] D. Božić, Automatska prozodija za autonomnog virtualnog reprezentatora, diplomski rad, Fakultet elektrotehnike i računarstva, Zagreb, 2007.
[2] D. Brozović, Fonologija hrvatskoga književnoga jezika, in S. Babić, D. Brozović, M. Moguš, S. Pavešić, I. Škarić, S. Težak, Povijesni pregled, glasovi i oblici hrvatskoga književnoga jezika, Nacrti za gramatiku, HAZU, Globus, Zagreb, 1991, pp. 381-450.
[3] J. Silić, I. Pranjković, Gramatika hrvatskoga jezika: za gimnazije i visoka učilišta, Školska knjiga, Zagreb, 2005.
[4] V. Zečević, Fonetika i fonologija, in E. Barić, M. Lončarić, D. Malić, S. Pavešić, M. Peti, V. Zečević, M. Znika, Hrvatska gramatika, Školska knjiga, Zagreb, 1997.
[5] X. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: a guide to theory, algorithm, and system development, Prentice Hall, 2001.
[6] V. Demberg, Letter-to-Phoneme Conversion for a German Text-to-Speech System, http://homepages.inf.ed.ac.uk/s0455377/Diplomarbeit.pdf
[7] A. Font Llitjós, Improving Pronunciation Accuracy of Proper Names with Language Origin Classes. B.A., http://fife.speech.cs.cmu.edu/pronounce-names/mthesiscmu.htm#_Toc528071064
[8] D. Gibbon, Talking Computers: Grapheme-to-phoneme conversion, http://wwwhomes.uni-bielefeld.de/gibbon/Classes/Classes2007WS/TalkingComputers/03aPM2
[9] P. Nugues, An Introduction to Language Processing with Perl and Prolog, Springer, 2006.
[10] D. Sheppard, Beginner's Introduction to Perl, http://www.perl.com/pub/a/2000/10/begperl1.html
[11] E. Moulines, F. Charpentier, Pitch Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones, Speech Communication, 1990, 9:453-467.
[12] M. Sečujski, R. Obradović, D. Pekar, Lj. Jovanov, V. Delić, AlfaNum sistem za sintezu govora na osnovu teksta na srpskom jeziku, TSD 2002, Brno, pp. 237-244.
[13] J. Gros, N. Pavešić, F. Mihelič, Text-to-Speech Synthesis: A Complete System for the Slovenian language, Journal of Computing and Information Technology - CIT 5, 1997, 1, 11-19.
[14] P. Taylor, A. Black, R. Caley, The architecture of the festival speech synthesis system, The Third ESCA Workshop in Speech Synthesis, Jenolan Caves, Australia, 1998, pp. 147-151.
[15] T. Dutoit, An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1997.
[16] L. Načinović, Grafemsko-fonemska pretvorba u hrvatskom jeziku, diplomski rad, Filozofski fakultet u Rijeci, 2008.
[17] M. Pobar, S. Martinčić-Ipšić, I. Ipšić, Text-To-Speech Synthesis: A Prototype System for The Croatian Language, Engineering Review, Vol. 28(2), pp. 31-44, 2008.