ARAB_TTS: An Arabic Text To Speech Synthesis Zouhir ZEMIRLI Institut National d’Informatique, LMCS, BP 35M – 16309 - Oued-smar Algiers ALGERIA
[email protected] several prosodic factors through the observation of the duration of the phonemes [3] according to the syllabic structures, the pauses, and the contextual variation. Normal, negative, interrogative and exclamatory intonation contours are produced. We finally discuss the results obtained and the contribution of the tools developed for the generation of the prosodic phonetic chain, producing an almost natural Arabic speech synthesis.
Abstract
Research on speech synthesis, and particularly on prosodic generation, is still the biggest challenge in Text-to-Speech synthesis. In the MBROLA project, voice databases (i.e., the DSP part of a TTS system) are available for Arabic, Brazilian, Portuguese, Breton, British English, Dutch, French, German, Romanian, Spanish, … Several complete TTS systems for Indo-European languages were developed around this project, but not yet for the Arabic language. The quality of a text-to-speech synthesis depends on the naturalness and intelligibility of the generated speech and on the specific characteristics of the voice produced. These characteristics depend on the synthesis techniques and methods, but also on the care taken in linguistic and prosodic modelling. Several works underline the fact that linguistic structures are closely linked to prosodic realizations. For the Arabic language, existing models are based on the syllabic structure of the words, on stress and on the concept of intonation markers (mainly interrogative), and very little on syntactic information. The object of this paper is to describe the modelling and integration of the phonological, morpho-lexical and syntactic knowledge necessary for the development of a complete Arabic Text-To-Speech system starting from diacritized Arabic texts.
2. ARAB_TAG: Tagging and Syntactic groups
This module is essential in any text-to-speech synthesis system, because pauses can be inserted and prosodic markers generated only if a minimum of grammatical information is available for each word of the sentence. The texts submitted to our system are correctly diacritized. The function of this module is to identify each diacritized word and assign it a morpho-syntactic label (unaccomplished verb, pronoun, noun...). The main difficulty comes from the affixes (prefixes and/or suffixes) attached to the word. With the exception of the particles, any linguistic form can be analyzed into a root and a stem. Root and stem are the fundamental concepts of Arabic morphology. The decomposition of the word بِمَدرَسَتِهِم (in their school) produces (بِ | مَدرَسَ | تِهِم). ARAB_TAG aims to produce a syntactic analysis specifically suited to real-time use in a TTS system. This syntactic analysis contributes to improving the quality of ARAB_TTS and making it more acceptable. The analyzer must be adapted to constraints both of performance and of results (fast, robust and deterministic). Any morpho-syntactic analysis system contains lexical resources and an analyzer. Electronic lexical resources for the Arabic language are not yet available or distributed for automatic processing, although much laboratory work has been carried out. For our work we defined some lexical resources, certainly partial, but which we judged sufficient to achieve our goals. Four lexicons provide all the grammatical information the analyzer needs. The lexicon of tool words contains prepositions, coordinating conjunctions, interrogative particles, etc. The lexicon of specific words contains all the non-derivable words. The lexicon of generating stems relates only to verbs. The last
1. Introduction
Currently the MBROLA project [1] supports 37 languages, and for each language one or more voices are available (72 voices in all). For the Arabic language, two diphone databases are available, AR1 and AR2. AR1 is used in the ARAB_TTS system. In this article, we focus on the Arabic language. The morphological model of Arabic is presented, in particular the concepts of word, root, affixes, regular and irregular forms and grammatical categories. Our morpho-syntactic tagger (ARAB_TAG) provides the information necessary to produce the correct phonemic transcription of the words of a text and to calculate the prosodic contour of the sentence to be synthesized (stress, pauses and intonation). The grapheme-to-phoneme transcription requires morpho-orthographical rules, phonological rules, parsers and lexicons, which were already described for the SYNTHAR+ system [2]. To produce natural-sounding speech, we were interested in
1-4244-0212-3/06/$20.00/©2006 IEEE
lexicon contains the affixes (antefixes and suffixes). To facilitate the segmentation of verbs and nouns, we pre-processed the antefixes according to their size, their tense and their attested combinations. Five conjugation modules are used for the accomplished, the unaccomplished and the imperative. For example, with tense = accomplished, pattern = 1 (فَعَلَ) and person = 12 (هُم), the result is فَعَلُوا. The tool words are gathered in tables according to their function (table of conjunction particles, table of personal pronouns, etc.). They form a closed set, and these tables contain all the associated forms: simple فِي /fii/, suffixed فِيهِ /fiihi/ and فِيهَا /fiihaa/, prefixed وَفِي /wafii/, or both at once وَفِيهِم /wafiihim/. The specific words are nouns which do not have a root in the Arabic language. They are non-derivable words and some are invariable, so they must be listed explicitly in order to be treated correctly. Among the invariable words we find proper nouns (names of countries or people) and common nouns. The lexicon of verbal stems is subdivided into three lexicons corresponding to the three tenses of the verb: for the accomplished, 49 stems for the active voice and 19 stems for the passive voice; for the unaccomplished, 33 stems for the active voice and 19 stems for the passive voice; and 43 stems for the imperative. From a prosodic point of view, a syntactic group as we define it cannot be split and does not contain a pause. A syntactic group has as its central element a verb or a noun, to which particles of coordination, relation or genitive, pronouns, etc. are attached. We defined three classes of labels: particles, verbs and nouns. A verb is characterized by its tense (accomplished, unaccomplished or imperative) and its voice (active or passive), which carry information useful for the generation of the prosody.
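The affix segmentation described in this section (بِمَدرَسَتِهِم → بِ | مَدرَسَ | تِهِم) can be illustrated by a minimal Python sketch. It is not the ARAB_TAG implementation: the Latin transliterations, the three tiny lexicons and the function name `segment` are assumptions made for illustration, and trying the longest affixes first is only one plausible reading of ordering the antefixes by size.

```python
# Hypothetical, transliterated mini-lexicons (not the paper's resources).
PREFIXES = ["wabi", "bi", "wa", ""]      # antefixes, "" = no prefix
SUFFIXES = ["tihim", "him", "hi", ""]    # suffixes, "" = no suffix
STEMS = {"madrasa"}                      # known stems


def segment(word):
    """Return (prefix, stem, suffix) for the first split whose stem is known.

    Longer affixes are tried first; None is returned when no split works.
    """
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            if stem in STEMS:
                return prefix, stem, suffix
    return None


print(segment("bimadrasatihim"))
```

Checking the stem against a lexicon keeps the segmentation deterministic, which is one of the constraints the section places on the analyzer.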
The case inflections of the nouns (definite or indefinite nominative, accusative and genitive), as well as their type, play a decisive role in the classification of the labels of this class. On the basis of these observations, we defined 35 grammatical labels for our needs. They are divided into three categories: 4 labels for the particles, 16 labels for the verbs, and 15 labels for the nouns. [4] defines 21 labels; it considers the coordinating conjunction 'و' as a lexical entity separate from the word which follows it. [5], [6], [7], [8], in their significant work on the automatic processing of the Arabic language, use a great number of grammatical labels (606 for Débili [5]), which were never used in a complete Arabic TTS system running in real time. The order in which the words of a text are processed matters, because it makes it possible to minimize labelling errors. We analyze the tool words and the specific words first, then the verbal forms and finally the nominal forms. The labelling of Arabic texts, even when entirely diacritized, can lead to cases of labelling ambiguity. For example, أَحمَدُ is either an unaccomplished verb or a noun in the nominative case. أَنَّ is either a particle of Nasb or an accomplished verb (he moaned). Contextual disambiguation rules assign a label according to the context, by examining the surrounding words. For example, أَنَّ will be labelled HarfNasb in:
تَعلَمُونَ أَنَّ السَمَاءَ لَا تُمطِرُ ذَهَبًا وَلَا فِضَّةً.
and it will receive the VerbAcc label in:
أَنَّ المَرِيضُ مِنَ الأَلَمِ الشَّدِيدِ.
For example, ARAB_TAG produces the groupings and markers below:
@هَل (£ذَهَبَ الوَلَدُ المُجتَهِدُ) | ($إلَى المَدرَسَةِ الإِبتِدَائِيَةِ؟) ##
(إذَا) | (£أَرَدنَا) | (§تَسمِيَةَ) | ($أَهَمِّ رُوَادِ النَّهضَةِ الحَدِيثَةِ) | ($فِي العَالَمِ العَرَبِيِ.) ##
($مِن رُوَّادِ النَّهضَةِ الحَدِيثَةِ) | ($فِي العَالَمِ العَرَبِي) ##
The potential breath groups (syntactic groups) are delimited by parentheses. ## indicates an obligatory pause (punctuation). @ symbolizes an interrogative marker. £ marks a verbal group, ¤ a subject group, § an accusative group and $ a genitive group. The symbol | indicates an optional pause. The four syntactic groups defined are: VG (verbal group), NSG (nominal subject group), NAG (nominal accusative group) and NGG (nominal genitive group). For example, from the sentence:
شَرِبَ طِفلٌ لَطِيفٌ الحَلِيبَ اللَّذِيذَ وَأَكَلَ رَجُلٌ البُرتُقَالَةَ المُنعِشَةَ.
we carry out the following syntactic grouping:
(شَرِبَ) (طِفلٌ لَطِيفٌ) (الحَلِيبَ اللَّذِيذَ) (وَأَكَلَ) (رَجُلٌ) (البُرتُقَالَةَ المُنعِشَةَ.)
Then we insert pauses (#) between the breath groups using simple rules:
(شَرِبَ) (طِفلٌ لَطِيفٌ) (الحَلِيبَ اللَّذِيذَ) # (وَأَكَلَ) (رَجُلٌ) (البُرتُقَالَةَ المُنعِشَةَ.)
The labels produced are used to determine the stressed syllable of each word, and also the height of this stress. On a corpus of 21313 words, ARAB_TAG produced an error rate of 1% on the labels, which led to less than 1% of errors on the boundaries of the syntactic groups. Nearly 99% of the automatically inserted pauses are correctly placed.
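To make the marker notation above concrete, the following Python sketch reads a tagged string into a list of labelled groups and pauses. It only illustrates the notation (parentheses, £ ¤ § $, |, ##, @); the function name `parse_groups`, the output tuples and the Latin transliteration used in the example are assumptions, and the sketch does not reproduce the pause-insertion rules of ARAB_TAG, which the text does not spell out.

```python
import re

# Group markers and the group names defined in the text.
GROUP_LABELS = {"£": "VG", "¤": "NSG", "§": "NAG", "$": "NGG"}

# A parenthesized group with an optional leading marker, a pause symbol,
# or the interrogative marker followed by its particle.
TOKEN = re.compile(
    r"\((?P<marker>[£¤§$]?)(?P<text>[^()]*)\)"
    r"|(?P<pause>##|\|)"
    r"|@(?P<question>\S*)"
)


def parse_groups(tagged):
    """Turn a tagged sentence into (label, value) pairs, in order."""
    items = []
    for match in TOKEN.finditer(tagged):
        if match.group("pause") is not None:
            kind = "obligatory" if match.group("pause") == "##" else "optional"
            items.append(("PAUSE", kind))
        elif match.group("question") is not None:
            items.append(("INTERROGATIVE", match.group("question")))
        else:
            label = GROUP_LABELS.get(match.group("marker"), "UNMARKED")
            items.append((label, match.group("text").strip()))
    return items


# Transliterated stand-in for a tagged interrogative sentence (assumption).
for item in parse_groups("@hal (£Dahaba lwaladu) | ($?ilaa lmadrasati) ##"):
    print(item)
```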
3. Graphemes to Phonemes Transcription
Some pre-processing is necessary before carrying out the grapheme-to-phoneme conversion. A text can contain numbers (1999), dates (01/01/1998) or times (12:30), various monetary units (€, $, د.ج), usual abbreviations (كلغ) and particular abbreviations such as (ق.ر.). For more information, the reader can refer to [2]. For the transcription itself we combine two methods: exception lexicons (الخ → إِلَى أَخِرِهِ, لَكِنَ → لَاكِنَ, هَؤُلَاءِ → هَاؤُلَاءِ, و.ت.ع.ب.ع → وِزَارةُ التَعلِيمِ العَالِي وَ البَحثِ العِلمِى) and a database of transcription rules. In order to describe the context of the graphemes as precisely as possible, the rules have this form: LC + CT + RC → / PS /, that is, Left Context + Characters to Transcribe + Right Context → / Phonetic Sequence /. The phonetic sequence can also indicate a complementary action to carry out. Because the number of rules (250) is rather large, we grouped the graphemes appearing in the left and right contexts (LC and RC) into 9 classes: 0 (A: any character), 1 (C: any consonant), 2 (LC: lunar consonant), 3 (SL: solar consonant), 4 (V: vowel), 5 (SV: short vowel), 6 (LV: long vowel), 7 (EC: emphatic consonant), 8 (NEC: non-emphatic consonant). Some rules are described below.
8a0 → /a/       دَخَلَ /daXala/
7a0 → /a./      ضَرَبَ /d.a.raba/
7uwa# → /uu./   رَضُوا /rad.uu./
0A0 → /?aa/     آمَنَ /?aamana/
0R0 → /t./      طَرَقَ /t.a.raqa/
1Eé# → /an/     فَتًى /fatan/
1k²4 → /kk/     رَكَّبَ /rakkaba/
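The rule format LC + CT + RC → / PS / can be sketched in a few lines of Python. This is a toy matcher under stated assumptions, not the 250-rule database of the system: only the two rules 7a0 → /a./ and 8a0 → /a/ are encoded, the emphatic set and the consonant table are partial and assumed, and the names `transcribe` and `with_fatha` are hypothetical.

```python
FATHA = "\u064e"  # Arabic short vowel "a"

# Assumption: a partial emphatic set (sad, dad, tah, zah).
EMPHATIC = {"\u0635", "\u0636", "\u0637", "\u0638"}

# Assumption: a partial consonant table, just enough for the two examples.
CONSONANTS = {
    "\u062f": "d",   # dal
    "\u062e": "X",   # kha
    "\u0644": "l",   # lam
    "\u0636": "d.",  # dad (emphatic)
    "\u0631": "r",   # ra
    "\u0628": "b",   # ba
}

# Context classes, numbered as in the text (only the ones used here).
CLASSES = {
    0: lambda ch: True,                                     # any character
    7: lambda ch: ch in EMPHATIC,                           # emphatic consonant
    8: lambda ch: ch in CONSONANTS and ch not in EMPHATIC,  # non-emphatic
}

# (left class, characters to transcribe, right class, phonetic sequence)
RULES = [
    (7, FATHA, 0, "a."),
    (8, FATHA, 0, "a"),
]


def transcribe(word):
    """Apply the first matching context rule at each position.

    Characters that match no rule are looked up in the consonant table;
    a character missing from that partial table raises KeyError.
    """
    out = []
    i = 0
    while i < len(word):
        left = word[i - 1] if i > 0 else ""
        for lc, ct, rc, ps in RULES:
            right = word[i + len(ct): i + len(ct) + 1]
            if word.startswith(ct, i) and CLASSES[lc](left) and CLASSES[rc](right):
                out.append(ps)
                i += len(ct)
                break
        else:
            out.append(CONSONANTS[word[i]])
            i += 1
    return "".join(out)


def with_fatha(consonants):
    """Put a fatha after every consonant (helper for the examples)."""
    return "".join(c + FATHA for c in consonants)


print(transcribe(with_fatha("\u062f\u062e\u0644")))  # dal, kha, lam
print(transcribe(with_fatha("\u0636\u0631\u0628")))  # dad, ra, ba
```

Grouping graphemes into classes keeps the rule list short: one rule per class and target replaces one rule per individual consonant.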
naX&nu*~mas*ruu&ruun*~?an*~nu*qad&di*ma*~la&kum~ni*Maa&ma*nan*~naa&Ri*qal*~?aa&lii*~lin*nu*Buu&Bil*~Ha&ra*bi*ja~
'&' indicates that the stress is carried by the preceding syllable, '*' indicates a non-stressed syllable and '~' is the word separator. Isolated words in Arabic receive a stress, which is carried by the stressed syllable; the F0 parameter of a stressed syllable is very significant. At the sentence level, the intonation contour is built from the intonation contours calculated for the syntactic groups which constitute the sentence. To estimate the F0 contour we apply the following rules. The frequency at the beginning of the sentence is 120 Hz. We denote by FC the frequency of the current phoneme. For each lexical stress we add a FAL (frequency of the lexical stress) to the frequency at the beginning of the stressed word. This lexical stress relates to a word or a group of words. This operation is carried out before the calculation of the slopes of the words. For each word the slopes are calculated according to the GSC. The frequency of the stressed syllable (marker &) is equal to FC + 20 Hz. Unvoiced phonemes have a null frequency. The sentence ذَهَبَ الوَلَدُ إلَى المَدرَسَةِ. will be tagged (£ذَهَبَ ¤الوَلَدُ) ($إلَى المَدرَسَةِ.), and the result of the syllabication, word stressing and lexical stressing is: £Da&ha*bal*¤wa&la*du*$?i&lal*mad&ra*sa*ti*##. The prosodic markers are interpreted as follows: £ is an intonation rise of +20 Hz, ¤ a rise of +25 Hz, § a rise of +30 Hz and $ a rise of +25 Hz. Acoustic interpretation: we associate 8 parameters with each phoneme. For example, for the phoneme a with a decreasing frequency, we generate the parameters (a 77 15 138 50 135 75 132), where a is the phoneme, 77 is the duration, and 138, 135 and 132 are the pitch values at 15%, 50% and 75% of the total duration of the phoneme.
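The notation used in this passage can be read mechanically, as the following Python sketch suggests. It splits a marked string into syllables with their stress flag and current group marker, and formats one phoneme line with the 8 parameters listed above. The function names `parse_syllables` and `pho_line` and the constant names are assumptions; the sketch stores the rise values quoted in the text but does not implement the slope calculation, whose details are not given.

```python
import re

# Values quoted in the text.
START_F0_HZ = 120
STRESS_RISE_HZ = 20
MARKER_RISE_HZ = {"£": 20, "¤": 25, "§": 30, "$": 25}

# Either a group marker, or a syllable followed by an optional
# stress flag ('&' stressed, '*' unstressed). '~' and '#' are skipped.
TOKEN = re.compile(
    r"(?P<marker>[£¤§$])|(?P<syl>[^&*~#£¤§$\s]+)(?P<flag>[&*]?)"
)


def parse_syllables(annotated):
    """Return (syllable, is_stressed, group_marker) triples, in order."""
    current = None
    syllables = []
    for match in TOKEN.finditer(annotated):
        if match.group("marker"):
            current = match.group("marker")
        else:
            stressed = match.group("flag") == "&"
            syllables.append((match.group("syl"), stressed, current))
    return syllables


def pho_line(phoneme, duration_ms, targets):
    """Format phoneme, duration and (position %, pitch Hz) pairs on one line."""
    fields = [phoneme, str(duration_ms)]
    for position, pitch in targets:
        fields += [str(position), str(pitch)]
    return " ".join(fields)


syllables = parse_syllables("£Da&ha*bal*¤wa&la*du*$?i&lal*mad&ra*sa*ti*##")
print([s for s, stressed, _ in syllables if stressed])
print(pho_line("a", 77, [(15, 138), (50, 135), (75, 132)]))
```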
The intonation curve of the sentence "خَرَجَ الوَلَدُ مِنَ المَنزِل" ("the child went out of the house") is shown below.
Xa&ra*Zal*¤wa&la*du*$mi&nal~man&zil*##
[Figure: F0 contour of this sentence. Vertical axis: 0 to 200; horizontal axis: the phonemes X a r a Z a l w a l a d u m i n a l m a n z i l.]

For example, the sentence:
الأَحَد 10 جَنفِي 1999 م المُوَافِق ل 23 رَمَضَان 1419 هـ.
is converted into the phonetic string:
?al?aXad~HaSara~Zanfii~?alfun~wa~tisHa~mi?ata~wa~tisHa~wa~tisHuuna~milaadil~muwaafiq~li~TalaaTa~wa~HiSruuna~ramaQaan~alfun~wa~?arbaHa~mi?ata~wa~tisHa~HaSara~hiZrii.

4. Prosodic generation
Among the three prosodic parameters (fundamental frequency, duration and intensity), the duration of the sounds remains the most difficult to model. It depends on the context in which the phonemes are realized: nature, size and structure of the syllable, stress, etc. In recent years, many works [9], [10], [11] on phoneme duration and intonation models have been undertaken. In [3] we identified the effects of the immediate context on the duration of the phonemes, the lengthening of geminated consonants, and the influence of the syllabic structure on duration in continuous speech. To calculate the intonation contour (stressing of the text), we have to determine the zones where F0 reaches its maximum value. We distinguish two types of stress: the stress inherent to a word, which depends on the syllabic structure of the word, and the global stress or intonation, which relates to the sentence (syntactic groups). The number of syllables of each word is calculated; it is the global syllabic coefficient (GSC). It is used to calculate the duration of the phonemes and the slope of the intonation. Example:
نَحنُ مَسرُورُون أَن نُقَدِّمَ لَكُم نِظَامَنَا النَاطِقَ الآلِي لِلنُصُوصِ العَرَبِيَةِ.

Phonemes      Syllables          GSC   Word
naXnu         CVC*CV*            2     نَحنُ
masruuruun    CVC*CVV*CVVC*      3     مَسرُورُون
?an           CVC*               1     أَن
nuqaddima     CV*CVC*CV*CV*      4     نُقَدِّمَ
lakum         CV*CVC             2     لَكُم
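The global syllabic coefficient is simply a syllable count, so it can be illustrated with a short Python sketch. The approach below is an assumption, not the paper's algorithm: it counts vowel nuclei in the phonetic transliteration, relying on the fact that each of the syllable types shown (CV, CVV, CVC, CVVC) contains exactly one short or long vowel. The function name `gsc` is hypothetical.

```python
import re


def gsc(phonetic_word):
    """Count syllables as runs of the vowels a, i, u (a long vowel is one run)."""
    return len(re.findall(r"[aiu]+", phonetic_word))


words = ["naXnu", "masruuruun", "?an", "nuqaddima", "lakum"]
print([gsc(w) for w in words])
```

Counting nuclei avoids a full syllabification pass when only the GSC is needed; words whose transliteration puts two different vowels side by side would need a real syllabifier.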
5. Evaluation
The objective is to evaluate the global perception of the vocal message in terms of quality, naturalness and acceptability, and the analytical perception concerning the identification of the various phonemes, the stress and the intonation. In this evaluation, we use a corpus of 200
The syllabication and the stressing of the words of the sentence produce:
isolated words (1 to 5 syllables) and 42 sentences comprising 389 words overall. Ten listeners with different levels of knowledge of the Arabic language (nine, plus one visually impaired person) took part in the tests. The listeners had never used speech synthesis technologies for the Arabic language. A first session with ARAB_TTS (without evaluation) was carried out. The listeners could listen to some messages (several times) and adjust the following parameters: speed (1-10), volume (1-10), pitch (1-5), and whether or not the final vowels are pronounced. A first opinion on the global acceptability of the system was requested. ARAB_TTS was accepted by all the listeners, and especially well by the visually impaired listener. In the global evaluation of the system, a score from 1 to 4 is given (1: bad, 2: average, 3: good, 4: very good). REF is the original signal, REC the signal obtained by copying the natural prosody, PLA the signal produced without prosody (only phonemes with a fixed duration) and PRO the prosodic signal produced by ARAB_TTS. The scores for the global intelligibility of ARAB_TTS are: REF = 4.00, REC = 3.82, PRO = 3.07, PLA = 1.83. The synthetic signals obtained by prosody copy have a very good score of 3.82. The synthetic signal obtained automatically by ARAB_TTS is good, with a score of 3.07. The signal without prosody (PLA) is very understandable but completely mechanical. In the analytical evaluation, we asked the listeners to transcribe only the words that were badly understood. Few errors were detected in ARAB_TTS. The gemination of the occlusive consonant ت (t) was badly perceived. This perception was corrected by increasing the duration of this consonant. Its average duration in context, calculated on the initial corpus, was relatively low (65 ms). A finer analysis of the results on the basic corpus showed that this consonant, when geminated, appeared in sentences pronounced at a relatively high speed.
The duration of the consonant س (s) was also increased in words where it appears geminated and followed by another س, as in أُسّسَت. It was heard as ?asasat instead of ?assasat.
The linguistic processing module makes it possible to carry out the necessary pre-processing of the input string to be synthesized and the grapheme-to-phoneme conversion. The pre-processing module converts all the non-alphabetic entities of a text into an alphabetic string, which is then passed to the grapheme-to-phoneme transcription. The module that generates the durations and the stress is also effective. Duration is a major factor in the perception of the phonemes (gemination of consonants and lengthening of vowels), while intonation and stress complete the comprehension of a text.
7. References
[1] T. Dutoit, V. Pagel, N. Pierret, F. Bataille, and O. Van Der Vreken, “The MBROLA Project: Towards a Set of High-Quality Speech Synthesizers Free of Use for Non-Commercial Purposes”, Proc. ICSLP'96, Philadelphia, vol. 3, pp. 1393-1396.
[2] Z. Zemirli, “SYNTHAR+: Arabic Text to Speech Synthesis under Multivox”, TSI, vol. 17, no. 6/98, pp. 741-761.
[3] Z. Zemirli, N. Vigouroux, “Prediction of the sounds duration in an Arabic Text To Speech System”, Specom'2001, Moscow, 29-31 October 2001, pp. 205-209.
[4] S. Baloul, M. Alissali, M. Baudry, P. Boula de Mareüil (2002), “Interface syntaxe prosodie dans un système de synthèse de la parole à partir du texte en arabe”, XXIVèmes JEP, Nancy, pp. 329-332.
[5] F. Débili, H. Achour, E. Souici (2002), “The Arabic language and the computer: grammatical labelling with the automatic voyellation”, Correspondances de l'IRMC, no. 71, pp. 10-28.
[6] S. El-Kareh, S. Al-Ansary (2000), “An Arabic Multifeature POS Tagger”, in Proceedings of the ACIDCA conference, Monastir, Tunisia, pp. 204-210.
[7] S. Khoja, R. Garside, G. Knowles (2001), “A Tagset for the Morphosyntactic Tagging of Arabic”, Proceedings of Corpus Linguistics 2001, Lancaster University (UK), Volume 13, Special issue, p. 341.
[8] R. Ouersighni (2001), “A major offshoot of the DIINAR-MBC project: AraParse, a morphosyntactic analyzer for unvowelled Arabic texts”, in ACL 39th Annual Meeting, Workshop on Arabic Language Processing: Status and Prospects, Toulouse, pp. 9-16.
6. Conclusion
In terms of global intelligibility, the ARAB_TTS system was judged to have rather good naturalness and good acceptability. The visually impaired listener accepted the system very well. The intonation lacks naturalness, which is explained by the fact that our prosodic model does not yet take micro-prosodic phenomena into account. The pauses are well placed and well perceived. Our objective was to design and build a speech synthesis system, starting from diacritized Arabic texts, that is as intelligible and natural as possible. The first criterion, intelligibility, is met thanks to the linguistic processing module.
[9] M. Jomaa, “L’opposition de durée vocalique en arabe : Essai de typologie”, JEP’94, pp. 395-400.
[10] A. Chehab, A. Zaki, A. Rajouani (2000), “Un modèle neuronal pour la prédiction de la durée des syllabes de la langue arabe”, XXIIIèmes JEP, Aussois, pp. 97-100.
[11] A. Zaki, A. Rajouani, M. Najim (2001), “Contours intonatifs de la phrase interrogative en arabe”, XXIIIèmes Journées d’Etude sur la Parole, Aussois, 19-23 juin, pp. 249-252.