A Text-to-Speech System for Arabic Using Neural Networks
Sihem Ben Sassi, Rafik Braham, Abdelfattah Belghith
École Nationale des Sciences de l'Informatique, Tunisia
{Sihem.BenSassi, Rafik.Braham, Abdelfattah.Belghith}@ensi.rnu.tn
Abstract
Text-to-speech synthesis is of great interest and has numerous applications; for this reason, it has attracted researchers for decades. Two methods are usually used: synthesis by rule and synthesis by concatenation of pre-recorded sounds. But these methods have some disadvantages, such as the difficulty of adapting them to a new speaker or a new language. Recently, neural networks (NNs) have been applied to non-conventional problems where a traditional solution seems impossible, and text-to-speech appears to be one of them. In this field, it has been shown that NNs do not work well when they are directly fed with speech samples. Therefore, work has been done to explore and evaluate different parametric forms of speech based on LPC for training, and LSP was found to produce the best results. However, these methods do not take the residual signal into account, and the speech produced was machine-like rather than natural. We propose in this paper to drive the NN with CELP, which provides high quality speech, to perform Arabic speech synthesis.
Introduction
Speech is the most natural and widespread form of human communication. This is why manufacturers try to equip their machines with hearing and speaking capabilities. Speech synthesis systems are used in many important applications and have proved to be useful aids to the disabled, for example by means of talking books and magazines. The way in which a word is spoken can vary greatly according to context, meaning and the state of the speaker, which unfortunately makes natural speech synthesis very difficult. Developing an unlimited text-to-speech system is an enormous task. The two main methods generally used are synthesis by rule [1] and synthesis by concatenation of pre-recorded speech sounds [2]. But they require an exhaustive study and a lot of work, and they are language and speaker dependent. In this paper, we explore the alternative of using neural networks to perform text-to-speech synthesis of the Arabic
language through a mapping between phonetic symbols and the corresponding speech parameters. The implementation of such a system involves many steps: studying the characteristics of Arabic phonemes in order to prepare the neural network input, deciding which unit to use for synthesis (phoneme, diphone, ...), and choosing a representative corpus and segmenting it according to the chosen unit. The steps also include performing speech coding through an analysis to obtain the neural network output, and determining the appropriate parameters for neural network training and testing. The structure of the remaining sections is as follows. First, a review of related work is presented, followed by the main features of the Arabic language. Then we present the main methods of speech analysis. The fifth section deals with the reasons for and advantages of using neural networks in speech synthesis. Finally, our system is described and the results are discussed.
Related Work
Traditionally, two main methods are used:
- Speech synthesis by rule [1], where the formant parameters for an utterance are interpolated between target values tabulated for each allophone. However, to achieve good quality speech output, considerable manual effort must be expended to fine-tune the tables of interpolation parameters and to concoct a set of ad-hoc rules that compensate for the deficiencies of the table-based approach.
- Speech synthesis by concatenation of pre-recorded speech units [2], which is essentially a compromise between speech quality and storage requirements. The larger the unit of speech used, the more co-articulation effects are included in each unit, resulting in improved speech quality. However, a large unit is also more specific, so more units are required for a given vocabulary. With a small unit, such as a phoneme or diphone, it is not enough to simply concatenate units; techniques must also be applied to impose synthetic prosody by appropriate signal processing, such as TD-PSOLA (Time Domain Pitch Synchronous OverLap and Add). TD-PSOLA acts simultaneously and independently on the three prosodic parameters, energy, rhythm and pitch, to
improve speech quality, but it is known to suffer from spectral and phase distortions. These are partly due to the time-domain nature of the processing, in which the spectral envelope cannot be adequately controlled [3]. Other techniques, such as those described in [3], can also be used. These two methods are not easy to implement, especially the first, and they are not easily adapted to a new speaker if we want to change the voice of the synthesised speech, let alone to a new language.
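To make the overlap-add idea concrete, the sketch below implements a naive time-domain time-scale modification in Python. It is not TD-PSOLA itself: a real implementation places the grains pitch-synchronously around detected pitch marks and also modifies pitch and energy. The frame and hop sizes here are arbitrary choices of ours.

```python
import numpy as np

def ola_time_stretch(x, rate, frame=512, hop=128):
    """Naive overlap-add time-scale modification: windowed grains are read
    at a different rate than they are written back.  Real TD-PSOLA positions
    the grains pitch-synchronously around pitch marks instead."""
    window = np.hanning(frame)
    n_out = int(len(x) / rate)
    y = np.zeros(n_out + frame)
    norm = np.zeros(n_out + frame) + 1e-8     # avoid division by zero
    out_pos = 0
    while out_pos + frame < n_out:
        in_pos = int(out_pos * rate)          # where this grain is read from
        if in_pos + frame > len(x):
            break
        y[out_pos:out_pos + frame] += window * x[in_pos:in_pos + frame]
        norm[out_pos:out_pos + frame] += window
        out_pos += hop
    return y[:n_out] / norm[:n_out]

# Toy usage: stretch a 1 s, 8 kHz tone to 4/3 of its duration.
x = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000)
y = ola_time_stretch(x, rate=0.75)
```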
Main Features of the Arabic Language
The Arabic language is essentially consonantal; it comprises 28 consonants and 3 vowels. It presents some specific phenomena such as "grasseyement" (uvular articulation) and "emphatisation" (pharyngealisation) [4]. Consonants are classified according to their articulatory features:
- phonetic type: plosive, fricative, nasal, etc.
- place of articulation: labial, dental, palatal, etc.
- organ of articulation: apical, dorsal, labial, etc.
- character: emphatic, aspirant, "chuintant" (hushing), etc.
The three vowels are characterised by two classes of localisation, back/front and open/closed, besides their duration, short/long. It is important to note the role of context in speech, that is, the influence of one phoneme on another. The same phoneme picked from two different words is generally not pronounced in the same way. This depends on the characteristics of the neighbouring phonemes within the same word; this influence rarely extends from one word to another.
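For illustration, such a classification can be represented as feature bundles, as in the small Python table below. The particular symbols and feature assignments are our own simplified examples, not the exact inventory used later in the paper.

```python
# Hypothetical feature bundles for a few phonemes; the real system encodes
# 36 such binary characteristics per phoneme (see the System Description).
FEATURES = {
    "b":  {"type": "plosive",   "place": "labial", "emphatic": False},
    "t":  {"type": "plosive",   "place": "dental", "emphatic": False},
    "T":  {"type": "plosive",   "place": "dental", "emphatic": True},   # emphatic /t/
    "s":  {"type": "fricative", "place": "dental", "emphatic": False},
    "a":  {"type": "vowel", "localisation": ("front", "open"), "duration": "short"},
    "a:": {"type": "vowel", "localisation": ("front", "open"), "duration": "long"},
}
```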
LPC/CELP Speech Coding
Linear predictive coding (LPC) [5] provides an automatic and computationally efficient coding technique. It performs a linear prediction of the next speech sample as a weighted sum of the p past samples:

\hat{s}(n) = \sum_{i=1}^{p} a_i \, s(n-i)

which corresponds to an all-pole filter with transfer function:

H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}}
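As a concrete illustration of the analysis step, the sketch below estimates the coefficients a_i of one frame with the autocorrelation method and the Levinson-Durbin recursion. It is a minimal sketch, not the coder used in the paper; the order p = 10 and the 30 ms frame at 8 kHz are our assumptions.

```python
import numpy as np

def lpc(frame, p=10):
    """Estimate LPC coefficients a_1..a_p of one windowed speech frame using
    the autocorrelation method and the Levinson-Durbin recursion."""
    # Autocorrelation r[0..p] of the frame.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)              # a[0] is implicitly 1 in A(z)
    e = r[0]                         # prediction error energy
    for i in range(1, p + 1):
        # Reflection coefficient for order i.
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        e *= (1.0 - k * k)
    return a[1:], e                  # predictor weights a_i and residual energy

# Toy usage on a synthetic voiced-like frame (240 samples = 30 ms at 8 kHz).
frame = np.sin(2 * np.pi * 100 * np.arange(240) / 8000) + 0.01 * np.random.randn(240)
coeffs, err = lpc(frame * np.hamming(240))
```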
Unfortunately, the filter coefficients a_i exhibit an extremely high degree of spectral sensitivity. Fortunately, a number of equivalent formulations with lower spectral sensitivity and better interpolation properties have been developed [6]: the autocorrelation method, partial correlation, log area ratio coding, line spectral pair (LSP) representation, etc. Previous work [7], [8] has confirmed that the LSP representation has excellent quantisation and interpolation properties for use in low bit rate coding. These properties were found to be useful in NN training and to give the best results among those obtained by other methods. Line spectral pair coding records the frequencies of the zeros of two polynomials P(z) and Q(z), related to the predictor polynomial A(z) by the following equations:

P(z) = A(z) - z^{-(p+1)} A(z^{-1})

Q(z) = A(z) + z^{-(p+1)} A(z^{-1})

The synthesis filter can then be reconstructed thanks to the following equation:

A(z) = \frac{P(z) + Q(z)}{2}

Speech resulting from an LPC analysis/synthesis is not of good quality and sounds machine-like, since LPC provides only the spectral envelope. The residual therefore contains important information about how the speech should sound, and LPC synthesis without this information results in poor speech quality. For better quality, various attempts have been made to encode the residual signal efficiently. The most successful methods use a codebook, a table of typical residual signals. In operation, the analyser compares the residual to all the entries in the codebook, chooses the entry that is the closest match, and simply sends the code for that entry. The synthesiser receives this code, retrieves the corresponding residual from the codebook, and uses it to excite the filter; this is the origin of the name CELP: Codebook Excited Linear Prediction. This method results in high quality speech.
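The codebook search can be illustrated in a few lines. The sketch below is only conceptual: a real CELP coder adds gain terms, adaptive plus stochastic codebooks and perceptual weighting, none of which is shown, and the random codebook and filter coefficients are placeholders of ours.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)

# Hypothetical stochastic codebook: 512 typical residual frames of 240 samples.
codebook = rng.standard_normal((512, 240))

def celp_encode(frame, a):
    """Return the index of the codebook entry closest to the LPC residual."""
    # Inverse filtering by A(z) = 1 - sum(a_i z^-i) yields the residual.
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)
    distances = np.sum((codebook - residual) ** 2, axis=1)
    return int(np.argmin(distances))

def celp_decode(index, a):
    """Excite the all-pole synthesis filter 1/A(z) with the chosen entry."""
    return lfilter([1.0], np.concatenate(([1.0], -a)), codebook[index])

# Toy usage with arbitrary (stable) predictor coefficients.
a = np.array([0.5, -0.1])
frame = rng.standard_normal(240)
idx = celp_encode(frame, a)
resynth = celp_decode(idx, a)
```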
Figures 1, 2 and 3 below show, respectively, the waveform of an original sentence, the waveform of the same sentence obtained by an LPC analysis/synthesis, and the waveform obtained by a CELP analysis/synthesis. One can easily notice the poor quality of the LPC-synthesised speech and the human-like quality of the CELP-synthesised speech.

Figure 1: The original sentence.

Figure 2: The sentence obtained after an LPC synthesis.
Figure 3: The sentence obtained after a CELP synthesis.
Speech Synthesis and Neural Networks
NNs have proven to be appropriate tools for problems that need a sophisticated smoothing algorithm when there are insufficient training examples to populate the input space, or for problems where the form of the solution is unknown and there is no obvious parametric way to deduce it. They are also appropriate in cases where a series of transformations can exploit high order correlations in the input data [9]. Speech synthesis appears to be one of these problems. Moreover, since NNs are trained on actual speech samples, they have the potential to generate more natural-sounding speech than other synthesis technologies. While a rule-based system requires the generation of language dependent rules, a neural network based system is directly trained on actual speech data. It is therefore language independent, provided that a phonetically transcribed database exists for the given language. Concatenative systems can require several megabytes of data, which may not be feasible in a portable product; NNs can produce a much more compact representation by eliminating the redundancies present in the concatenative approach [10]. Thanks to all these properties, NNs seem well suited for speech synthesis [7], [9], [10].
In Arabic, neural networks have not yet been applied to speech synthesis. A system based on the concatenation of diphones has been implemented; it uses TD-PSOLA for prosody modification. The quality of the speech obtained is not fully satisfactory, and more effort is thus needed to improve it.

System Description
To perform text-to-speech synthesis for the Arabic language, it is necessary to have a database of recorded units for training. Since no such database was available, we chose phonetically balanced lists as described in [11], each formed by ten sentences. After recording the sentences, the obtained signal had to be segmented. It was difficult to decide which synthesis unit, phoneme or diphone, should be used. Diphones include the transition between phonemes, so the co-articulation phenomenon is captured within the units. On the other hand, NNs have been shown to model co-articulation between phonemes thanks to their interpolation property and great capacity for generalisation, so phonemes and diphones can be used interchangeably. Since the choice was not clear, we decided to use both types of units and compare the results. The recorded sentences were then segmented, giving a database of about 1444 phonemes and another of 1524 diphones.
The next design step was to define the input and output of the neural network. First, the text of the chosen lists was transcribed into a phonemic representation. Then, each phoneme was transcribed into the corresponding articulatory features, together with its position in the word. This results in a binary vector of 36 characteristics: each element is 1 if the phoneme verifies the corresponding characteristic and 0 otherwise. Our neural network must map this phonemic representation to the corresponding speech parameters, which are generated through a CELP analysis of the units. CELP analyses each unit frame by frame, with a frame duration of 30 ms; to each frame corresponds a vector of 36 hexadecimal numbers. From the CELP analysis and the articulatory-feature transcription of the phonemes, the neural network input and output files can be generated. Figure 4 below illustrates these different steps.
Figure 4: Basic architecture of our system (recording and segmentation, transcription into articulatory features, CELP analysis into frames, NN training, synthesis).
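Before describing the network itself, the assembly of input/output pairs just outlined can be sketched as follows. The feature table, the 36-dimensional encoding and the celp_analyse routine are stand-ins of ours for the ones actually used.

```python
import numpy as np

N_FEATURES = 36      # binary articulatory features and word position, per the paper
N_CELP = 36          # CELP parameters produced per 30 ms frame

def encode_phoneme(phoneme, feature_table):
    """Map a phoneme to its binary articulatory-feature vector."""
    vec = np.zeros(N_FEATURES)
    for idx in feature_table[phoneme]:       # indices of the features it verifies
        vec[idx] = 1.0
    return vec

def make_training_pairs(units, feature_table, celp_analyse):
    """Build (input, target) pairs: one pair per CELP frame of each unit.
    An extra input component indexes the frame within the unit."""
    inputs, targets = [], []
    for phoneme, signal in units:
        frames = celp_analyse(signal)        # -> list of N_CELP-element vectors
        feats = encode_phoneme(phoneme, feature_table)
        for j, frame_params in enumerate(frames):
            index = j / max(len(frames) - 1, 1)   # frame position in [0, 1]
            inputs.append(np.concatenate((feats, [index])))
            targets.append(np.asarray(frame_params))
    return np.array(inputs), np.array(targets)
```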
Network Architecture and Results
As mentioned above, phonemes and diphones were both tested. But using phonemes requires taking the effect of context into account by providing the phonemes
preceding and following the current one. The question then is how many preceding and how many following phonemes must be considered. We tested and compared the results for 0, 1, 2 and 3 context units on the right and on the left. A network architecture similar to that used in NetTalk was employed. The input layer forms a sliding window over the input stream of units. It consists of one group of neurones representing the unit to synthesise and 2n context units: n to the left and n to the right. Each phoneme is represented by a vector of articulatory features, with n ∈ {0, 1, 2, 3}. Each diphone is represented by the articulatory features of its two phonemes, with n ∈ {0, 1}. The neural network must perform a mapping between the articulatory features and the corresponding CELP parameters, which are output frame by frame. An additional input neurone acts as an index indicating to the network how many frames must be produced. We used a variant of the back-propagation algorithm to train the network and applied a sigmoidal activation function. The number of neurones in the hidden layer was varied during the tests. Figure 5 presents the results obtained: it shows the minimum RMS error obtained during training and testing for each type of input. We can see that using phonemes or diphones has no great effect on the results.
Figure 5: Minimum RMS error during training and test for each synthesis unit (diph0, diph1, phon0, phon1, phon2, phon3).
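A minimal version of the network just described could look as follows: one hidden layer, sigmoidal activations and plain gradient descent stand in for the back-propagation variant actually used, and all layer sizes and the context width n = 2 are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MLP:
    """One-hidden-layer sigmoidal network trained by plain back-propagation."""
    def __init__(self, n_in, n_hidden, n_out):
        self.W1 = rng.standard_normal((n_in, n_hidden)) * 0.1
        self.W2 = rng.standard_normal((n_hidden, n_out)) * 0.1

    def forward(self, x):
        self.h = sigmoid(x @ self.W1)
        self.y = sigmoid(self.h @ self.W2)
        return self.y

    def train_step(self, x, target, lr=0.1):
        y = self.forward(x)
        # Output and hidden deltas for squared error with sigmoid units.
        d_out = (y - target) * y * (1 - y)
        d_hid = (d_out @ self.W2.T) * self.h * (1 - self.h)
        self.W2 -= lr * np.outer(self.h, d_out)
        self.W1 -= lr * np.outer(x, d_hid)
        return float(np.mean((y - target) ** 2))

# Input: 36 features for the centre unit and for n context units on each side,
# plus one frame-index neurone; output: one 36-element CELP parameter frame.
n = 2
net = MLP(n_in=36 * (2 * n + 1) + 1, n_hidden=40, n_out=36)
x = rng.random(36 * (2 * n + 1) + 1)
t = rng.random(36)
for _ in range(10):
    rmse = np.sqrt(net.train_step(x, t))
```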
Discussion
The results obtained in this way are not fully satisfactory. This can be attributed in part to the size of the database, but the neural network architecture itself had to be reviewed: it seems too simple to handle the complex task at hand. We therefore proposed to use a set of neural networks, one for each phoneme. The CELP parameters obtained from each network are synthesised, and the resulting signals are assembled to form the word or sentence.
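This per-phoneme scheme can be sketched as a simple dispatch loop. The helper names (nets, encode, celp_decode) and the fixed number of frames per unit are hypothetical stand-ins for the components described above.

```python
import numpy as np

def synthesise(phoneme_seq, nets, encode, celp_decode, frames_per_unit=5):
    """Per-phoneme ensemble synthesis: each phoneme is handled by its own
    trained network; decoded frames are concatenated into one waveform."""
    pieces = []
    for phoneme in phoneme_seq:
        net = nets[phoneme]                      # one network per phoneme
        feats = encode(phoneme)                  # 36-element feature vector
        for j in range(frames_per_unit):
            index = j / max(frames_per_unit - 1, 1)
            params = net(np.concatenate((feats, [index])))
            pieces.append(celp_decode(params))   # 30 ms of waveform per frame
    return np.concatenate(pieces)
```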
This new architecture was tested with the sentence of Figure 1. The waveform obtained, shown in Figure 6, is comparable to the best one that can be obtained (Figure 3).

Figure 6: The sentence obtained after a neural synthesis.
Conclusion
In this paper, we have explored the use of neural networks in speech synthesis, particularly for the Arabic language. We have found that NNs seem very promising, since they can provide quality speech provided that they are fed with quality speech parameters; this can be achieved by using the CELP method. More effort is needed to complete this work by enlarging the database and optimising the neural network architecture. Nevertheless, some questions, such as prosody, remain open for research.
References
[1] Holmes J.N. et al. "Speech synthesis by rule"; Language and Speech 7, pp. 127-143, 1964.
[2] Emerard F. "Synthèse par diphones et traitement de la prosodie"; Thèse de troisième cycle, Université des langues et des lettres, Grenoble, 1977.
[3] Edgington E. et al. "Residual-based speech modification algorithms for text-to-speech synthesis"; Fourth International Conference on Spoken Language Processing, 1996.
[4] Djoudi M. "Contribution à l'étude et à la reconnaissance automatique de la parole arabe standard"; Thèse de doctorat, Université de Nancy I, 1993.
[5] Rabiner L.R. et al. "Digital Processing of Speech Signals"; Prentice-Hall, Inc., 1978.
[6] Robinson T. "Speech Analysis"; Lent Term lecture notes, 1998.
[7] Cawley G.C. "The application of neural networks to phonetic modelling"; PhD thesis, University of Essex, March 1996.
[8] Phamdo N.C. "Coding of speech LSP parameters using tree-searched vector quantization"; Master's thesis, Faculty of the Graduate School, University of Maryland, 1998.
[9] Tuerk C. et al. "Speech synthesis using neural networks trained on cepstral coefficients"; Cambridge University Engineering Department, Cambridge, England, 1993.
[10] Karaali O. et al. "Speech synthesis with neural networks"; World Congress on Neural Networks, San Diego, pp. 45-50, September 1996.
[11] Boudraa M. et al. "Élaboration d'une base de données arabe phonétiquement équilibrée"; Actes du colloque Langue Arabe et Technologies Informatiques Avancées, Casablanca, décembre 1993.