A SPEECH SYNTHESIZER FOR PERSIAN TEXT USING A NEURAL NETWORK WITH A SMOOTH ERGODIC HMM

F. Hendessi*, A. Ghayoori* and T.A. Gulliver**
*Department of Electrical & Computer Engineering, Isfahan University of Technology, IRAN
**Department of Electrical & Computer Engineering, University of Victoria, Victoria, BC, CANADA
Email: [email protected]

Abstract

The conversion of text into speech using an inexpensive computer with a minimal amount of memory is of great interest. Speech synthesizers have been developed for many popular languages (e.g., English, Chinese, Spanish, and French), but the design of a speech synthesizer is largely dependent on the structure of the language. In this paper, we develop a Persian synthesizer that includes an innovative text analyzer module. In the synthesizer, the text is segmented into words and, after preprocessing, is passed to a neural network. In addition to preprocessing, a Smooth Ergodic Hidden Markov Model (SEHMM) is used as a post-processor to compensate for errors generated by the neural network. The performance of the proposed model is verified, and the intelligibility of the synthetic speech is assessed via listening tests.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence] Natural Language Processing---speech recognition and synthesis of Persian; I.2.6 [Artificial Intelligence] Learning---connectionism and neural nets

1. INTRODUCTION

The development and application of text-to-speech synthesis technology for various languages are growing rapidly [9][10]. Designing a synthesizer for a language depends largely on the structure of that language. In addition, there can be variations (dialects) particular to geographic regions, so designing a synthesizer requires significant investigation into the language structure or linguistics of a given region. In this paper we concentrate on designing a speech synthesizer for the most common Persian dialect.

The English language has some characteristics that make synthesizer design easier than for Persian. For example, English vowels always appear in written words, whereas in Persian long vowels are written while short vowels are omitted. In addition, since some letters can appear both as a consonant and as a (long) vowel in Persian, even distinguishing the long vowels is not an easy task. Thus the first and most important step in designing a Persian synthesizer is to determine the phonetic spelling of each word in the text. Persian phonetic spelling extraction is not as complex as for English, and there is a correspondence between standard phonetic spelling and diacritic orthography. There is no major difference between extracting the vowels and extracting the phonetic spelling of a word; both require the same amount of processing.

There are several ways in which Persian words can be translated into phonetic spelling. The first and most common method is to assign an equivalent phonetic spelling to each word and store these assignments in a lexicon. With this technique, a database is searched to find the phonetic spelling corresponding to each word. This is known as the dictionary method [1]. Several commercial methods for text-to-phoneme conversion rely on an exhaustive dictionary approach. However, not all words are included in the dictionaries, and therefore the phonetic spelling performance is compromised. A common approach to mitigate this problem is to search the database for the best match. Another approach is to use a probabilistic model designed for the language. A third approach uses a neural network to learn the Persian phonetic spelling structure. For this purpose, training data should be gathered that is representative of the language.
Using a neural network for phonetic spelling extraction in text-to-speech conversion creates a fast, distributed recognition engine that supports the mapping of words missing from the database. The dictionary method has the advantage of being accurate, but it needs a huge database that consumes significant resources, and the access time for such a database can be considerable. With the second and third methods some errors can be expected, but the access time is considerably lower (suitable for real-time applications), and the resources needed are negligible in comparison with the dictionary method. Since accurate probabilistic information about the Persian language is lacking, the probabilistic method cannot be used independently of other procedures to model the language. Because of the complexity of the phonetic spelling extraction process in Persian, using a neural network to extract all of the required structures leads to unacceptable accuracy. A combination of the second and third methods, together with some Persian language rules, is therefore a reasonable implementation choice. In this paper we present a composite method with the following stages: word processing, sentence processing, and speech generation.

2. WORD PROCESSING

The first step in designing a Persian synthesizer is to determine the phonetic spelling of each word. This is equivalent to determining the vowel associated with each letter. There are three short and three long vowels in Persian, and every short vowel has a long vowel equivalent. In this paper, the phonetic spelling extraction is done using a two-layer multilayer perceptron (MLP) neural network, as shown in Fig. 1. This is based on the pioneering work of Sejnowski and Rosenberg, i.e., NETtalk [7]. NETtalk uses a single neural network to deal with all phoneme cases. The numbers of nodes in the input, hidden and output layers are 54, 300 and 3, respectively. The activation function used to create non-linearity is the sigmoid.

Figure 1. The MLP neural network.
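To make the architecture concrete, the following is a minimal NumPy sketch of a 54-300-3 sigmoid MLP of the kind described above. It is illustrative only: the weight initialization and learning rate are assumptions, and the paper's 3-valued learning rule (discussed next) is replaced here by a plain backpropagation step on a squared-error loss.

```python
import numpy as np

class MLP:
    """A 54-300-3 sigmoid MLP matching the topology in the text, trained
    with a plain backpropagation step on a squared-error loss."""

    def __init__(self, n_in=54, n_hidden=300, n_out=3, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.lr = lr

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, x):
        self.h = self._sigmoid(x @ self.W1 + self.b1)    # hidden activations
        self.y = self._sigmoid(self.h @ self.W2 + self.b2)
        return self.y

    def train_step(self, x, target):
        y = self.forward(x)
        d2 = (y - target) * y * (1.0 - y)                # output-layer delta
        d1 = (d2 @ self.W2.T) * self.h * (1.0 - self.h)  # hidden-layer delta
        self.W2 -= self.lr * np.outer(self.h, d2)
        self.b2 -= self.lr * d2
        self.W1 -= self.lr * np.outer(x, d1)
        self.b1 -= self.lr * d1
        return float(((y - target) ** 2).sum())
```

The 54 inputs correspond to the nine-letter window of 6-bit letter codes described in Section 2.2, and the 3 outputs correspond to the codes in Table 2.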
There are many learning algorithms for training a neural network. One of the most common, backpropagation (BP), is employed here. The learning rule in the BP algorithm was chosen to be a 3-valued function with respect to the error at the output nodes. After the neural network is trained, it may still produce errors in the phonetic spelling extraction of an unknown word, and some of these errors have a very negative impact on the pronunciation of the word. To deal with these cases, we use a Smooth Ergodic Hidden Markov Model (SEHMM), which is described later in the paper.

2.1. Preprocessing

Before the words enter the neural network, some preprocessing is useful. Because of the complexity of the phonetic spelling extraction process, applying some Persian language rules can greatly increase the extraction accuracy. For example, if certain pairs of letters appear in a particular position in a word, there is no ambiguity about the phonetic spelling, so it can be determined using a rule database. This database specifies the positions and their corresponding phonetic spelling. These rules could also be incorporated into the neural network, but this would increase the amount of information that the network must learn. Thus using these preprocessing rules decreases the resources required to achieve a desired accuracy.

Some Arabic words are widely used in Persian. Arabic is a rule-based language in which the phonetic spelling extraction process obeys known rules, so the Arabic rules are also stored in the database to easily obtain the phonetic spelling of these words. There are some exceptions to the Persian rules, i.e., words for which the rules cannot be applied. These words and their corresponding phonetic spelling are stored in a dictionary.

Preprocessing is also used to identify the suffix and/or prefix of each word in the text. There are many suffixes and prefixes in the Persian language, and they should be separated from the original word before the word enters the neural network. These suffixes and prefixes produce many different words from a single stem, so by learning just the main word, the neural network can correctly extract the phonetic spelling of its variations.
When a suffix and/or prefix is identified, it is separated from the word as the last stage of preprocessing. The word is now ready to enter the MLP.
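As a schematic illustration of this separation step, the sketch below strips one prefix and one suffix before the stem is encoded for the network. The affix lists are hypothetical placeholders (e.g., the plural suffix "ها" and the verbal prefix "می"); the paper's actual suffix/prefix dictionary is not published.

```python
# Hypothetical affix lists; the real system uses a dictionary of Persian affixes.
PREFIXES = ["می"]          # placeholder entries
SUFFIXES = ["ها", "تر"]    # placeholder entries

def strip_affixes(word):
    """Return (prefix, stem, suffix); empty strings when no affix matches."""
    prefix = next((p for p in PREFIXES
                   if word.startswith(p) and len(word) > len(p)), "")
    word = word[len(prefix):]
    suffix = next((s for s in SUFFIXES
                   if word.endswith(s) and len(word) > len(s)), "")
    stem = word[:len(word) - len(suffix)] if suffix else word
    return prefix, stem, suffix
```

After the stem's phonetic spelling is extracted, the prefix and suffix are re-attached, as shown in the flowchart of Fig. 5.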
2.2. Training the neural network

A file containing the most common Persian words, along with their phonetic spelling, is used to train the neural network. Before the neural network begins to process the words, each letter is mapped into a 6-bit string (there are 32 letters in the Persian alphabet plus some signs, so six bits are required to represent all possible inputs). The complete mapping is given in Table 1.
Table 1. The mapping from 6-bit strings to the Persian alphabet.

000001 ا    001111 س    011101 ن
000010 ب    010000 ش    011110 و
000011 پ    010001 ص    011111 ﻩ
000100 ت    010010 ض    100000 ی
000101 ث    010011 ط    100001 ﺁ
000110 ج    010100 ظ    100010 ّ
000111 چ    010101 ع    bbbbbb blank
001000 ح    010110 غ
001001 خ    010111 ف
001010 د    011000 ق
001011 ذ    011001 ک
001100 ر    011010 گ
001101 ز    011011 ل
001110 ژ    011100 م
Figure 2. NN learning sequence for the word “shahr”. Nodes between the input layer and the hidden layer, and between the hidden layer and the output layer, are fully connected.
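The window construction of Fig. 2 can be sketched as follows: each letter maps to its 6-bit code from Table 1, and the letter being vowelized sits in the middle of a 9-letter window padded with the blank code, giving the 54 network inputs. The dictionary below is only the three-letter excerpt needed for "shahr".

```python
# Sliding-window encoding sketch, following Table 1 and Fig. 2.
LETTER_CODE = {"ش": "010000", "ﻩ": "011111", "ر": "001100"}  # excerpt of Table 1
BLANK = "bbbbbb"  # blank padding code, as in Fig. 2

def encode_window(word, pos, half_width=4):
    """Return the 9-cell input pattern for the letter at index `pos`."""
    cells = [LETTER_CODE[word[i]] if 0 <= i < len(word) else BLANK
             for i in range(pos - half_width, pos + half_width + 1)]
    return " ".join(cells)

# encode_window("شﻩر", 0) reproduces the first input row of Fig. 2 for "shahr".
```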
The six-bit representation of each letter enters the neural network along with the representations of the four adjacent letters on the right and the four adjacent letters on the left. For letters located near the ends of words, blanks are placed in the empty positions. This ensures that the neural network only operates on isolated words. We consider a sliding window, similar to NETtalk, with the letter to be vowelized placed in the middle of the window, i.e., Central Window Positioning [7]. In the alignment presented in [6], the window positioning structure is Second Position Asymmetric Windowing (SPAW), with an unequal number of places before and after the letter under consideration. This structure is called "Second Position" because this letter is in the second position from the center of the window. It is described in detail in [7]. The neural network inputs for learning a typical word are shown in Fig. 2. The number of adjacent letters is 8, which is the minimum necessary to determine the vowel of a letter without ambiguity in Persian.

Signs are also used in Persian (e.g., "tashdid" (gemination) and "tanween"). One of these signs, tashdid, is used when two similar letters come together and the first has no vowel while the second has a short vowel. In this case the first letter is omitted and tashdid is placed on the second letter. When this sign is encountered, the omitted letter is restored and the tashdid sign eliminated. The restored first letter does not have a vowel and so need not be processed by the neural network, since the network output is the vowel of a letter.

A letter can have a short vowel or a long vowel, so the output of the neural network should distinguish between these, as well as the case when the letter has no vowel. The neural network output should also help to determine any exceptions (these are discussed later). The following procedure is used by the neural network to identify long vowels. If a letter is followed by one of the three special letters "ا", "و" and "ی", it is a candidate for a long vowel. Whether a long vowel is actually present depends on whether these three letters appear as consonants; when they appear as consonants, they do not indicate a long vowel for the previous letter. Examples are given in Fig. 3 for the words ﺗﻮﺟﻪ (tavajoh) and ﺗﻮپ (toop). If the letters ا, و and ی indicate long vowels for the previous letter, the neural network generates special output codes to indicate the presence of a long vowel. In these cases the neural network does not process the letters ا, و and ی themselves.
Figure 3. Neural network (NN) processing of the two words ﺗﻮﺟﻪ (tavajoh) and ﺗﻮپ (toop). For the first word, the outputs of the NN indicate short vowels (و is a consonant here). For the second word, the output of the NN for the first letter indicates a long vowel.
The output of the neural network is also used to indicate some exceptions. There are some letter sequences in Persian (e.g., ﺧﻮا) whose orthography does not correspond to their phonetic structure, but only in certain special words; otherwise these letters are pronounced normally. When the neural network encounters these exceptions, it alerts the system, which then labels them so they can be pronounced correctly. For example, consider the two words ﺧﻮاهﺮ (/xa:hær/) and ﺧﻮاص (/xæva:s/). For the vowel of the letter خ in ﺧﻮاص, the neural network outputs 011 (indicating the short vowel /æ/). However, for the first letter of ﺧﻮاهﺮ, the neural network indicates the presence of an exception. This alerts the system that in this word the 2nd and 3rd letters produce the long vowel /a:/, so they can be treated properly. The complete list of neural network outputs is shown in Table 2.
Table 2. The neural network outputs.

Output   Diacritic     Meaning
001      ِ (kasre)     short vowel /e/
010      ُ (zamme)     short vowel /u/
011      َ (fathe)     short vowel /æ/
100      --            long vowel indicator
101      --            special case indicator
110      --            no vowel ("Sokoon")
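For illustration, a hedged sketch of how the three sigmoid outputs might be decoded against Table 2; thresholding each node at 0.5 is an assumption, not a detail given in the paper.

```python
# Decoding of the 3-bit network output per Table 2.
OUTPUT_MEANING = {
    "001": "short vowel /e/ (kasre)",
    "010": "short vowel /u/ (zamme)",
    "011": "short vowel /æ/ (fathe)",
    "100": "long vowel indicator",
    "101": "special case indicator",
    "110": "no vowel (sokoon)",
}

def decode_output(y, threshold=0.5):
    """Threshold the three sigmoid outputs and look the code up in Table 2."""
    code = "".join("1" if v > threshold else "0" for v in y)
    return OUTPUT_MEANING.get(code, "invalid code: " + code)
```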
3. PHONETIC SPELLING EXTRACTION USING A NEURAL NETWORK

When the neural network is trained correctly, it can be used to recognize the phonetic spelling of almost all words. First the prefixes and suffixes are separated from each word, and then the neural network processes each word, letter by letter. While the neural network moves from one letter to another, the Persian rules (mentioned in Section 2) are checked. If one of the rules is applicable, the neural network ignores the corresponding letters. These rules decrease the amount of processing done by the neural network in both the training and testing stages. They also increase the accuracy of synthesizing unknown words (those the NN has not been trained for). When the neural network encounters an unknown word, the output (which is based on the training data) may be an unpronounceable word. The system should have a mechanism to handle these situations in order to produce a meaningful phonetic spelling. In this paper, a 33-state Smooth Ergodic HMM (SEHMM) is employed for this purpose.

A Hidden Markov Model is a Markov model in which the observation is a probabilistic function of the state. The resulting model is a doubly embedded stochastic process in which the underlying process is not directly observable [4]. In the SEHMM, the observation probabilities are conditioned on the current state as well as the preceding and following states. Knowing these three states, the probability distribution of the observations can be determined. The SEHMM for the discrete symbol observation model is characterized by the following parameters. First, the number of states in the model is 33, because the size of the Persian alphabet is 32 and blank is a separate state. Define Q as the set of states involved in specifying the observation probability function of the SEHMM at time i:
$$Q = (q_{i-1}, q_i, q_{i+1}), \qquad q_i = j, \quad 1 \le j \le 33.$$

The other parameter is the number of distinct observation symbols, which is 6 in our model: three for the short vowels, one for no vowel, and two for long vowels and exceptions. Thus the observation vector for the HMM is

$$V = \{v_1, v_2, v_3, v_4, v_5, v_6\}.$$

As mentioned before, the type of HMM selected is a Smooth Ergodic HMM, or fully connected Smooth HMM (SHMM). In a SEHMM, every state of the model can be reached (in a single step) from every other state, as shown in Fig. 4. This model is practical for any language, because it assumes that every letter is reachable from every other one. This assumption is very close to reality, and enables the model to adapt itself to any new words that are added to the language.
The observation symbol probabilities $b_{lji}(k)$ are obtained using statistical information about the vowel of a letter conditioned on the preceding and following letters:

$$b_{lji}(k) = P(o_t = v_k \mid q_{t-1} = l,\; q_t = j,\; q_{t+1} = i).$$

The probability of an observation sequence O given a state sequence Q is then

$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_{t-1}, q_t, q_{t+1}, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T),$$

where $q_0 = \text{Blank}$.
For our purposes, this information was gathered by counting the number of occurrences in a dictionary, considering only the first letter of each word. In order to generate the most probable observations for a state sequence, the probability of a word occurring should be a factor in computing the probability of the observation. This probability is employed as a weight when counting the occurrences used in calculating the probabilities, according to

$$b_{lji}(k) = \frac{\sum_{w:\, o_t = v_k} P_w}{\sum_{w} P_w},$$

where w is a word with state sequence Q and $P_w$ is the probability of w. Any standard estimation algorithm can be used with the ML criterion along with a training sequence of observations O to derive the set of model parameters $\lambda$, yielding

$$\lambda_{ML} = \arg\max_{\lambda}\, p(O \mid \lambda).$$
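A sketch of this frequency-weighted counting is given below. It assumes each training word is supplied as a (state sequence, observation sequence, P_w) triple, with letters numbered 1-33 and state 0 standing for Blank; the data layout is an assumption for illustration.

```python
from collections import defaultdict

def estimate_b(words):
    """Estimate b_{lji}(k) = P(o_t = v_k | q_{t-1}=l, q_t=j, q_{t+1}=i)
    by frequency-weighted counting. `words` yields (states, obs, p_w)
    triples; state 0 denotes Blank before and after each word."""
    num = defaultdict(float)   # numerator:   sum of P_w with o_t = v_k
    den = defaultdict(float)   # denominator: sum of P_w over all words
    for states, obs, p_w in words:
        padded = [0] + list(states) + [0]                 # Blank padding
        for t, v_k in enumerate(obs):
            ctx = (padded[t], padded[t + 1], padded[t + 2])  # (l, j, i)
            num[(ctx, v_k)] += p_w
            den[ctx] += p_w
    return {key: val / den[key[0]] for key, val in num.items()}
```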
The trained SEHMM is used to determine how to treat critical errors that occur during operation of the neural network. In Persian, if the first letter of a word is not followed by a vowel (long or short), the word cannot be pronounced. To handle these situations, the SEHMM generates a vowel for the first letter according to the observation probabilities stored in a table. By inspecting this table, conclusions can be drawn in many cases. For example, there are some pairs of letters that can appear at the beginning of a word such that the first letter can only accept one particular vowel. These cases need no further processing, since the vowel of the letter can be determined without ambiguity. This statistical information (which was gathered from a much larger database than that used for training the neural network) can be used to choose the most probable vowel for the beginning letter.
Figure 4. A 4-state ergodic hidden Markov model.

The other likely critical error is the case when three or more successive consonants have no assigned vowels at the neural network output. In this case the SEHMM generates a vowel for the middle consonants.
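The sketch below implements both repairs under stated assumptions: observations are the 3-bit codes of Table 2 ("110" meaning no vowel), `b` is the table produced by `estimate_b` above, and restricting the candidates to the three short vowels is an illustrative simplification.

```python
def repair_vowels(states, obs, b, vowels=("001", "010", "011")):
    """Post-process the NN output: give the first letter a vowel if it has
    none, and break up runs of three vowel-less letters, choosing the most
    probable short vowel under the trained SEHMM table `b`."""
    obs = list(obs)
    padded = [0] + list(states) + [0]              # Blank padding, as above
    def most_probable_vowel(t):
        ctx = (padded[t], padded[t + 1], padded[t + 2])
        return max(vowels, key=lambda v: b.get((ctx, v), 0.0))
    if obs and obs[0] == "110":                    # unpronounceable first letter
        obs[0] = most_probable_vowel(0)
    for t in range(1, len(obs) - 1):               # three vowel-less in a row
        if obs[t - 1] == obs[t] == obs[t + 1] == "110":
            obs[t] = most_probable_vowel(t)        # vowel for the middle one
    return obs
```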
A flowchart of the complete phonetic spelling extraction process is given in Fig. 5.

4. SENTENCE PROCESSING

In English, the pronunciation of a word does not differ according to the sentence it appears in. In Persian, however, the pronunciation of a word may differ slightly from sentence to sentence because of the vowel of the last letter in each word. The vowel of the last letter in a Persian word depends on the function of that word in the sentence. The last letter of a noun can have the vowel /e/ or "Sokoon"; the vowel of the last letter of other words (e.g., verbs, prepositions, conjunctions) is always "Sokoon". Thus determining the vowel of the last letter requires some grammatical processing.
Figure 5. A flowchart of the phonetic spelling extraction process: the input text is segmented into words; each word is checked against the dictionary (predetermined letter combinations, suffixes/prefixes, exceptions that do not conform to the rules) and the Persian rules (which can specify the vowels of some words without processing); the remaining letters are processed by the NN; unpronounceable outputs are identified and post-processed; and finally the letters, their vowels, and any suffix/prefix are reassembled into the phonetic spelling of the Persian word.

5. SPEECH GENERATION

Within the framework of an unlimited vocabulary speech synthesizer, rule-based transcription systems have been developed for many languages. Since the use of stored speech as words or phrases is impractical (because of the huge number of words and phrases), the production of artificial speech, independent of any utterance, is useful. Many applications can profit from such systems, such as electronic mail readers and talking computers. Three approaches, articulatory synthesis, formant synthesis and concatenative synthesis, have a long history. The simplest is concatenative synthesis, which takes segments of recorded speech and concatenates them during synthesis. Some systems use recorded speech coded by linear predictive coding (LPC). Other systems perform synthesis directly in the time domain (e.g., PSOLA) by storing and concatenating waveform segments. In general, the speech quality of systems that use time-domain techniques is higher than that of LPC-based systems. This section considers the design of a text-to-speech system that accepts phonetic spelling as input and generates speech as output.
As mentioned above, word or phrase synthesis is best in terms of complexity and the quality of the synthesized speech. However, this method cannot be used for a large vocabulary, since an impractical amount of memory would be required. In contrast, using phonemes and allophones for synthesis produces low-quality speech, and the design of such a system is very complex, although the resources required are negligible in comparison with the first method. The procedure developed in this paper uses a moderate-sized inventory of sub-phonetic segments of short duration. These segments comprise the required steady-state sounds and the transitions between these sounds. The resulting synthetic speech is highly intelligible and of good quality. Fig. 6 shows the complete text-to-speech conversion process.

5.1 Synthesis Units and Phonemes

As explained previously, there are six Persian vowels, three short and three long. These vowels cannot be the first letter, but can occur between two consonants or at the end of a word. The Persian consonants can occur anywhere in a word. There are vowel-consonant and consonant-vowel transitions in Persian. Any consonant can occur before any of the six vowels, but there is no phonetic difference between the consonants within each set given in Table 3.

Table 3. The 6 sets of consonants that are phonetically similar.

Consonants    Phoneme
ت ط           /t/
ض ظ ز ذ       /z/
ص س ث         /s/
ع ا           /?/
ح ﻩ           /h/
ق غ           /q/
Figure 6. The text-to-speech conversion process: a string of words is created from the input text; each word is checked against the exceptions lexicon and the Persian rules to generate a proper string of diacritic words, falling back to the phonetic spelling extraction process otherwise; the synthesis unit generation rules and concatenation rules then produce the output speech.
5.2 Converting Text to Its Equivalent Phonetic Spelling

The text to phonetic spelling conversion consists of two stages. The first stage converts exceptional words and abbreviations to phonemes by examining the pronunciation dictionary; strings of numbers must also be converted to words. The second stage is the phonetic spelling extraction for all remaining words.

The exception lexicon contains the spelling and equivalent phonetic spelling of Persian nouns whose pronunciation does not follow the regular rules (as mentioned previously, Persian rules are used to improve the performance of the system). These rules and the phonetic extraction rules have some exceptions that must be stored in a database. Words having a fixed pronunciation, abbreviations, and some frequently occurring words are added to the database to speed up the conversion process. The lexicon is not formulated in rule notation, but rather has a simple table structure. Strings that match entries in the lexicon are converted and passed on to the speech generation part of the system.

Persian phonetic spelling extraction is not as complex as for English, and there is a correspondence between standard phonetic spelling and diacritic orthography. However, in Persian there are diacritic symbols that are used to represent morphophonemic processes of the language, for example "tanween" and "tashdid" as explained previously. The diacritic symbols "Fathe", "Zamme", and "Kasre" are used for the short vowel sounds /a/, /u/, and /i/, respectively. The three different "Tanween" symbols ( ً , ٍ , ٌ ) are pronounced as the phoneme sequences /an:/, /in:/, and /un:/, respectively.

The Persian long vowels (/a:/, /u:/, and /i:/) are not represented in the orthography by diacritic symbols. Instead, they have to be inferred from the written context of the letters "Alef", "Waw" and "Ye". The long vowel generation rules are as follows. The long vowel /a:/ is produced when the letter symbol for "Alef" is preceded by a consonant. The long vowel /u:/ is produced when the letter symbol for "Waw" is preceded by a consonant; this consonant should not have any diacritic symbol. The long vowel /i:/ is produced when the letter symbol for "Ye" is preceded by a consonant; this consonant should not have any diacritic symbol.
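The three rules transcribe directly into code. The sketch below assumes parallel lists of letters and their diacritic marks (None when absent); the consonant test is schematic, standing in for a full letter classifier.

```python
# A direct reading of the three long-vowel generation rules above.
LONG_VOWEL = {"ا": "a:", "و": "u:", "ی": "i:"}  # Alef, Waw, Ye

def long_vowel_at(letters, diacritics, i):
    """Return the long vowel produced at position i, or None."""
    if i == 0 or letters[i] not in LONG_VOWEL:
        return None
    if letters[i - 1] in LONG_VOWEL:         # previous letter must be a consonant
        return None
    if letters[i] == "ا":                    # Alef: a preceding consonant suffices
        return LONG_VOWEL["ا"]
    if diacritics[i - 1] is None:            # Waw/Ye: consonant must be unmarked
        return LONG_VOWEL[letters[i]]
    return None
```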
Table 3. The phonetic symbols of Persian sounds.

Persian   Phonemic   Long duration      Persian   Phonemic   Long duration
symbol    symbol     allophone symbol   symbol    symbol     allophone symbol
ب         /b/        [b:]               ض         /z/        [z:]
پ         /p/        [p:]               ط         /t/        [t:]
ت         /t/        [t:]               ظ         /z/        [z:]
ث         /s/        [s:]               ع         /?/        [?:]
ج         /j/        [j:]               غ         /q/        [q:]
چ         /č/        [č:]               ف         /f/        [f:]
ح         /h/        [h:]               ق         /q/        [q:]
خ         /x/        [x:]               ک         /k/        [k:]
د         /d/        [d:]               گ         /g/        [g:]
ذ         /z/        [z:]               ل         /l/        [l:]
ر         /r/        [r:]               م         /m/        [m:]
ز         /z/        [z:]               ن         /n/        [n:]
ژ         /ž/        [ž:]               و         /w/        [w:]
س         /s/        [s:]               ﻩ         /h/        [h:]
ش         /š/        [š:]               ی         /y/        [y:]
ص         /s/        [s:]
Short vowels are included in the text as diacritic symbols. The correct inclusion of short vowels requires the introduction of morphologic and syntactic levels, which is done by the neural network with the SEHMM. The synthesis units in the system are diphones. The final step is to concatenate the synthesis units into wave files.

6. TD-PSOLA

The speech synthesis module uses the time domain pitch synchronous overlap add (TD-PSOLA) method [13] to synthesize the synthesis units, based on a waveform dictionary. Finally the TTS system concatenates the synthetic waveforms and delivers the result to the playback device. Before synthesis, however, the phonemes must have their duration and pitch modified in order to satisfy the prosodic constraints of the new words containing these phonemes. This processing is necessary to avoid producing monotonous-sounding synthesized speech. To allow these modifications of the recorded subunits, many concatenation-based TTS systems employ the TD-PSOLA method. With this method, the speech signal is first processed by a pitch marking algorithm, which assigns marks at the peaks of the signal in the voiced segments. The synthesis is performed by a superposition of Hanning-windowed segments centered at the pitch marks and extending from one pitch mark to the next. Duration modifications are provided by deleting or replicating some of the windowed segments. Pitch period modifications are provided by increasing or decreasing the overlap between windowed segments. These modifications are illustrated in Fig. 7.

Figure 7. TD-PSOLA synthesis: (a) increase in duration, (b) decrease in duration, (c) increase in pitch.
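The sketch below illustrates the overlap-add scheme of Fig. 7 under simplifying assumptions: uniform Hanning windows of two pitch periods, voiced speech only, and boundary frames simply skipped. A production TD-PSOLA implementation treats unvoiced segments and window edges more carefully.

```python
import numpy as np

def td_psola(x, marks, duration_factor=1.0, pitch_factor=1.0):
    """Minimal TD-PSOLA sketch: Hanning-windowed segments centered at the
    pitch marks are replicated or deleted to change duration, re-spaced
    to change pitch, and overlap-added."""
    periods = np.diff(marks)
    t_max = int(periods.max())
    n_out = max(2, int(round(len(periods) * duration_factor)))
    # Map each output mark back to a source mark (replication/deletion).
    src = [min(int(round(k / duration_factor)), len(periods) - 1)
           for k in range(n_out)]
    y = np.zeros(n_out * (int(t_max / pitch_factor) + 1) + 2 * t_max)
    pos = t_max                    # room for the left half of the first window
    for k in src:
        T = int(periods[k])
        seg = x[marks[k] - T: marks[k] + T]  # two pitch periods around the mark
        if marks[k] - T >= 0 and len(seg) == 2 * T:
            y[pos - T: pos + T] += seg * np.hanning(2 * T)   # overlap-add
        pos += int(round(T / pitch_factor))  # closer marks raise the pitch
    return y[:pos + t_max]
```

For example, `td_psola(x, marks, duration_factor=1.2, pitch_factor=1.0)` lengthens a unit by about 20% at constant pitch, while `pitch_factor=1.1` raises the pitch by about 10%.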
6.1 Pitch Marking

One of the most popular techniques for detecting pitch is autocorrelation analysis of appropriately preprocessed speech [12]. Autocorrelation methods have met with good success for several reasons. The autocorrelation is applied directly to the waveform and is computationally straightforward. The method is amenable to digital hardware implementation, generally requiring only a single multiplier and an accumulator as computational elements. The autocorrelation method is also largely phase insensitive [11], which makes it well suited to detecting the pitch of speech that has been transmitted over a telephone line or has otherwise suffered some degree of phase distortion.

Although autocorrelation pitch detection has advantages, there are also several problems, which are explained and discussed in [11]. One is choosing an appropriate analysis frame size. The ideal analysis frame should contain 2 to 3 pitch periods, so the window size can vary from 5 ms for high-pitched speakers to 50 ms for low-pitched speakers. In this work, the pitch period of each synthesis unit was estimated beforehand, which was found to provide acceptable performance, so this problem was not an issue here. To partially eliminate the effects of the higher formants on the autocorrelation function, most methods use a lowpass filter with a cutoff frequency of around 900 Hz. This will, in general, preserve a sufficient number of pitch harmonics for accurate pitch detection, but will eliminate the second and higher formants.
The correlation function operates on a short segment of the signal [11]:

$$\phi_l(m) = \frac{1}{N'} \sum_{n=0}^{N'-1} \big[x(n+l)\,w(n)\big]\big[x(n+l+m)\,w(n+m)\big], \qquad 0 \le m \le M_0 - 1,$$

where $w(n)$ is a rectangular analysis window ($w(n) = 1$ for $0 \le n \le N-1$ and $w(n) = 0$ otherwise), N is the length of the signal being analyzed, $N'$ is the number of signal samples used in computing $\phi_l(m)$, $M_0$ is the number of autocorrelation points to be computed, and l is the index of the starting sample of the frame. For pitch detection, $N'$ is always set to $N - m$. To reduce the effects of the formant structure on the autocorrelation shape, two preprocessing functions are used prior to the autocorrelation computation [11]. Fig. 8 shows a block diagram of the processor that was used.
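A sketch of this computation is shown below. It evaluates $\phi_l(m)$ exactly as defined (rectangular window, $N' = N - m$) and picks the autocorrelation peak within a plausible pitch range; the bounds `fmin` and `fmax` are illustrative assumptions, not values from the paper, and $M_0 < N$ is assumed.

```python
import numpy as np

def autocorr_pitch(x, l, N, M0, fs, fmin=60.0, fmax=400.0):
    """Compute phi_l(m) for lags 0..M0-1 and return the lag of its peak
    within the pitch range as the pitch period, plus F0 in Hz."""
    frame = np.asarray(x[l: l + N], dtype=float)
    phi = np.empty(M0)
    for m in range(M0):
        n_prime = N - m                                   # N' = N - m samples
        phi[m] = np.dot(frame[:n_prime], frame[m: m + n_prime]) / n_prime
    lo = max(1, int(fs / fmax))                           # shortest period
    hi = min(int(fs / fmin), M0 - 1)                      # longest period
    m_star = lo + int(np.argmax(phi[lo: hi + 1]))
    return m_star, fs / m_star        # pitch period (samples) and F0 (Hz)
```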
Figure 8. Block diagram of the non-linear correlator.

7. SYSTEM IMPLEMENTATION

To demonstrate the stability of the Persian synthesis, the text-to-speech conversion system was implemented on a personal computer (PC). The system was programmed to perform the text-to-speech and phoneme-to-speech modules. A total of 322 synthesis units were extracted beforehand and stored on the PC. The concatenation of the synthesis units was carried out in a speech buffer.

7.1. System Performance

To evaluate system performance, a test text containing 800 words was employed. To assess the understandability of the synthesized speech, a closed evaluation test was conducted in which users had access to the original text. In addition, an open evaluation test was performed to evaluate perceptual confusion, as in [3]. The test subjects were students whose native language is Persian, but who had no previous experience with synthesized Persian speech. An audiotape containing the synthesized stimulus words was prepared. The test text was played only once, and the listeners were not allowed to review the tape. Subjects were tested individually, and they were informed that they would hear synthesized text.

In the closed evaluation test, subjects were told to score the text words they heard. One of three scores was assigned to each test word:

A: The pronunciation can be understood without ambiguity.
B: The word has an intelligibility level much higher than C, but not as good as A.
C: The pronunciation is difficult to understand.

Results of the closed evaluation tests for the system without a postprocessor stage are shown in Table 4. This table gives the average percentage of correct responses, in addition to the understandability percentage for each listener. It is clear that not handling unpronounceable words results in poor performance, as intelligibility is well below 90%. Another closed evaluation test was performed with the complete system, including post-processing. The results of this test are given in Table 5, and show that post-processing increases the average intelligibility by more than 11 percentage points (from 86.6% to 97.88%).

In the open evaluation test, subjects were asked to write down what they heard. In this test the intelligibility percentage decreased, as expected, since no list of words or text was given to the listeners. The evaluation results show that the listeners wrote 89% of the words correctly. Repeating the test with people who had previously been exposed to synthesized speech resulted in better performance.
Table 4. Results of the closed evaluation test without postprocessing.

Listener      A        B       A+B (Intelligibility)   C
Listener 1    80.1%    7.6%    87.7%                   12.3%
Listener 2    78.75%   5.25%   84%                     16%
Listener 3    85.1%    8.4%    93.5%                   6.5%
Listener 4    80.7%    3.5%    84.2%                   15.8%
Listener 5    83.2%    5.1%    88.3%                   11.7%
Listener 6    81.2%    5.3%    86.5%                   13.5%
Listener 7    79.2%    5.3%    84.5%                   15.5%
Listener 8    82.3%    3.1%    85.4%                   14.6%
Listener 9    81%      5.0%    86%                     14%
Listener 10   80.1%    5.3%    85.4%                   14.6%
Average       81.2%    5.4%    86.6%                   13.4%
Table 5. Results of the closed evaluation test with postprocessing.

Listener      A        B       A+B (Intelligibility)   C
Listener 1    96.3%    2%      98.3%                   1.7%
Listener 2    95.75%   2.25%   98%                     2%
Listener 3    94.1%    3.4%    97.5%                   2.5%
Listener 4    98.1%    1.5%    99.6%                   0.4%
Listener 5    96.1%    1.5%    97.7%                   2.4%
Listener 6    93.25%   4.25%   97.5%                   2.5%
Listener 7    93%      4%      97%                     3%
Listener 8    92.3%    3.5%    95.8%                   4.2%
Listener 9    95%      3.9%    98.9%                   1.1%
Listener 10   96.1%    2.5%    98.6%                   1.4%
Average       95%      2.88%   97.88%                  2.12%
8. CONCLUSIONS

In this paper a system was designed for Persian phonetic spelling extraction using a neural network. When unknown words are encountered, the neural network may produce critical errors that can make a word unpronounceable. To handle these cases, post-processing using a SEHMM was employed, and it was shown that the SEHMM improves system performance by 10-15%. A synthesizer was designed to test the phonetic spelling extraction process. The synthesizer accepts an input diacritic (vowelized) text and produces output speech by concatenating speech synthesis units. The intelligibility of this system was shown to be satisfactory, and thus it is a good substitute for dictionary-based phonetic spelling extraction systems. This system can easily be applied to Arabic (which is very similar to Persian). However, the system parameters (e.g., the SEHMM observation function, neural network weights and extraction rules) are language dependent and should be determined separately for each language.
References

[1] L.-S. Lee, C.-Y. Tseng and C.-J. Hsieh, "Improved tone concatenation rules in a formant-based Chinese text-to-speech system," IEEE Trans. Speech and Audio Proc., vol. 1, no. 3, pp. 287-294, July 1993.
[2] W.A. Ainsworth, "A system for converting English text into speech," IEEE Trans. Audio and Electroacoustics, vol. 21, no. 3, pp. 288-290, June 1973.
[3] H. Selim and T. Anbar, "A phonetic transcription system of Arabic text," IBM Cairo Scientific Center Tech. Report No. 25, Aug. 1986.
[4] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257-286, 1989.
[5] Y.A. El-Imam, "An unrestricted vocabulary Arabic speech synthesis system," IEEE Trans. Acoustics, Speech and Signal Proc., vol. 37, pp. 1829-1845, Dec. 1989.
[6] M.J. Embrechts and F. Arciniegas, "Neural networks for text-to-speech phoneme recognition," Proc. IEEE SMC Conf., pp. 3582-3587, Oct. 2000.
[7] T.J. Sejnowski and C.R. Rosenberg, "NETtalk: Parallel networks that learn to pronounce English text," Complex Systems, vol. 1, no. 1, pp. 145-168, 1987.
[8] P.C. Bagshaw, "Phonemic transcription by analogy in text-to-speech synthesis: Novel word pronunciation and lexicon compression," Computational Linguistics, vol. 12, pp. 119-142, 1998.
[9] R. Sproat, J. Hu and H. Chen, "Emu: An e-mail preprocessor for text-to-speech," Proc. IEEE Workshop on Multimedia Signal Proc., pp. 239-244, Dec. 1998.
[10] C.-H. Wu and J.-H. Chen, "Speech activated telephony e-mail reader (SATER) based on speaker verification and text-to-speech conversion," IEEE Trans. Consumer Electronics, vol. 43, no. 3, pp. 707-716, Aug. 1997.
[11] L.R. Rabiner, "On the use of autocorrelation analysis for pitch detection," IEEE Trans. Acoustics, Speech and Signal Proc., vol. 25, no. 1, pp. 24-33, Feb. 1977.
[12] L.R. Rabiner, M.J. Cheng, A.E. Rosenberg and C.A. McGonegal, "A comparative performance study of several pitch detection algorithms," IEEE Trans. Acoustics, Speech and Signal Proc., vol. 24, pp. 399-418, Oct. 1976.
[13] E. Moulines and F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Commun., vol. 9, pp. 453-467, 1990.
[14] J.-H. Chen and Y.-A. Kao, "Pitch marking based on an adaptable filter and a peak-valley estimation method," Int. J. Comp. Linguistics and Chinese Language Proc., vol. 6, pp. 1-12, Aug. 2001.