ISSN: 2278 – 7798 International Journal of Science, Engineering and Technology Research (IJSETR) Volume 2, Issue 5, May 2013
Diphone-Concatenation Speech Synthesis for Myanmar Language
Ei Phyu Phyu Soe, Aye Thida
University of Computer Studies, Mandalay, Myanmar

Abstract— Speech synthesis is a popular field within Natural Language Processing in computer science. A speech synthesis system is composed of a Natural Language Processing (NLP) part and a Digital Signal Processing (DSP) part. This paper presents the DSP part of a speech synthesis system for the Myanmar language using the diphone-concatenation method. The method concatenates speech at the diphone level and applies the Pitch Synchronous Overlap and Add (PSOLA) algorithm to smooth the joints between speech signals. PSOLA has two variants: time-domain and frequency-domain. This paper describes the Time-Domain Pitch Synchronous Overlap and Add (TD-PSOLA) method in diphone-concatenation speech synthesis. One contribution of this paper is the construction of a Myanmar diphone database for diphone-concatenation speech synthesis. Concatenative synthesis reduces the problems encountered with formant synthesis. The diphone database of Myanmar pronunciations is constructed in this research to reduce ambiguity in pronunciation. In diphone-concatenation synthesis, space complexity and search time are lower than in other techniques. This paper illustrates techniques for improving the performance of Myanmar text-to-speech using TD-PSOLA, which decomposes the signal into overlapping frames synchronized with the pitch period. Diphone-concatenation synthesis must maintain the consistency and accuracy of the pitch marks of the speech signal, and it relies on a diphone database that integrates the vowels and consonants of the Myanmar language. This paper shows test results with various pitch-mark overlaps for the speech waveforms of a Myanmar sentence. The results show that the quality of the synthesized speech improves as the overlap time increases. The aim is to take a word sequence and produce “human-like” speech.

Index Terms—PSOLA, diphone-concatenation, synthesis, Myanmar speech, text-to-speech.
I. INTRODUCTION
A text-to-speech synthesis system converts input Myanmar text into human-like spoken language. This generally requires extensive knowledge of the language and of the context the text comes from, as well as a deep understanding of the semantics of the text content and its relations. Many research and commercial speech synthesis systems have contributed to our understanding of these phenomena, and they have been successful in applications such as speech-to-speech machine translation, interactive voice response systems, reading software for the blind, linguistic research and language teaching. Text-to-speech (TTS) technology gives computers the ability to convert text into audible speech, with the goal of delivering information by voice. It has been used to provide easier means of communication and to improve access to textual information for people with visual impairment. Two criteria are used to judge the quality of a TTS synthesizer. Intelligibility refers to how easily the output can be understood. Naturalness refers to how much the output sounds like the speech of a real person. Most existing systems have reached a fairly satisfactory level of intelligibility, while significantly less success has been attained in producing highly natural speech [1].

II. TEXT-TO-SPEECH SYNTHESIS
A TTS voice is a computer program with two major parts: a natural language processing (NLP) part, which reads the input text and translates it into a phonetic language, and a digital signal processing (DSP) part, which converts the phonetic language into speech. The input text might be, for example, data from a word processor, standard ASCII from e-mail, a mobile text message, or scanned text from a newspaper. The character string is preprocessed and analyzed into a phonetic representation, usually a string of phonemes with some additional information for linguistic representation. A TTS system generally consists of four modules: text analysis, phonetic analysis, prosody analysis and speech synthesis. Text analysis has two parts: syllable segmentation and number conversion. The input Myanmar text is segmented into syllables. Syllable segmentation is the process of identifying syllable boundaries in a text; it makes it possible to generate phonetic sequences using the phonetic dictionary. The number converter converts numbers to their textual form. Non-standard words are tokens such as numbers, which need to be expanded into sequences of Myanmar words before they are pronounced. A number is expanded into a string of words representing its cardinal value. Phonetic analysis is also called grapheme-to-phoneme conversion. It translates the syllables of Myanmar text into a phonetic sequence, determining the pronunciation of a syllable from its spelling. It also finds the best sequence of phonemes for words, numbers and symbols and converts them into phonetic sequences. We construct a Myanmar phonetic dictionary to generate the phoneme sequence and to pronounce these phonemes. In the prosodic analysis step, the phonetic sequences are analyzed to produce prosodic features by applying phonological rules. This module analyzes duration and intonation, such as pitch variation and syllable length, to make the synthetic speech natural. The combination of a consonant phoneme and a vowel phoneme produces a syllable. The phonetic alphabet is usually divided into two main categories, vowels and consonants. Vowels are always voiced sounds and they are produced with the vocal cords in vibration, while
All Rights Reserved © 2013 IJSETR
consonants may be either voiced or unvoiced. Vowels have considerably higher amplitude than consonants, and they are also more stable and easier to analyze and describe acoustically. Because consonants involve very rapid changes, they are more difficult to synthesize properly. Speech synthesis, the automatic generation of speech waveforms, has been under development for several decades [12], [13]. Synthesized speech can be produced by several different methods, which are usually classified into three groups:
Articulatory synthesis, which attempts to model the human speech production system directly.
Formant synthesis, which models the pole frequencies of the speech signal or the transfer function of the vocal tract based on the source-filter model.
Concatenative synthesis, which uses prerecorded samples of different lengths derived from natural speech.
Concatenative synthesis is the most commonly used method in present synthesis systems and is becoming more and more popular. The articulatory method is still too complicated for high-quality implementations [1]. The aim of this paper is to improve the quality of Myanmar text-to-speech by applying concatenative speech synthesis with the TD-PSOLA (Time-Domain Pitch Synchronous Overlap and Add) algorithm. The linguistic analysis stage maps the input text into a standard form, determines the structure of the input, and finally decides how to pronounce it. The synthesis stage converts the symbolic representation of what to say into an actual speech waveform [9].
III. MYANMAR LANGUAGE
Myanmar writing does not use white space between words or between syllables. Thus, the computer has to determine syllable and word boundaries by means of an algorithm, such as a finite-state or rule-based method. Moreover, a Myanmar syllable can be composed of multiple characters. Syllable segmentation is the process of determining syllable boundaries in a piece of text. A Myanmar word can consist of one or more morphemes that are linked more or less tightly together. Typically, a word consists of a root or stem and zero or more affixes. Words can be combined to form phrases, clauses and sentences. A word consisting of two or more stems joined together is known as a compound word. To process text computationally, words have to be determined first [2]. The purpose of this paper is to develop a Myanmar text-to-speech system and to improve synthesis quality by applying diphone-concatenation speech synthesis. The Myanmar language is the official language of Myanmar and is more than one thousand years old. Texts in the Myanmar language use the Myanmar script, which is descended from the Brahmi script of ancient South India. Other Southeast Asian descendants of Brahmi, known as Brahmic or Indic scripts, include Thai, Khmer and Lao. A Myanmar text is a string of characters without explicit word boundary markup, written from left to right without regular inter-word spacing, although inter-phrase spacing may sometimes be used. Myanmar characters can be classified into three groups: consonants, medials and vowels. The basic consonants can be multiplied by medials. Syllables or words are formed by combining consonants with vowels. However, some syllables can be formed from consonants alone, without any vowel. Other characters in the Myanmar script include special characters, numerals, punctuation marks and signs.

Table1. Myanmar Character

There are 34 basic consonants in the Myanmar script, as displayed in Table1. They are known as “Byee” in the Myanmar language [2]. Consonants serve as the base characters of Myanmar words, and are similar in pronunciation to those of other Southeast Asian scripts such as Thai, Lao and Khmer. Medials are known as “Byee Twe” in Myanmar. There are 4 basic medials and 6 combined medials in the Myanmar script. The 10 medials can modify the 34 basic consonants to form 340 additional multi-clustered consonants. Therefore, a total of 374 consonants exist in the Myanmar script, although some consonants have the same pronunciation. Vowels are known as “Thara”. Vowels are the basic building blocks of syllable formation in the Myanmar language, although a syllable or a word can be formed from consonants alone, without a vowel. As in other languages, multiple vowel characters can exist in a single syllable. Special characters for the Myanmar language are used as
prescription nouns and conjunction words between two or more sentences. Numerals for the Myanmar language are known as “Counting Numbers”. There are 10 basic digits for counting.

IV. MYANMAR LANGUAGE PHONOLOGY
A phoneme is the smallest unit that distinguishes words and morphemes. Therefore, changing a phoneme of a word to another phoneme produces a different word or a nonsense utterance, whereas changing a phone to another phone, when both belong to the same phoneme, produces the same word with an odd or incomprehensible pronunciation. Phonemes are not physical segments themselves, but mental abstractions of them [5]. Different acoustic realizations of a phoneme are called allophones. The acoustic characteristics of phonemes come from the movement of the vocal tract during their articulation. There are three types of phonetic parameters in the phonology of the Myanmar language: place of articulation, articulator, and manner of articulation [3]. The pronunciation of Myanmar words depends on these parameters. A phoneme is a contrastive unit in the sound system of a particular language: a minimal unit that serves to distinguish between the meanings of words. A phoneme can be pronounced in one or more ways, depending on the number of its allophones. By convention, it is written between slashes. Table2 describes the inventory of Myanmar consonant phonemes defined by the International Phonetic Association (IPA) [3], [4], [11].

Table2. The inventory of Myanmar consonant phonemes
A. Myanmar Phonological Tones
The Myanmar language has four tones and a simple syllable structure that consists of an initial consonant followed by a vowel with an associated tone. This means that all syllables in Myanmar have prosodic features. Different tones give different meanings to syllables with the same phoneme structure. In the Myanmar writing system, a tone is represented by a diacritic mark [3], [4]. The fundamental frequency, as shown in Figure1, rises gradually from Tone 1 to Tone 4. Tone 1 starts at a relatively level range and tends to go down slightly; Tone 2 starts at a relatively level range, goes up, and then falls relatively low; Tone 3 starts at a relatively high range, usually higher than or as high as the peak of Tone 2, and falls relatively low; Tone 4 starts at a high range, frequently higher than or as high as the peak of Tone 2, and falls low, but not as low as Tone 3 because it stops very suddenly before it can drop lower [4]. The general contrastive features of the four phonological tones, derived from the analysis of their fundamental frequency, are shown in Figure1:
Figure1. Four Tones of Myanmar Language

There are four tones in the Myanmar language. The lengths of the tones are: Tone 1, 18.50 cs; Tone 2, 21.03 cs; Tone 3, 15.44 cs; and Tone 4, 10.35 cs. A Myanmar toneme is therefore described by its rate or duration, where the length of the tone defines the rate or duration. Of the four tones, Tone 2 is the longest and Tone 4 is the shortest. The redundant features of the four tones are described in Table3.

Table3. Features of Myanmar Tones
Description Tone1 Tone2 Tone3 Tone4
Rate 2 3 1 0
Duration 18.5 21.03 15.44 10.35
Low + +
High + + +
Low-Falling + + +

B. Phonological Structure of Myanmar Language
The Myanmar language uses a rather large set of 50 vowel phonemes, including diphthongs, although its 22 to 26 consonants are close to average. Some languages, such as French, have no phonemic tone or stress, while several of the Kam-Sui languages have nine tones, and one of the Kru languages, Wobe, has been claimed to have 14, though this is disputed. The most common vowel system consists of the five vowels /i/, /e/, /a/, /o/, /u/. The most common consonants are /p/, /t/, /k/, /m/, /n/. Relatively few languages lack any of these, although it does happen: for example, Arabic lacks /p/, standard Hawaiian lacks /t/, Mohawk and Tlingit lack /p/ and /m/, Hupa lacks both /p/ and a simple /k/, colloquial Samoan lacks /t/ and /n/, while Rotokas and Quileute lack /m/ and /n/ [5]. Table4 shows the phonetic signs of the 50 Myanmar vowels used to pronounce Myanmar words. These 50 phonemes show the basic symbol with four tone levels [3].
Figure2. Combination of Phoneme Syllable

Phonology is how speech sounds are organized and affect one another in pronunciation. The combination of a consonant phoneme and a vowel phoneme produces a syllable, as shown in Figure2. The phonetic alphabet is usually divided into two main
categories, vowels and consonants. Vowels are always voiced sounds and they are produced with the vocal cords in vibration, while consonants may be either voiced or unvoiced. Vowels have considerably higher amplitude than consonants, and they are also more stable and easier to analyze and describe acoustically. Because consonants involve very rapid changes, they are more difficult to synthesize properly [6].
Table4. Phonetic Signs of Myanmar Vowels

C. Five Phonological Rules for Myanmar Language
Why construct phonological rules? Myanmar speech has two types: sentence-based speech and word-based speech. Both types are described in this paper. Word-based speech is handled conveniently by applying the Myanmar phonetic dictionary. Sentence-based speech for the Myanmar language is proposed in this research, and its pronunciation problems can be solved by applying phonological rules. Phonological rules are often written using distinctive features, which are natural characteristics that describe the acoustic and articulatory makeup of a sound; by selecting a particular bundle, or "matrix," of features, it is possible to represent a group of sounds that form a natural class and pattern together in phonological rules [10]. The Myanmar language has many phonological rules, both rules that do not depend on part of speech and rules that depend on grammatical structure. The pronunciation problems of sentence-based speech are solved here by applying five phonological rules [4], [7].

D. Algorithms for Five Phonological Rules
Rule 1 uses the substitute for communication theory, computational linguistics (for instance, statistical natural language processing). It uses the reduction phonological rule to reduce vowels with glottalized (ɂ) and nasal (˜) tones. The vowel reduction algorithm for Rule 1 is shown in Figure3 [7].

Figure3. Vowel Reduction Algorithm

Rule 2 describes the pronunciation change from /tiɂ/ to [də], applying the metathesis rule type, when the next phoneme is /ka/, /sa/, /za/, /ta/ or /pa/. The system finds the number /tiɂ/ in the input phoneme sequence and checks whether the next phoneme is /ka/, /sa/, /za/, /ta/ or /pa/. If this condition is true, the system changes /tiɂ/ to [də]. If it is not, processing continues with the next rule. The algorithm for changing to the “DA” pronunciation for the metathesis rule type is illustrated in Figure4 [7].
Figure4. Changing to “DA” Pronunciation Algorithm

Rule 3 inserts a nasal phoneme according to the obstruent type, as shown in Table5. If the obstruent next to the voiced vowel is bilabial, [m] is inserted. If it is dental, [ṉ] is inserted. If it is alveolar, [n] is inserted. If it is palato-alveolar, [ɲ] is inserted. If it is velar, [ŋ] is inserted.

Table5. Obstruent Types and Nasal Phonemes
Rule 3 inserts a nasal phoneme between voiced asats and obstruent consonants. The system finds the voiced vowel asat in the input phoneme sequence and then checks the type of the following obstruent consonant. If the condition is met, the nasal phoneme is inserted between them; if it is not, the system goes on to the next rule. Figure5 (a) describes the process for the five obstruent types. Figure5 (b) shows the detailed procedures for each of the five obstruent types.
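Rules 2 and 3 as described in the text can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the list-of-strings phoneme representation, the `voiced_asats` argument and the small `PLACE_OF` lookup are all assumptions made for the example (the actual obstruent classes are those of Table5).

```python
# Phonemes that trigger the /tiɂ/ -> [də] change (taken from the Rule 2 text).
RULE2_TRIGGERS = {"ka", "sa", "za", "ta", "pa"}

# Nasal inserted for each obstruent type (taken from the Rule 3 text).
NASAL_FOR_PLACE = {
    "bilabial": "m",
    "dental": "ṉ",
    "alveolar": "n",
    "palato-alveolar": "ɲ",
    "velar": "ŋ",
}

# Hypothetical lookup from an initial consonant to its obstruent type.
# Only a few illustrative entries; Table5 is the real source.
PLACE_OF = {"p": "bilabial", "b": "bilabial",
            "t": "alveolar", "d": "alveolar",
            "k": "velar", "g": "velar"}


def apply_rule2(phonemes):
    """Replace 'tiɂ' with 'də' when the next phoneme is a Rule 2 trigger."""
    out = []
    for i, ph in enumerate(phonemes):
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        out.append("də" if ph == "tiɂ" and nxt in RULE2_TRIGGERS else ph)
    return out


def apply_rule3(phonemes, voiced_asats):
    """Insert a nasal between a voiced asat and a following obstruent."""
    out = []
    for i, ph in enumerate(phonemes):
        out.append(ph)
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if ph in voiced_asats and nxt:
            place = PLACE_OF.get(nxt[0])  # obstruent type of the next onset
            if place is not None:
                out.append(NASAL_FOR_PLACE[place])
    return out
```

When neither condition matches, both functions return the sequence unchanged, which corresponds to "the system goes on to the next rule" in the description.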
Figure5. (a) Filling Nasal Phoneme Algorithm
Figure5. (b) Processes of Procedures for Filling Nasal Algorithm

Rule 4 covers phonemes whose pronunciation is unchanged after /aɂ/ အ: unvoiced phonemes are not changed to voiced phonemes when the preceding phoneme is /aɂ/ အ. Rule 5 covers the change from unvoiced to voiced phonemes, which depends on voiced consonants, voiced vowels and voiced asats. If the unvoiced phoneme is in the first position, it is not changed to a voiced phoneme. If an unvoiced consonant occurs with voiced vowels or a voiced asat, it is changed to the voiced phoneme. Rules 4 and 5 are implemented together in one pronunciation algorithm [7], presented in Figure6. This algorithm resolves the confusion between unchanged and changed pronunciation for unvoiced-to-voiced phonemes. The algorithm finds the unvoiced consonants in the input phoneme sequence. When it finds one, the first step is to check its position: if it is in the first position, the unvoiced consonant is not changed to a voiced consonant. If the unvoiced consonant is not in the first position and the current and previous vowels or asats in the phoneme sequence are unvoiced, the consonant is not changed to a voiced consonant. Otherwise, if the preceding phoneme is /aɂ/, the pronunciation is not changed to the voiced consonant [7].
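The decision order described for Rules 4 and 5 can be written as a short sketch. Everything beyond the decision order is an assumption for illustration: the `Syllable` record, the `rhyme_voiced` flag, the way /aɂ/ is detected, and the four unvoiced-to-voiced pairs in `VOICED_OF` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# Illustrative unvoiced -> voiced pairs; the paper's full mapping is not given.
VOICED_OF = {"k": "g", "s": "z", "t": "d", "p": "b"}


@dataclass
class Syllable:
    onset: str          # initial consonant, "" if none
    rhyme: str          # vowel or asat part, e.g. "aɂ"
    rhyme_voiced: bool  # whether the vowel/asat counts as voiced


def apply_rules_4_and_5(syllables):
    """Return the onsets after the voicing decision described in the text."""
    onsets = []
    for i, syl in enumerate(syllables):
        onset = syl.onset
        # An unvoiced consonant in the first position is never changed.
        if onset in VOICED_OF and i > 0:
            prev = syllables[i - 1]
            # Current and previous vowels/asats both unvoiced: no change.
            both_unvoiced = not syl.rhyme_voiced and not prev.rhyme_voiced
            # Rule 4: no change directly after /aɂ/.
            after_glottal_a = prev.onset + prev.rhyme == "aɂ"
            if not both_unvoiced and not after_glottal_a:
                onset = VOICED_OF[onset]  # Rule 5: unvoiced -> voiced
        onsets.append(onset)
    return onsets
```

For example, under these assumptions a sequence of two syllables with voiced vowels and onsets `k`, `p` yields `k`, `b`: the first onset is protected by its position and the second is voiced.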
Figure6. Changing Pronunciation Algorithm

V. DESIGN OF CONCATENATIVE SPEECH SYNTHESIS
Concatenative speech synthesis cuts and pastes short segments of speech: segments are selected from a pre-recorded database and joined one after another to produce the desired utterance. In theory, the use of real speech as the basis of synthetic speech brings the potential for very high quality, but in practice there are serious limitations, mainly due to the memory capacity required by such a system. The longer the selected units are, the fewer problematic concatenation points occur in the synthetic speech, but at the same time the memory requirements increase. Another limitation of concatenative synthesis is the strong dependency of the output speech on the chosen database. For example, the personality or the affective tone of the speech is hardly controllable. Despite its somewhat featureless nature, concatenative synthesis is well suited for certain limited applications [1]. Concatenative synthesis is based on the concatenation, or stringing together, of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. It is easier to obtain a more natural sound with longer units, and high segmental quality can be achieved. Among these techniques, this paper highlights a diphone concatenation-based synthesis technique for Myanmar text-to-speech research.
A. Myanmar Diphone Database Construction
The basic idea behind building a Myanmar diphone database is to list explicitly all possible phone-phone transitions in the language. One technique is to embed target words in carrier sentences to ensure that the diphones are pronounced with acceptable duration and prosody. The speech synthesis unit finds the corresponding pre-recorded sounds in its database and tries to concatenate them smoothly. It uses an algorithm such as TD-PSOLA (Time-Domain Pitch Synchronous Overlap and Add) to make a smooth transition between diphones. The PSOLA method takes two speech signals, one ending with a voiced part and the other starting with a voiced part, and changes their pitch values so that the pitch values on both sides become equal. The advantage of this technique is better output speech compared with other techniques [1]. The diphone database is structured with Arpabet signs so that the retrieved phonemes are easy to understand. After retrieving the phonemes, we could retrieve each individual phoneme from a database and concatenate them using only 50 phonemes; this would be the most economical choice for saving space on embedded devices. Diphones are pairs of partial phonemes. This might be recovered from the pronouncing dictionary by taking into account the 1 or 0 designation applied to vowels concerning stress. Instead of representing a single phoneme, a diphone represents the end of one phoneme and the beginning of another. This is significant because there is less variation in the middle of a phoneme than at its beginning and ending edges [15].
The problem is that this greatly increases the size of the diphone database, to around 10496 diphones for the Myanmar language: 114 x 114 - 2500, where 114 = 22 consonants + 42 exception words + 50 vowels and 2500 = 50 vowels x 50 vowels. Vowel-vowel pairs do not occur in the phoneme sequences of the Myanmar diphone database, so the number of vowel-vowel pairs is subtracted from the total. The Arpabet signs for the 22 consonants are given in Table6 and the Arpabet signs for the 50 vowels [14] in Table7. The diphone list is divided into the following categories [15]: Consonants-Consonants, Consonants-Exception Words, Consonants-Vowels, Exception Words-Consonants, Exception Words-Exception Words, Exception Words-Vowels, Vowels-Consonants, Vowels-Exception Words, Consonants-Silence, Exception Words-Silence, Vowels-Silence, Silence-Consonants, Silence-Exception Words and Silence-Vowels pairs.
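The arithmetic behind the database size and the category list can be checked with a short script. This is only a worked example of the figures stated above; the variable names are illustrative, and the stated count of 10496 covers pairs among consonants, exception words and vowels only, so the silence pairs are listed as categories but not counted.

```python
from itertools import product

# Unit classes and their sizes, as stated in the text.
UNITS = {"Consonants": 22, "Exception Words": 42, "Vowels": 50}

total_units = sum(UNITS.values())        # 22 + 42 + 50
all_pairs = total_units * total_units    # every ordered pair of units
vowel_vowel = UNITS["Vowels"] ** 2       # pairs excluded from the database
diphone_count = all_pairs - vowel_vowel

# Category names: every ordered pair of unit classes except Vowels-Vowels,
# plus the pairs that have Silence on either side.
categories = [
    f"{a}-{b}"
    for a, b in product(UNITS, repeat=2)
    if (a, b) != ("Vowels", "Vowels")
]
categories += [f"{a}-Silence" for a in UNITS]
categories += [f"Silence-{b}" for b in UNITS]

print(total_units, all_pairs, vowel_vowel, diphone_count, len(categories))
```

The script reproduces the stated 114 units and 10496 diphones, and the enumeration yields the same 14 categories as the list in the text.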
Table6. Arpabet Signs for 22 Consonants (columns: Phoneme for Consonants, Arpabet for Diphone)
Table7. Arpabet Signs for 50 Vowels
B. Diphone Recording
The recordings were read by a native Myanmar speaker and were made professionally at LA studio in Mandalay, Myanmar. The diphone database was completed in four hours. Two hundred sentences of different lengths were recorded. The sentences were recorded in order to start building a unit selection database and to be able to re-synthesize the sentences. The sentences were taken from two sources. The first source is news items from a Myanmar daily newspaper, chosen for its use of modern formal language. Approximately thirty sentences were chosen from this source. These sentences have an average length of 20 words, and they were hard to record because of their length. The second source, for the remainder of the sentences, is the “Myanmar Grammar Book” published by the Myanmar language group. These sentences are short and easy to use since the vowelling is already done. The language and grammar in the book are modern, which makes it a good starting point for testing the system.
VI. LABELING DIPHONE INDEX
A diphone database consists of a dictionary file, a set of waveform files and a set of pitch mark files. The dictionary file, also called the diphone index, identifies which diphone comes from which file, and from where in that file. The index consists of a simple header followed by a single line for each diphone: the diphone name, the file name without any extension, a start position in seconds, a mid position and an end position, also in seconds [8]. Table8 describes the labeled diphone index for the sentence “#-KYA-AY4-HT-IY2-Y-AA2-TH-IY1-D-AH-D-AW1-T-EH4-D-AH-D-AW1-Z-IH2-PHY-IH3-AA3-SA-IH2-AA3-T-EH4-MY-AA2-TH-IY1-#”.

Table8. Labeling the Diphone Index
Each diphone runs from the mid pitch mark of the first phone to the mid pitch mark of the following phone. Each pitch mark file consists of a simple list of the positions of the pitch marks in the file, in seconds and in order, one per line.
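The index and pitch mark layouts described above are simple enough to parse directly. The following is a minimal sketch that assumes whitespace-separated fields in the order given in the text (name, file, start, mid, end); the header format is not specified in the paper, so header handling is left out, and the names `DiphoneEntry`, `parse_index_line` and `parse_pitch_marks` are hypothetical.

```python
from typing import List, NamedTuple


class DiphoneEntry(NamedTuple):
    name: str     # diphone name
    file: str     # file name without extension
    start: float  # start position in seconds
    mid: float    # mid position in seconds
    end: float    # end position in seconds


def parse_index_line(line: str) -> DiphoneEntry:
    """Parse one diphone line of the index (assumed whitespace-separated)."""
    name, file, start, mid, end = line.split()
    entry = DiphoneEntry(name, file, float(start), float(mid), float(end))
    if not entry.start <= entry.mid <= entry.end:
        raise ValueError(f"positions out of order for {name}")
    return entry


def parse_pitch_marks(text: str) -> List[float]:
    """Parse a pitch mark file: one position in seconds per line, in order."""
    marks = [float(tok) for tok in text.split()]
    if marks != sorted(marks):
        raise ValueError("pitch marks are not in increasing order")
    return marks
```

A line such as `KYA-AY4 sent001 0.120 0.185 0.260` (an invented example, not taken from Table8) would parse into an entry whose `mid` field marks the boundary between the two phones.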
VII. DIPHONE-CONCATENATIVE SYNTHESIS
Diphone-concatenative speech synthesis joins one phone with another so as to reduce the discontinuity at the joints between phones. This paper highlights a diphone concatenation-based synthesis technique. Synthesis is a well-known challenge for high-quality speech production in a Myanmar text-to-speech system. In concatenative synthesis, the most common unit choices are phonemes and diphones, because they are short enough to attain sufficient flexibility and to keep the memory requirements reasonable. Using diphones gives good quality because a diphone contains the transition from one phoneme to another: the latter half of the first phoneme and the former half of the second phoneme. Consequently, the concatenation points are located at the center of each phoneme, and since this is usually the steadiest part of the phoneme, the distortion at the boundaries can be expected to be minimal. While the sufficient number of different phonemes in a database is typically around 200, the corresponding number of diphones is from 4500 to 5000, but a synthesizer with a database of this size is generally implementable. To avoid audible distortions caused by differences between successive segments, at least the fundamental frequency and the intensity of the segments must be controllable. The creation of natural prosody in synthetic speech is impossible with present-day methods, but some promising methods for removing the discontinuities have been developed. Finally, concatenative speech synthesis is burdened by the troublesome process of creating the database from which the units are selected. Each phoneme, together with all of the needed allophones, must be included in the recording, and then all of the needed units must be segmented and labeled to enable searching the database.

VIII. TD-PSOLA METHOD
The Pitch Synchronous Overlap and Add method was developed at France Telecom (CNET). It allows prerecorded speech samples to be concatenated smoothly and provides good control over pitch and duration. The time-domain version, TD-PSOLA, is the most commonly used because of its computational efficiency. The basic algorithm consists of three steps:
1. The original speech signal is divided into separate short analysis signals.
2. Each analysis signal is modified into a synthesis signal.
3. In the synthesis step, these segments are recombined by overlap-adding [1].
The purpose of TD-PSOLA is to modify the pitch or timing of a signal, as shown in Figure7. The algorithm finds the pitch points of the signal and then applies a Hamming window centered on each pitch point and extending to the next and previous pitch points. To slow the speech down, the system duplicates frames. To speed the speech up, the system removes frames from the signal.
Figure7. TD-PSOLA Algorithm

TD-PSOLA requires exact marking of pitch points in the time-domain signal. Marking any point within a pitch period is acceptable as long as the algorithm marks the same point in every frame. The most common marking point is the instant of glottal closure, which is identified by a quick descent in the time-domain signal. The algorithm creates an array of sample numbers that form an analysis epoch sequence P = {p_1, p_2, …, p_n} and estimates the pitch period distance (p_k - p_{k+1})/2 to get the mid-point of the pitch marking. Table9 gives the overlapping times according to pitch marks with 0.03s for the waveforms of the voices. This table shows the 22 diphone pairs for the 20-word Myanmar
sentence. The diphone-concatenation pairs for this sentence have the 20 pairs for speech synthesis. The comparisons of the waveforms are shown below. Figure8 shows the original waveform without the TD-PSOLA method; its length is 1.206s for the first 4 words, “#-KYA-AY4-HT-IY2-Y-AA2-TH-IY1”.
Figure8. Original Waveform of First 4 Words

Table9 Defining the Pitch Marks with 0.03s
Figure9 shows the overlapping waveforms with 0.03s pitch marks. The TD-PSOLA method smooths the waveform at each joint between one waveform and the next for the Myanmar language. With each joint overlapped by 0.03s, the speech is faster and smoother than the original speech waveforms. The total length of the overlapped speech waveforms is shorter than that of the original waveforms without any method: the length is reduced by 0.05s from 1.206s. Table10 shows the overlapping of pitch marks with 0.05s between one joint and the others.
Figure9. Overlapping with 0.03s Pitch Marks
Table10 shows the overlapping pitch marks with 0.05s for the waveforms of the voice for the 20 words of the Myanmar sentence. The Hanning window is calculated with an overlap of 0.05s pitch marks in each diphone label. The start and end values of the Hanning window change according to the overlapping pitch mark values. The sound quality is better than with the 0.03s overlapping pitch marks.
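How the window start and end values shift with the overlap can be illustrated with a small Python sketch. The segment durations below are hypothetical, the function name joint_windows is an assumption, and the sketch assumes that the overlap is no longer than any segment.

```python
from typing import List, Tuple

def joint_windows(durations: List[float],
                  overlap: float) -> Tuple[List[Tuple[float, float]], float]:
    """Return the (start, end) time of the overlap window at each joint
    and the total length of the concatenated output, all in seconds."""
    if not durations:
        return [], 0.0
    windows = []
    start = 0.0  # start time of the current segment in the output
    for d in durations[:-1]:
        joint_start = start + d - overlap
        windows.append((joint_start, joint_start + overlap))
        start = joint_start  # the next segment begins where the overlap begins
    return windows, start + durations[-1]

# Hypothetical diphone durations in seconds.
windows, total = joint_windows([0.30, 0.25, 0.40], overlap=0.05)
print(windows, total)
```

With these values the two joints fall at roughly 0.25s to 0.30s and 0.45s to 0.50s, and the total is the sum of the durations minus one overlap per joint.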
Table10 Defining the Pitch Marks with 0.05s
Figure10 shows the overlapping waveforms with 0.05s pitch marks. The TD-PSOLA method smooths the joints between one waveform and the next for the Myanmar language. With each segment overlapped by 0.05s, the speech is faster and smoother than both the original speech waveforms and the waveforms overlapped with 0.03s pitch marks. The total length of the overlapped speech waveform is shorter than that of the original waveform produced without any method: it is reduced by 0.12s from 1.206s.
Figure10. Overlapping with 0.05s Pitch Marks
IX. EXPERIMENTAL RESULTS
This paper gives the results for diphone-concatenation with the TD-PSOLA method. The system is tested with 200 Myanmar sentences whose structure is very complex. The Myanmar diphone database stores over 5000 diphones for these sentences. First, the system accepts the segmented Myanmar sentence and produces the phonetic sequence, with its pairs of consonants and vowels, by using the Myanmar phonetic dictionary in the grapheme-to-phoneme stage [4]. The system then checks the phonetic sequence against phonological rules to obtain the prosodic features. Finally, it produces high quality speech by applying the Myanmar diphone database with the concatenation method that uses the TD-PSOLA algorithm. The experimental results of diphone-concatenation speech synthesis are evaluated with precision, recall and F-measure. The results for 14 types of diphone pairs, out of a total of 5275 diphone pairs for the 200 sentences, are shown in Table11.
Table11 Experimental Results for Diphone-Concatenation
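For reference, precision, recall and F-measure can be computed as in the following Python sketch. The counts are hypothetical and are not taken from Table11, and the paper does not state how a correct diphone pair is counted, so the true-positive, false-positive and false-negative interpretation here is an assumption.

```python
from typing import Tuple

def precision_recall_f(true_pos: int, false_pos: int,
                       false_neg: int) -> Tuple[float, float, float]:
    """Standard precision, recall and balanced F-measure from raw counts."""
    retrieved = true_pos + false_pos   # pairs the system produced
    relevant = true_pos + false_neg    # pairs it should have produced
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical counts for one diphone pair type.
p, r, f = precision_recall_f(true_pos=90, false_pos=10, false_neg=30)
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
```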
X. TESTING MYANMAR SPEECH QUALITY
The naturalness and intelligibility of the Myanmar speech were tested with 7 female participants between the ages of 16 and 40. The test is divided into two parts and uses 20 pairs of confusable words. The first part tests the naturalness of the diphone-concatenative speech synthesis. The second part tests how much of what the voice said the participants understood. The participants heard one word at a time and marked on the answer sheet which of the two words they thought was correct.
A. Naturalness
The results of listening to the grapheme-to-phoneme output compared with listening to the diphone-concatenation output are shown in figure12 below. The system was tested with a 20-word sentence of complex structure. The listeners were asked whether or not the voice is nice to listen to. For grapheme-to-phoneme conversion, 70% considered the voice natural, 60% thought that the naturalness of the voice was acceptable and 25% considered the voice unnatural. For diphone-concatenation synthesis, 95% considered the voice natural, 72% thought that the naturalness of the voice was acceptable and 5% considered the voice unnatural. The results changed slightly after the second listening.
Figure12 Comparison of Naturalness
B. Intelligibility
The intelligibility results for the voices of grapheme-to-phoneme conversion and of diphone-concatenation synthesis are shown in figure13. The system was tested with a 20-word sentence of complex structure. The listeners were asked whether they understood the voice and how much of what the voice said they understood. For grapheme-to-phoneme conversion, 20% of the participants understood the voice very well, 56% understood well, 11% neither much nor little, and another 15% understood a little, i.e. not very well. For diphone-concatenation, 82% of the participants understood the voice very well, 88% understood well, 7% neither much nor little, and another 5% understood a little. The comparison of intelligibility is shown in figure13.
Figure13 Intelligibility of the Voice
XI. CONCLUSIONS
This paper describes diphone-concatenation speech synthesis with the TD-PSOLA method, tested with 200 Myanmar sentences of complex structure. The Myanmar diphone database stores over 5000 diphones for these sentences. First, the system accepts the segmented Myanmar sentence and produces the phonetic sequence, with its pairs of consonants and vowels, by using the Myanmar phonetic dictionary in the grapheme-to-phoneme stage. The system then checks the phonetic sequence against phonological rules to obtain the prosodic features. Finally, it produces high quality speech by applying the Myanmar diphone database with the concatenation method that uses the TD-PSOLA algorithm. This paper compares the overlapping method with different time-domain settings, namely 0.03s and 0.05s pitch marks of Hanning windows. Overlapping with 0.05s pitch marks gives better results than both the original waveforms and the waveforms overlapped with 0.03s pitch marks. The paper also compares the speech quality, in terms of naturalness and intelligibility, of the grapheme-to-phoneme output and the diphone-concatenation synthesis. The system is suitable for simple Myanmar sentences. It cannot yet handle the Pali and Sanskrit of the Myanmar language. The system currently applies five phonological rules for changing unvoiced to voiced pronunciations; these can be extended with 15 other phonological rules. This paper describes the TD-PSOLA method for a Myanmar language TTS system. The TTS system can be extended with other methods, such as a source-filter model with Festival tools or diphone-concatenative synthesis with the FD-PSOLA method.
ACKNOWLEDGMENT
I wish to express my deep gratitude and sincere appreciation to all persons who contributed towards the success of my research. It was a great opportunity to study Myanmar text-to-speech in one of the most famous research areas in the world. I would like to respectfully thank Dr. Mie Mie Thet Thwin, Rector of the University of Computer Studies, Mandalay (UCSM), for her precious advice, patience and encouragement during the preparation of my research. I am grateful to my supervisor, Dr. Aye Thida, an Associate Professor of Research and Development Department (1) at the University of Computer Studies, Mandalay (UCSM), Myanmar, and one of the leaders of the Natural Language Processing Project, for helping me during the preparation of my research. I am also deeply thankful to Dr. Aye Thida for her advice from the point of view of natural language processing. I also take this opportunity to thank all our teachers of the University of Computer Studies, Mandalay (UCSM), for their teaching and guidance during my research life. I especially thank my parents, my sisters and all my friends for their encouragement, help and kindness, for providing many useful suggestions and for giving me their precious time during the preparation of my research.
Miss Ei Phyu Phyu Soe is a Ph.D. candidate in computer science at the University of Computer Studies, Mandalay, Myanmar. Her research interests include Data Mining, Database Management Systems, Natural Language Processing and Linguistic Research. She is currently working on research in Speech Synthesis for the Myanmar Language. Ei Phyu Phyu Soe received her B.C.Sc and M.C.Sc degrees from the Computer University, Mandalay, Myanmar.
REFERENCES
[1] S. Lemmetty, “Review of Speech Synthesis Technology”, Master's Thesis, Helsinki University of Technology, 1999.
[2] Tun Thura Thet, Jin-Cheon Na, Wunna Ko Ko, “Word segmentation for the Myanmar language”.
[3] Dr. Thein Tun, “Acoustic Phonetics and the Phonology of the Myanmar Language”, School of Human Communication Sciences, La Trobe University, Melbourne, Australia, 2007.
[4] Ei Phyu Phyu Soe, “Grapheme-to-Phoneme Conversion for Myanmar Language”, the 11th International Conference on Computer Applications (ICCA 2013).
[5] “Phoneme”, April 2012.
[6] D.J. Ravi, Research Scholar, “Kannada Text to Speech Synthesis Systems: Emotion Analysis”, JSS Research Foundation, S.J College of Engg, Mysore-06, 2010.
[7] Ei Phyu Phyu Soe, “Prosodic Analysis with Phonological Rules for Myanmar Text-to-Speech System”, AICT 2013.
[8] Alan W Black and Kevin A Lenzo, “Building Synthetic Voices”, For FestVox 2.0 Edition, Language Technologies Institute, Carnegie Mellon University and Cepstral, LLC, 2003.
[9] Tractament Digital de la Parla, “Introduction to Speech Processing”.
[10] Bruce Hayes, “Introductory Phonology”, Blackwell Textbooks in Linguistics, Wiley-Blackwell, 2009. ISBN 978-1-4051-8411-3.
[11] International Phonetic Association, “Phonetic description and the IPA chart”, Handbook of the International Phonetic Association: a guide to the use of the international phonetic alphabet, Cambridge University Press, 1999.
[12] Kleijn K., Paliwal K. (editors), “Speech Coding and Synthesis”, Elsevier Science B.V., the Netherlands, 1998.
[13] Santen J., Sproat R., Olive J., Hirschberg J. (editors), “Progress in Speech Synthesis”, Springer-Verlag New York Inc, 1997.
[14] “Arpabet”, 26 October 2012 [online], 2012.
[15] Maria Moutran Assaf, “A Prototype of an Arabic Diphone Speech Synthesizer in Festival”, 2005.
Dr. Aye Thida, Research and Development Department (1), University of Computer Studies, Mandalay (UCSM), Myanmar. Myanmar to English Translation System Project (NLP).
Dr. Aye Thida is an Associate Professor of Research and Development Department (1) at the University of Computer Studies, Mandalay (UCSM), Myanmar. She was one of the leaders of the Natural Language Processing Project. Her team developed the Myanmar to English Translation System in 2011. Her research interests include Distributed Processing, Queuing and Natural Language Processing. She is currently working on the Myanmar to English Translation System Project. Dr. Aye Thida received her B.Sc (Hons) Maths degree from Mandalay University, Myanmar, and her M.I.Sc and Ph.D degrees in Computer Science from the University of Computer Studies, Yangon (UCSY), Myanmar.