2013 International Conference on Control Communication and Computing (ICCC)
Implementation of Malayalam Text to Speech Using Concatenative Based TTS for Android Platform

Arun Gopi, Shobana Devi P, Sajini T, Bhadran V K
Language Technology Section, Centre for Development of Advanced Computing, Thiruvananthapuram, India
[email protected], [email protected], [email protected], [email protected]
Abstract—Recent development in text to speech has shifted to concatenative synthesis, using either original speech segments or parametric synthesis. The former type of TTS system gives better quality output, since it uses original speech segments for concatenation. There are a number of other techniques for speech generation, such as PSOLA, TD-PSOLA and MBROLA. This paper describes the development and implementation of a concatenative system based on the Epoch Synchronous Non Overlap and Add (ESNOLA) technique for Malayalam on the Android platform. The TTS uses diphone-like segments (partnemes) as the basic units for concatenation. The database contains 1500 partnemes, which are used for generating speech for unlimited-domain text. The paper also briefs the implementation of the Malayalam TTS, the database generation, the modifications done for the Android platform, the database access, the handling of Malayalam character display on Android, and the support provided in the TTS app for displaying characters with proper rendering. The app supports Android versions up to Android 4.2. The design of News Reader, an application built on this TTS for Android, is also discussed. The TTS gives a Mean Opinion Score (MOS) of 3.2 in the perceptual test.

Keywords—esnola; concatenative synthesis; partneme; psola; text normalisation

I. INTRODUCTION

Digital systems have become part and parcel of daily life. Devices like PCs and mobile phones, and the applications used day to day, have become complex and automated. The interaction between man and machine has become more demanding, giving rise to new requirements for Man Machine Interaction (MMI). Speech is the most natural way for humans to communicate, and speech technology has been under development for several decades (Santen et al. 1997, Kleijn et al. 1998). Applications supporting MMI in the local language find greater acceptance. In the present scenario mobile devices/phones have become a part of human life; the usage of mobile applications has increased drastically and may replace the PC in the near future. The mobile devices now available in the market have significant computational capabilities and have become near-by, always-on internet access points with satisfactory audio and display. The widespread availability and popularity of mobile phones raises the need for PC-based applications to be made available on the mobile platform.

One of the main issues the Differently Abled (DA) community experiences is the day to day challenge of coping with their impairment. Mobile phones are a boon to them, and mobile technology nowadays focuses on providing full accessibility support. A mobile device equipped with specific applications can aid visually impaired/blind people in daily functioning. An application like TTS helps them in many ways: as a text reader they can carry around, as an online news reader, or as a navigation aid in Global Positioning System (GPS) enabled devices. Such apps improve their living standard and reduce their dependency in daily life. TTS gives computers the ability to produce sound that resembles human speech. Even though synthetic speech cannot imitate the full spectrum of human cadences and intonations, it can read text and generate intelligible speech output with fairly good naturalness. The different techniques for speech synthesis are articulatory synthesis, formant synthesis, concatenative synthesis, parametric synthesis, etc. Articulatory synthesis generates speech by modelling the human vocal organs as faithfully as possible, but is very complex. Formant synthesis models the pole frequencies of the speech signal, or the transfer function of the vocal tract, based on the source-filter model; this approach uses heuristic rules to derive the model and is time consuming. A concatenative system generates speech by concatenating original speech segments, and its quality depends on the unit used for concatenation and on the size of the database. The concatenative system gives the most satisfactory output, since the knowledge about the transitions is embedded in the segments and the system uses pre-recorded samples of different lengths derived from natural speech. Statistical parametric speech synthesis based on hidden Markov models (HMMs) has many advantages over concatenative speech synthesis, but its major limitation is the quality of the synthesized speech (Zen et al. [6]). A TTS system based on concatenative synthesis techniques is one of the most successful approaches for synthesizing speech [2].
The quality of a TTS system is determined by two factors: intelligibility and naturalness. Intelligibility is associated mainly with consonants, where the place of articulation provides finer discrimination than vowels do; naturalness is associated with vowels [1].
We have chosen the Android platform for the application development since Android has the largest installed base of any mobile platform and is growing fast. Android also provides a world-class platform for creating apps, as well as an open marketplace for distributing them instantly. In this paper we describe the development of a TTS for Malayalam for the mobile platform using the ESNOLA engine. The paper also describes the implementation of an online news reader, an application focusing on visually impaired and low-vision users, for reading online articles on the Android platform; the application can download news from the links of online news sites supporting Unicode encoding. The paper is organized as follows: Section II covers the details of the Malayalam TTS and existing TTS systems on mobile devices; Section III gives the details of porting the TTS to the Android platform; Sections IV and V discuss the issues in the current implementation and future plans.

II. ESNOLA TTS FOR MALAYALAM

The main modules of a TTS system are the text analyzer and the synthesizer. The text analyzer, along with the Natural Language Processing (NLP) unit, generates the phonetic string in a format that can be processed by the synthesizer. The NLP module determines the accuracy/correctness of the units chosen for concatenation. The synthesizer reads this input and identifies the units (partnemes, i.e., parts of phones) to be concatenated. The synthesizer then selects the partnemes from the database and concatenates them to generate the speech output. During concatenation the synthesizer applies some signal processing to adjust the pitch and duration. The synthesizer also takes care to reduce the distortion at the concatenation points by ensuring epoch synchronous concatenation.

The block diagram of an ESNOLA based TTS is given in Fig. 1. The text analyzer accepts the input string (Unicode text), normalizes the string with language specific modifications, applies language specific rules and generates the phonetic string. The synthesizer is the core part, which converts the string to a waveform with fairly good naturalness.

Fig. 1. Block diagram of ESNOLA based TTS

A. The Text Analyzer Module

The text analyzer is the front-end text processing unit, which processes the text and generates a text output after applying language specific rules. The text analyzer accepts text in standard encoding (Unicode/UTF-8) as input and generates the phonetic string as output. The text analyzer reads the input text and cleans it; cleaning is done to remove all unwanted characters, common typing errors, etc. The cleaned text is then parsed by the NLP module to extract abbreviations, numbers, etc. and to apply LTS rules. Abbreviations are replaced with their expansions; in the current implementation, frequently used abbreviations are handled. Numbers in figures must be converted to words before being passed to the synthesizer. The normalization [3] of numbers in Malayalam is not as simple as in other Indian languages like Hindi, Bangla, etc., because numbers usually appear in figures along with suffix patterns. The text analyzer identifies the number and the suffix, and replaces them with the equivalent expansion after applying agglutination rules (Fig. 2).
Fig. 2. Number suffix pattern handling
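As a rough illustration of this suffix-aware expansion, the sketch below splits a figure from its trailing suffix and delegates to placeholder helpers; the helper names and the trivial agglutination shown here are hypothetical stand-ins, not the system's actual rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of suffix-aware number normalization. numberToWords() and
// agglutinate() are hypothetical placeholders; the real analyzer
// applies Malayalam agglutination rules when fusing the expansion
// with its suffix.
public class NumberNormalizer {
    // A number in figures, optionally followed by a suffix pattern.
    private static final Pattern NUM_WITH_SUFFIX = Pattern.compile("(\\d+)(\\S*)");

    public static String normalize(String text) {
        Matcher m = NUM_WITH_SUFFIX.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String words = numberToWords(Long.parseLong(m.group(1)));
            String suffix = m.group(2);
            String expanded = suffix.isEmpty() ? words : agglutinate(words, suffix);
            m.appendReplacement(out, Matcher.quoteReplacement(expanded));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String numberToWords(long n) {
        return "<" + n + "-in-words>";   // hypothetical expander
    }

    private static String agglutinate(String words, String suffix) {
        return words + suffix;           // the real rules are language specific
    }
}
```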
The NLP module also applies letter-to-sound (LTS) rules to this preprocessed string. Applying LTS rules for Indian languages is easier than for languages like English, which rely on dictionary lookup to find pronunciations, since Indian languages are phonetic in nature. Even though they are said to be phonetic, some pronunciation exceptions exist, and these exceptions must be handled to improve the quality of the synthesized speech. A dictionary-based approach is not feasible for Indian languages because of their agglutinative nature.
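A minimal sketch of this rule-plus-exception scheme is given below, assuming illustrative placeholder entries rather than the actual Malayalam rules of [5]; exception patterns are consulted before the general LTS rules are applied.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of rule-based letter-to-sound conversion with an exception
// list applied first. The entries shown are illustrative placeholders.
public class LetterToSound {
    private static final Map<String, String> EXCEPTIONS = new LinkedHashMap<>();
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        EXCEPTIONS.put("<exception-word>", "<its-pronunciation>");
        RULES.put("<grapheme-sequence>", "<phone-sequence>");
    }

    public static String toPhonetic(String word) {
        // Exception patterns override the general rules.
        String exact = EXCEPTIONS.get(word);
        if (exact != null) return exact;
        String result = word;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replace(rule.getKey(), rule.getValue());
        }
        return result;
    }
}
```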
Pronunciation for Malayalam is generated by a semi-automatic process, which uses language specific rules and exception patterns for generating the pronunciation of a given input string [5]. For good quality speech synthesis, the use of proper pronunciation is essential. In addition to the letter-to-sound rules mentioned in [4], pronunciation variation for /k/ gemination is also incorporated; taking care of such variations ensures the availability of all the variants in the database. The NLP module applies the language specific rules and generates the equivalent pronunciation.
For naturalness in the synthesized speech, proper pausing must be added. The NLP module inserts pause tags as required by identifying prosodic phrases, based on prosodic phrasing rules. The rules for phrase breaking and the durations of pauses are derived from an annotated corpus.
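A toy sketch of pause-tag insertion is shown below; since the corpus-derived phrasing rules are not reproduced in the paper, punctuation is used here as a hypothetical stand-in for prosodic phrase boundaries.

```java
// Minimal sketch of pause-tag insertion. The real module derives its
// phrase-break rules and pause durations from an annotated corpus;
// this stand-in only breaks at punctuation.
public class PauseInserter {
    public static String insertPauses(String text) {
        return text.replaceAll("([,;:])\\s*", "$1 <pause short> ")
                   .replaceAll("([.?!])\\s*", "$1 <pause long> ");
    }
}
```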
B. Speech Database

The speech database is created using partnemes, which are diphone-like units covering all possible combinations for the language. The advantage of using partnemes as the basic unit is the simplicity of introducing intonation and prosodic rules into the synthesized speech signals [1]. The creation of the partneme database involves recording nonsense words covering all possible combinations in a neutral mode, with almost constant pitch and amplitude. These recordings are then amplitude-normalized and segmented to obtain the partnemes. The details of the speech database creation process are as follows.

1) Creation of speech corpus for Malayalam: The ESNOLA engine uses original speech segments to create the database, so a speech corpus covering all possible variations that may occur in the language is a mandatory requirement. In Malayalam, even though we have both dental and alveolar 'NA', the orthography is the same: the script 'ന' is used to represent both. Similarly, there exist variations of 'KA' when it occurs in inter-vowel position and when it is preceded or followed by 'YA'. All such variations must be incorporated in the database for generating accurate speech output. For creating the database, nonsense word sets of the form CVCVCV are prepared for all the identified phones. Nonsense words are used to reduce the influence of nearby phones and to reduce the variations in pitch and duration. A separate set of words is prepared for vowel-vowel transitions.

2) Voice Recording: The quality of the speech output directly depends on the quality of the partneme segments in the database, so the recording must be done carefully. A good professional voice with consistent quality, ensuring consistency across different recording sessions in terms of amplitude, pitch and rate, is selected for the voice recording. Recording must be done in an acoustically treated room, preferably a professional studio, with the specification 16-bit, 22050 Hz, mono, raw Pulse Code Modulation (PCM).

3) Pitch Normalization: The pitch must be the same for all segments in the database. Even though we take care of this while recording, in practice it is not fully achievable, and partneme segments created from the recordings may have pitch variation. Constant pitch is necessary to avoid pitch mismatch; a variation of about ±10% is tolerable. Pitch normalization is done before segmentation to attain exactly the same fundamental frequency; this adjustment is done if the difference exceeds 10% of the fundamental frequency [1].

4) Amplitude Normalization: Amplitude normalization is done to reduce the spectral mismatch at the concatenation points. This is done for all the segments, such as Consonant-Vowel (CV), Vowel-Consonant (VC) and Vowel-Vowel (VV), relative to the vowel (a sketch is given after Fig. 3.4).

5) Partneme Segmentation: The partneme segments are divided into two groups: the pure forms and the transitions. The pure forms consist of the segments of occlusion or voice bar along with the plosion, affrication, nasal murmurs, etc. The transitions group in the partneme database represents all the co-articulatory regions for CV, VC and VV. Vowels are represented by a single cycle, which is a segment selected from the utterance of a pure vowel. The Voice Onset Time (VOT), which is an integral part of consonants like plosives, is not included in the consonant part; it is covered in the VC and CV transitions. Figs. 3.1-3.4 show the details of the segmentation done for the different phoneme groups. The co-articulation of VC includes the VOT and the VC transition till the steady state of the vowel.

Fig. 3.1. Representation of vowel 'A' in the database
Fig. 3.2. Representation of fricative 'C' in the database
Fig. 3.3. Representation of a vowel-consonant transition in the database
Fig. 3.4. Representation of a vowel-consonant transition in the database
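Returning to step 4 above, the sketch below normalizes a segment's amplitude relative to the reference vowel by matching RMS energy; the RMS criterion is our assumption, as the paper does not spell out the exact measure used.

```java
// Sketch of amplitude normalization of a CV/VC/VV segment relative to
// the reference vowel, by matching RMS energy (assumed criterion).
public class AmplitudeNormalizer {
    static double rms(short[] samples) {
        double sum = 0;
        for (short v : samples) sum += (double) v * v;
        return Math.sqrt(sum / samples.length);
    }

    // Scale 'segment' in place so its RMS matches that of 'vowel'.
    static void normalizeTo(short[] segment, short[] vowel) {
        double gain = rms(vowel) / Math.max(rms(segment), 1e-9);
        for (int i = 0; i < segment.length; i++) {
            int v = (int) Math.round(segment[i] * gain);
            segment[i] = (short) Math.max(Short.MIN_VALUE,
                                          Math.min(Short.MAX_VALUE, v));
        }
    }
}
```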
C. Synthesizer

The synthesizer identifies the segments to be concatenated from the phonetic string, which represents the actual pronunciation. The synthesizer has a token generation module, which generates tokens using the token generation rules given below; these tokens are used to identify the partnemes for concatenation. The speech generation module extracts the starting address from the header, reads the data from the database and concatenates the segments. Joining of speech segments is done at epoch synchronous points to ensure minimal distortion at the concatenation points.
A sample of the token generation rules is given below. Tokens are generated based on the succeeding and preceding phones; these rules are language specific. The tokens correspond to the indexing of the segmented partneme voice signals in the speech database header.

Token generation rules:

CVCV → C + CV + V + VC + C + V + Vout   (1)
VCV → Vin + V + VC + C + CV + V + Vout   (2)
CVYV → C + CV + V + VY + YV + Vout   (3)
CVV → C + CV + VV + Vout   (4)
C1C2V → C1 + C2 + C2Vin + V + Vout   (5)
Here Vin, Vout, V, C1 and C2 represent the fade-in vowel, the fade-out vowel, the medial vowel (the steady portion of the vowel) and the consonants, respectively.
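As an illustration, the sketch below applies rule (1) to a CVCV input; the token naming convention (e.g., the "_out" marker for the fade-out vowel) is hypothetical, and the real rule tables cover many more patterns.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of applying token generation rule (1):
//   CVCV -> C + CV + V + VC + C + V + Vout.
// Phones are assumed to be already classified as consonant (C) or
// vowel (V); token names here are illustrative.
public class TokenGenerator {
    static List<String> tokensForCVCV(String c1, String v1, String c2, String v2) {
        List<String> tokens = new ArrayList<>();
        tokens.add(c1);            // C
        tokens.add(c1 + v1);       // CV transition
        tokens.add(v1);            // medial vowel
        tokens.add(v1 + c2);       // VC transition
        tokens.add(c2);            // C
        tokens.add(v2);            // medial vowel
        tokens.add(v2 + "_out");   // fade-out vowel
        return tokens;
    }
}
```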
A direct offset calculation method is used to compute the offset of each segment, reducing the search time in the database. The offset gives the byte position at which the samples for a partneme start, and it is calculated from the token index. The offsets for tokens corresponding to a vowel and a consonant are calculated as below.
$$V_{i\,\mathrm{offset}} = S_{\mathrm{offset}} + (I_{vi} + V_{\mathrm{offset}}) \times B_n \qquad (6)$$

$$C_{i\,\mathrm{offset}} = S_{\mathrm{offset}} + (I_{ci} + C_{\mathrm{offset}}) \times B_n \qquad (7)$$
where $V_{i\,\mathrm{offset}}$ is the byte offset corresponding to the i-th vowel in the database, $C_{i\,\mathrm{offset}}$ is the byte offset corresponding to the i-th consonant in the database, $S_{\mathrm{offset}}$, $V_{\mathrm{offset}}$ and $C_{\mathrm{offset}}$ are the starting positions of the data, vowel and consonant blocks in the speech database, $I_{vi}$ and $I_{ci}$ are the indices of the vowel and consonant in the lookup table, and $B_n$ is the byte size of each data entry. The offset for transitions is calculated similarly, but depends on both the V and the C in the transition.

For ease of access and management, the database is stored as a single file: a continuous stream of data with a header block and a data block. The header block holds the address of each partneme in a specific order, and the data block holds the size information and sample values of each partneme, stored in the same sequence as the header. The header acts as the lookup table for the starting address of each partneme. During wave generation the synthesizer gets the starting address of each partneme from the header and copies the required number of samples corresponding to the partneme segment into a buffer. These segments are concatenated to generate the speech output. Database creation is thus a very crucial step in a concatenative system.
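The direct offset computation of equations (6) and (7) is straightforward to express in code; in the sketch below the field names are illustrative, but the arithmetic follows the equations.

```java
// Direct offset calculation after equations (6) and (7): start of the
// data + (index within the lookup table + block offset) * bytes per entry.
// Field names are illustrative placeholders.
public class PartnemeOffsets {
    long sOffset;   // S_offset: start of the data in the database file
    long vOffset;   // V_offset: start of the vowel entries
    long cOffset;   // C_offset: start of the consonant entries
    int  byteSize;  // B_n: byte size of each data entry

    long vowelOffset(int indexInTable /* I_vi */) {
        return sOffset + (indexInTable + vOffset) * byteSize;   // eq. (6)
    }

    long consonantOffset(int indexInTable /* I_ci */) {
        return sOffset + (indexInTable + cOffset) * byteSize;   // eq. (7)
    }
}
```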
Selected segments are concatenated to generate the output signal. The synthesizer performs spectral smoothing at the concatenation points to remove mismatch and other spectral disturbances. Spectral smoothing is done using proper windowing of the output signal. The window for spectral smoothing is defined as

$$W(n) = \begin{cases} \dfrac{1}{2}\left(1 - \cos\dfrac{\pi n}{N}\right), & 0 < n < 0.125N \\[4pt] 1, & 0.125N < n < 0.625N \end{cases}$$

D. Existing Systems for the Mobile Android Platform

Early works on developing a TTS system on a mobile device focused mainly on migrating an existing TTS system from a resourceful platform to a resource-limited platform [8][9]. Since space was limited, diphone synthesis was utilized. Researchers have attempted to improve output quality using unit selection synthesis; Tsiakoulis et al. [10] used statistical parametric synthesis to make the unit selection database small enough for embedded devices without much reduction in quality. Pucher and Frohlich [7][11] used a server-based approach and transferred the waveform to the mobile device. Many TTS-based applications are available for other languages like English, but the scenario is different for Indian languages. Only a few research developments have been carried out for Indian-language TTS on mobile platforms, and very little work has been done on developing a mobile TTS for a regional language like Malayalam. Some research has been done on implementing TTS for Malayalam using Flite, but further work was not carried out. At present there is no TTS application for Malayalam on the Android platform that supports Unicode and proper rendering.

III. PORTING TTS TO ANDROID PLATFORM

A. Handling of Malayalam Scripts

One of the major challenges in porting the Malayalam TTS is providing Malayalam script rendering support, which is necessary for the proper rendering of the font in the application editor. The Android system font does not have glyphs for Malayalam, so Malayalam characters cannot be rendered or drawn. One easy method is to modify the system font and introduce it into the system via adb (Android Debug Bridge), but this requires rooting the device and therefore cannot be used. Merely introducing fonts does not solve the problem, because Malayalam script requires reordering of glyphs. Even though Android mobiles support Unicode, proper rendering is not provided. A few models have Indian versions which support Indian languages, but these are not available for devices purchased outside India. So, in order to make the application accessible from any region, the handling of Malayalam scripts is done by setting a Malayalam font, specific to the application editor, using Android programming constructs. Reordering of glyphs is done based on reordering rules for Malayalam.
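A minimal sketch of the font workaround is shown below, using the standard Typeface.createFromAsset API; the bundled font path is a hypothetical placeholder, and glyph reordering still has to be handled separately.

```java
import android.app.Activity;
import android.graphics.Typeface;
import android.os.Bundle;
import android.widget.EditText;

// Sketch of attaching an application-specific Malayalam font to the
// editor, assuming a font bundled at assets/fonts/malayalam.ttf
// (hypothetical path). No device rooting or system-font change needed.
public class EditorActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        EditText editor = new EditText(this);
        Typeface malayalam =
                Typeface.createFromAsset(getAssets(), "fonts/malayalam.ttf");
        editor.setTypeface(malayalam);
        setContentView(editor);
    }
}
```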
B. Soft Keyboard Interface for Malayalam

A virtual keyboard is implemented for Malayalam text entry. The Malayalam keyboard is invoked (pops up) on detecting a long press on the screen. The keyboard follows the InScript (Indian Script) keyboard layout, but minor modifications to the InScript layout have been made so that the virtual keyboard fits the device screen width (Fig. 4).
Fig. 4. Malayalam virtual keyboard for the Android platform
C. TTS Engine for Android

Android supports C code using the native compiler of the Android Native Development Kit (NDK). Using the NDK we can create a native library, and compiling the speech engine code natively makes the processing faster. In the Android implementation of the Malayalam TTS, the engine is accessed as a native library, and the Unicode string is passed to this native library for generating the speech output.
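A minimal sketch of such a JNI binding is shown below; the library name and the native method signature are hypothetical, as the paper does not give the actual interface.

```java
// Sketch of exposing the natively compiled ESNOLA engine through JNI;
// the library name and method signature here are hypothetical.
public class EsnolaEngine {
    static {
        System.loadLibrary("esnola");   // loads libesnola.so built with the NDK
    }

    // Implemented in C: takes the Unicode input string, writes the
    // synthesized waveform to the given path, returns 0 on success.
    public static native int synthesize(String unicodeText, String outWavPath);
}
```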
D. Wave Play

The wave generated by the engine is played using the Android MediaPlayer object.
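A minimal playback sketch using MediaPlayer is given below, assuming the engine writes its output to a local WAV file whose path is a hypothetical placeholder.

```java
import android.media.MediaPlayer;
import java.io.IOException;

// Minimal playback sketch using the Android MediaPlayer, assuming the
// engine has written its output to a local WAV file.
public class WavePlayer {
    public static void play(String wavPath) throws IOException {
        MediaPlayer player = new MediaPlayer();
        player.setDataSource(wavPath);
        player.setOnCompletionListener(new MediaPlayer.OnCompletionListener() {
            @Override
            public void onCompletion(MediaPlayer mp) {
                mp.release();   // free native resources once playback ends
            }
        });
        player.prepare();   // synchronous prepare is fine for local files
        player.start();
    }
}
```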
E. An Application for Android-Based TTS: News Reader Using ESNOLA TTS for Malayalam

The app can be used as a mobile-based news reader by integrating a feed extractor with the app. On clicking a link, it extracts the news from the online sites and stores the user's preferences. The text extracted from the news sites is given as input to the TTS. The text parsing can also be replaced by a server which is updated daily and on which the contents are stored as categorized text; these files can then be directly downloaded and played on the mobile device (Fig. 5).

Fig. 5. News Reader application using Malayalam TTS for the Android platform

IV. ISSUES IN CURRENT IMPLEMENTATION

Even though Android supports Unicode, rendering support is provided only for Hindi and is not available for the other Indian languages, so additional implementation is required for handling Malayalam Unicode and displaying the contents with proper rendering. Support for bilingual text is not provided in the current implementation: content in English is removed before the text is parsed to the synthesizer, so skipping of words occurs when bilingual content appears in the text. Android does not support proper rendering of Malayalam script and by default does not include Indic fonts, but from version 4.1 onwards it provides support for these languages; the additional code incorporated for font mapping shall be removed in a later version.

V. CONCLUSION AND FUTURE PLANS

This is an initial step towards introducing a regional language (Malayalam) TTS on the Android platform. The TTS engine generates synthetic speech for input from any domain, in a neutral mode. The MOS of the current output is 3.2. The app provides support for inputting Malayalam and for rendering Malayalam scripts. The following modifications will be incorporated to improve the app:

• A soft keyboard conforming to the standard Input Method Editor (IME) framework will be implemented to replace the virtual keyboard
• A rendering engine will be implemented for handling the issues in rendering Malayalam scripts
• Incorporation of a prosodic phrasing module and prosody modelling to generate speech output with naturalness
• Enhancement of the database for generating speech for bilingual text
• Handling of bilingual text in the TTS engine

VI. ACKNOWLEDGEMENTS
We would like to thank Dr Shyamal Kumar Das Mandal for providing all support for the development of the ESNOLA based TTS for Malayalam. We would also like to thank all the members of the Language Technology Centre for the encouragement and motivation given to us during each phase of the project.

VII. REFERENCES
[1] Shyamal Kumar Das Mandal and Asoke Kumar Datta, "Epoch Synchronous Non Overlap Add (ESNOLA) method based concatenative speech synthesis system for Bangla."
[2] V. R. Prabodhachandran Nayar, "Syllabification rules: Swanavij~na:nam."
[3] Anand Arokia Rai, Tanuja Sarkar, Satish Chandra Pammi, Kishore Prahallad and Alan W. Black, "Text processing for text-to-speech systems in Indian languages."
[4] Dhruv Bhasin, "Keyboard mapping and font rendering techniques for non-Latin languages: case of Android mobile phones," thesis, San Diego State University.
[5] R. Ravindra Kumar, K. G. Sulochana and Jose Stephen, "Automatic generation of pronunciation lexicon for Malayalam – a hybrid approach."
[6] H. Zen, K. Tokuda, and A. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
[7] Heiga Zen, Andrew Senior and Mike Schuster, "Statistical parametric speech synthesis using deep neural networks."
[8] A. W. Black and K. A. Lenzo, "Flite: a small fast run-time synthesis engine," in Proceedings of the 4th ISCA Workshop on Speech Synthesis, 2001.
[9] R. Hoffmann et al., "A multilingual TTS system with less than 1 MByte footprint for embedded applications," in Proceedings of ICASSP, 2003.
[10] P. Tsiakoulis, A. Chalamandaris, S. Karabetsos and S. Raptis, "A statistical method for database reduction for embedded unit selection speech synthesis," in Proceedings of ICASSP 2008, March 30–April 4, Las Vegas, Nevada, USA, 2008.
[11] M. Pucher and P. Frohlich, "A user study on the influence of mobile device class, synthesis method, data rate and lexicon on speech synthesis quality," in Proceedings of Interspeech, 2005.