Hindi Spoken Signals for Speech Synthesizer

2016 2nd International Conference on Next Generation Computing Technologies (NGCT-2016) Dehradun, India 14-16 October 2016

G. D. Ramteke

R. J. Ramteke

School of Computer Sciences, North Maharashtra University, Jalgaon, 425 001, MS-India. [email protected]

School of Computer Sciences, North Maharashtra University, Jalgaon, 425 001, MS-India. [email protected]

generation in audible form. These two aspects are used in the interface between text and speech. This interface is the speech synthesizer, one of the best-known speech technologies. A speech synthesis (SS) system translates text into synthetic speech [18].

Abstract—This paper presents a speech synthesis (SS) system for Hindi vowels. The system has two components: text processing and voice generation. Text processing deals with the shape of the characters, and voice generation produces the audible form. A concatenative approach is used to combine the two components in the SS system. Each vowel, together with its representation, is synthesized from the recorded sound samples. The pitch frequencies of the sound signals were extracted with the cepstral pitch detection algorithm in a noiseless environment. The pitch readings of each spoken Hindi sample are evaluated statistically. A Mean Opinion Score (MOS) test was used to evaluate the synthetic male voice. The achieved MOS value lies between fair and good. Overall, a satisfactory speech synthesizer has been developed for Hindi vowels.

The objective of this work is to design a system that converts Hindi text into a synthetic voice. The text may be any vowel together with its representation. The features of the synthetic voice signals are extracted in a noiseless environment using the cepstral pitch detection technique. The pitch readings are summarized by their mean and standard deviation. The paper is organized as follows. Section II reviews related work. Section III briefly discusses the Hindi language. Section IV proposes the speech synthesis system for Hindi text. Section V presents the results and discussion. Finally, Section VI concludes the paper.

Keywords—Prosody; MOS; Text Corpus; Speech Library; Speech Synthesizer.

I. INTRODUCTION

Speech is the most natural way for people to communicate, and spoken language plays a vital role in the communication process. Many spoken languages exist [1], but only a few of them are used in the education system. In India, learning spoken languages is typically the first stage of primary education. Written communication complements spoken communication and covers any interaction that makes use of the written word [2, 3]. Written communication is mostly used because it is easy to preserve, can represent complex matter, provides a permanent record, allows accurate presentation, serves as a reference and is easy to verify.

In the age of communication, many potential problems have been identified for people with and without disabilities, such as regional language differences within one state, hearing problems, voice problems, speech difficulties, low vision, visual impairment and many others. The problems of students in general, and of students with speech difficulties or low vision, can be addressed using speech technology. One type of speech technology combines text and spoken forms [5]. The study of spoken form is based on the assumption. In the general study of spoken communication, the student examines all aspects of the discipline in a variety of contexts. Spoken language understanding is a major issue in various systems of human-machine interaction [6]. A primary education system needs two aspects to be developed: the written form and the spoken form. The area emphasizes effective written and spoken communication [15]. The written form is based on the shape of the characters. Speech form is used for

978-1-5090-3257-0/16/$31.00 ©2016 IEEE

II. RELATED WORK

Voice user interfaces play a vital role in man-machine interaction applications. Speech synthesis is one application of the voice user interface and a popular research domain worldwide. Many researchers and computer scientists have contributed to speech synthesis for commercial and non-commercial use. Bhuvana Narasimhan et al., 2004 described schwa-deletion in Hindi for a text-to-speech system and how to handle the pronunciation component using the concatenative approach [1]. Their example was the Hindi word pronounced {namak} ('salt'), which is represented in the orthography using the consonantal characters for {n}, {m} and {k}. Diemo Schwarz, 2006 described concatenative sound synthesis as applied in the musical domain [2]. Usha Goswami et al., 2008 reviewed teaching in English schools [3]. Naim R. Tyson et al., 2009 reported 95% accuracy for schwa deletion on a small corpus of Hindi words [4]. Pamela Chaudhur et al., 2010 proposed a TTS (Text-to-Speech) system for Telugu vowels using concatenation of pre-recorded speech units [5]. Its MOS (Mean Opinion Score) for intelligibility, obtained from 65 listeners, showed better results. Arun Soman et al., 2011 presented a corpus-driven Malayalam TTS system using the concatenative approach [6]. In that system, spoken utterances were produced automatically from Malayalam text. Madiha Jalil et al., 2011 provided an overview of TTS synthesis, highlighting its digital signal processing component [7]. It studied various techniques of speech


(Vyanjanas). The vowels and consonants are 11 and 33 respectively.

synthesis: rule-based, concatenative and hidden Markov model based. Kwan Min Lee et al., 2011 investigated the effect of user choice on social responses to artificial speech [8]. Lakshmi Sahu et al., 2012 proposed a TTS system for the Hindi, Telugu and Kannada languages [9]. It generated a human voice using the concatenative synthesis technique. Catalin Ungurean et al., 2013 implemented a Romanian TTS system that focused on the intelligibility and naturalness of the synthesized text [10]. Vivek Hanumante et al., 2014 designed an Android application for commercial use that converts English text into speech in English, Hindi and Bengali [11]. Mukta Gahlawat et al., 2014 focused on two parameters: intelligibility and naturalness [12]. The first parameter means that the speech is easily understandable, and the second means that the quality of the speech is close to the human voice. Theophile K. Dagba et al., 2014 developed a speech synthesis system for Fon, a tonal language spoken in Benin [13]. It used the unit-selection technique and was based on letter-to-sound rules. Using the concatenative synthesis technique, an English-Text to Marathi-Speech system was implemented with 28,580 syllables [14]. The syllables are combinations of C (consonant) and V (vowel) units. Fiona S. Baker, 2015 explored text-to-speech software for 24 college students who faced reading difficulties in their freshman year of college [15]. Soumya Priyadarsini et al., 2015 dealt with the artificial production of speech for Indian languages based on the concatenation technique [16]. The existing speech synthesis methods were found to be less effective for the Odia, Bengali and Hindi languages. Smita P. Kawachale et al., 2015 illustrated a new approach that helped to reduce corpus size by a “syllabic-based concatenative technique for speech synthesis in Marathi language” [17].
A TTS synthesis system for Marathi numerals has been designed using the rule-based approach [18]. This system benefits anyone who is able to understand the synthesized number. Saleh M. Abu-Soud, 2016 described a multilingual TTS system based on inductive learning. The system is composed of three phases: an analysis phase, a learning phase and a synthesis phase. It produced phones with a high level of accuracy [19]. In the field of speech synthesis, producing a natural voice while maintaining intelligibility remains an uphill task for researchers and computer scientists.

TABLE I. CLASSIFICATION OF HINDI 12-VOWELS

| Classification of Hindi Vowels | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Short | (a) | (i) | (u) | - |
| Long | (aa) | (ee) | (oo) | - |
| Conjunct | (e) | (ai) | (o) | (ao) |
| Nasal | (an) | - | - | - |
| Visarg | (ah) | - | - | - |

This section covers the vowels [4]. They are organized into five classes: Short, Long, Conjunct, Nasal and Visarg, as shown in Table I. A short vowel is a single vowel (V) in a short word, e.g. ('a') and its representation ('Anaar' in Hindi, English: pomegranate). A long vowel occurs in a short word or syllable that ends with a vowel-consonant (VC) pattern, e.g. ('aa') and its representation ('Aam' in Hindi, English: mango). A conjunct vowel is a combination of short and long vowels [5, 18], e.g. ('e'). A nasal vowel is uttered with a low tone so that air passes through the nose as well as the mouth, e.g. ('an'). The visarg vowel is hardly ever used, e.g. ('ah'). The text corpus was created manually and contains 191 syllables. All Hindi vowels are used for analysis and synthesis in the speech synthesizer model.

IV. PROPOSED SPEECH SYNTHESIS SYSTEM FOR HINDI TEXT

The role of the speech synthesis (SS) system is to produce synthetic human speech from Hindi text. Text processing and voice (waveform) generation are the two pivotal components of the present system.

III. BASICS OF HINDI LANGUAGE

Language is a system of communication by means of sound or written forms. The written form of a language is an attempt to symbolize the spoken language through visual symbols [1]. Languages are used for verbal exchange between two or more people. Hindi is the language most widely used for verbal exchange among people in India [2]. It is one of the 22 constitutional languages of India. Hindi is an official language of India and is widely used across the country. It is an established language used in many private and government sectors in India [3]. A few Indian languages are written in the Devnagari script, and Hindi is one of them. The script is written from left to right. There are 42 letters available. The characters are categorised into two distinct constituents: vowel (Swaras) and consonants

Fig. 1. Model of Speech Synthesis System for Hindi Text

In the text processing component, the shape of each character in the text is identified and analyzed in phonetic form [6, 7]. The text is a vowel and its representation. The other component is voice (waveform) generation, which produces the sound signals. To generate the human voice, the prosodic features of the speech signals, namely pitch and length of speech, are analyzed [8]. The speech signals are accessed from the speech library, as shown in Fig. 1. Both components work within a concatenative approach, which plays the key role of concatenating the text and speech units.

model, quality speech signals are utilized [18]. To find the quality sound units, the unique Id of each text unit is matched with the unique Id of a quality sound unit [11]. All selected sound units are collected and concatenated [12]. The complete speech unit is sent to the audio system, which generates the voice. The audio system converts the digital signals into analog signals that can be understood by humans.
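The matching and concatenation step described above can be sketched as follows. This is only an illustrative sketch: the function name `synthesize`, the dictionary layout of the text corpus and speech library, and the sample values are assumptions made for the example, not the data structures of the presented system.

```python
def synthesize(tokens, text_corpus, speech_library):
    """Concatenate the stored sound units that match each text token.

    text_corpus maps a text unit to its unique Id (hypothetical layout).
    speech_library maps the same unique Id to a list of audio samples.
    """
    output = []
    for token in tokens:
        unit_id = text_corpus.get(token)
        if unit_id is None:
            raise KeyError(f"text unit not found in text corpus: {token!r}")
        if unit_id not in speech_library:
            raise KeyError(f"no sound unit stored for Id {unit_id!r}")
        # Append the matched sound unit to the output signal.
        output.extend(speech_library[unit_id])
    return output


# Hypothetical corpus and library with two units and made-up sample values.
text_corpus = {"a": 1, "aa": 2}
speech_library = {1: [0.1, 0.2], 2: [0.3]}
signal = synthesize(["a", "aa", "a"], text_corpus, speech_library)
```

In this sketch a missing text unit or sound unit raises an error instead of being skipped, so that a gap in the corpus is noticed rather than silently dropped from the output.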

A. Procedure of Speech Synthesis Engine

The speech synthesis engine processes the text using the required speech processing steps. The memory holding the speech samples serves the speech synthesis platform. A rule-based concatenative speech synthesis approach is used to implement the SS model for each Hindi vowel and its representation.

B. Analysis of Prosody

The prosodic analysis model provides the specific duration and pitch information associated with a sequence of sound units in the given text. To collect the sound samples, the speech synthesizer requires a human voice from which the sound signals in the Hindi language are recorded, as shown in Fig. 3. Prosody includes the length of the speech signals, which relates to duration, and the pitch, which relates to the fundamental frequency (F0) of the signal. All speech signals are stored in the speech library. DSP (Digital Signal Processing) is used to extract the prosodic features of the actual output of the speech synthesis system.

The model uses a context-sensitive rewrite rule, which is written as

$$A \rightarrow d \;/\; B \;\_\; C \tag{1}$$

where A is converted to d when it is preceded by B and followed by C. In developed systems, the contexts B and C can be of any length. This approach is used for synthetic voice generation in spite of information difficulties.
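A context-sensitive rewrite rule of this kind can be illustrated with a short sketch. The function name `apply_rewrite_rule` and the token-list representation are assumptions made for the example; the paper does not describe how its rules are stored or applied.

```python
def apply_rewrite_rule(symbols, target, replacement, left, right):
    """Rewrite `target` as `replacement` only in the context left _ right.

    symbols, left and right are lists of tokens. The contexts may be of any
    length, including empty. Contexts are checked against the original
    sequence, not against symbols that were already rewritten.
    """
    result = []
    for i, symbol in enumerate(symbols):
        before = symbols[max(0, i - len(left)):i]
        after = symbols[i + 1:i + 1 + len(right)]
        if symbol == target and before == left and after == right:
            result.append(replacement)
        else:
            result.append(symbol)
    return result


# "A" becomes "d" only when preceded by "B" and followed by "C".
rewritten = apply_rewrite_rule(list("xBACy"), "A", "d", ["B"], ["C"])
unchanged = apply_rewrite_rule(list("xAACy"), "A", "d", ["B"], ["C"])
```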

Fig. 3. Architecture of Prosody (blocks: Text, Prosodic Analysis, Speech Library)

1) Length of Speech Signals

The length of a speech signal is its duration, measured over the frames that make up the signal. It is computed as follows:

$$\text{Duration} = \frac{\text{Sampled data of speech}}{\text{Sampling frequency}} \tag{2}$$

where Duration is the total duration of the speech signal in seconds, the sampled data of speech is the number of samples in the signal, and the sampling frequency is the actual sampling frequency of the speech signal.
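As a small worked illustration of the duration formula, the sketch below divides a sample count by the sampling frequency. The function names and the sample count of 55,787 are assumptions chosen for the example (that count corresponds to roughly 2.53 s at 22,050 Hz); the second helper reads both values from a wave file with Python's standard `wave` module.

```python
import wave


def duration_seconds(num_samples, sampling_frequency_hz):
    """Duration = number of samples / sampling frequency."""
    if sampling_frequency_hz <= 0:
        raise ValueError("sampling frequency must be positive")
    return num_samples / sampling_frequency_hz


def wav_duration_seconds(path):
    """Read the frame count and frame rate of a wave file and apply the formula."""
    with wave.open(path, "rb") as wav_file:
        return duration_seconds(wav_file.getnframes(), wav_file.getframerate())


# Hypothetical sample count for a recording sampled at 22,050 Hz.
example_duration = duration_seconds(55787, 22050)
```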

2) Cepstral Pitch Detection Technique

Pitch detection is a part of prosody. The speech synthesizer needs a continuous fundamental frequency pattern. In this section, the aim is to detect correct pitch (fundamental frequency) values in voiced and voiceless signals [18]. The fundamental frequency is denoted F0.

Fig. 2. Flow Diagram for Speech Synthesizer Based on Concatenative with Context Sensitive Rewrite Rule (blocks: Text in Devnagari Script, Text pre-processing, Syllable, Text Corpus, Comparison, Vowel and its Representation, Searching the Sound Signals Unit, Speech Library, Quality of Speech, Concatenation process of Selected Sound Signals, Converted text in form of sound as output)

Fig. 2 describes the flow of each process. The information is organized into two basic units: text and sound signals. Raw text is fed to the text pre-processing stage, which processes the received text by syllable type [13]. A syllable can be a vowel, a consonant, or a combination of vowels and consonants. The text is divided into a number of tokens, and all tokens are fetched from the text corpus. A token can be a set of syllables [9]. The entire set is sent to the next stage for comparison. At the comparison stage, each unit of text is used to find an exact match, and the exactly matched text is assigned a unique Id. Similarly, each speech unit is labelled with a unique Id. Quality speech is essential for listening, and the speech library holds the original, quality speech signals [10]. In the presented

Fig. 4. Flow Diagram for Cepstral Pitch Detection Algorithm (blocks: Frame-wise Segmentation, Hamming Window, FFT, Log(X), Silence Detector, Non-silence Value, Peak Value, Voice/Voiceless Pitch)

The cepstral technique is used to estimate the speech frequency. A speech signal consists of different frequencies. The lowest frequency of this harmonic series is known as the fundamental frequency or pitch. The periodic excitation generated by the vocal cords passes through the vocal tract filter. Fig. 4 shows the flow of each activity applied to the synthetic voice signals. The voice signals are segmented into frames. A voice signal is a random signal that may be stationary or non-stationary [14]. Each frame is passed through a Hamming window, after which the real cepstrum of the frame can be estimated. The key equations are as follows:

$$\log AB = \log A + \log B \tag{3}$$

$$\text{cepstrum} = \mathrm{IFFT}\left(\log\left|\mathrm{FFT}(S)\right|\right) \tag{4}$$

C. Mean Opinion Score (MOS) Value

In spoken communication the listener plays a vital role, and listeners prefer to hear quality speech. Measuring voice quality is a challenging issue, and the MOS value is used for this purpose. As Table II shows, the MOS is a single number from 0 to 5, where 0 means very poor and 5 means excellent. MOS is a subjective test and needs a team of listeners [6], who should be familiar with the Hindi language. When the synthetic sound is generated by the machine, each listener gives a judgmental score based on the MOS values. Almost all of the listeners understand the Hindi language. There are 11 listeners in total (5 female, 6 male). The listeners provide the scores for the MOS test. There are 11 vowels with their representations, so the total number of opinions is 121 (11 listeners × 11 vowels). The listeners' MOS scores indicate whether or not the system is of adequate quality.

The log function is defined for real values. The FFT (Fast Fourier Transform) is used for pitch measurement [15]. The features are extracted in the frequency domain using the FFT and the logarithm. The equation also involves the inverse Fourier transform (IFFT), which is applied to the logarithm of the spectrum. For the acoustic analysis of the spoken signals, the sampling frequency, FFT size, frame length, frame shift and analysis window are 22.05 kHz (22,050 Hz), 256 points, 30 ms, 10 ms and the Hamming window respectively. In addition, a single mono channel of the speech signal is used for analysis and synthesis. The cepstral peak value is found by detecting the peak in the cepstrum. The voice signals are classified as voiced or voiceless. Voiced segments differ from one another, whereas voiceless segments are very close to the silent part of the speech signal [16, 17]. If a sound segment is voiced, harmonic peaks are obtained. The pitch frequencies characterize an individual according to age and gender. The frequency range (F) used for the male speaker is 50 Hz to 500 Hz. Cepstral pitch detection is useful for the speech synthesis system [18].
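The chain of Hamming window, FFT, logarithm and IFFT described above can be sketched for a single frame as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses NumPy, takes the FFT over the whole 30 ms frame rather than 256 points, omits the silence detector, and the name `cepstral_pitch` and the synthetic 200 Hz test frame are hypothetical.

```python
import numpy as np


def cepstral_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate F0 in Hz for one frame from the peak of its real cepstrum."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))        # Hamming window
    spectrum = np.fft.fft(windowed)                  # FFT of the frame
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)  # small offset avoids log(0)
    cepstrum = np.fft.ifft(log_magnitude).real       # real cepstrum
    # Search only the quefrencies (lags in samples) that map to fmin..fmax.
    q_min = int(fs / fmax)
    q_max = min(int(fs / fmin), len(frame) // 2)
    peak = q_min + int(np.argmax(cepstrum[q_min:q_max + 1]))
    return fs / peak


# Synthetic frame: 30 ms at 22,050 Hz, built from harmonics of 200 Hz.
fs = 22050
n = int(0.030 * fs)
t = np.arange(n) / fs
frame = sum(np.cos(2 * np.pi * 200.0 * k * t) for k in range(1, 51))
estimated_f0 = cepstral_pitch(frame, fs)
```

With a 30 ms frame the longest lag that can be searched is half the frame, so in this sketch the lowest detectable pitch is somewhat above the 50 Hz lower bound. Running the function over successive frames with a 10 ms shift would give a pitch track.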

TABLE II. MEAN OPINION SCORE (MOS) VALUES

| MOS Value | Listening Quality |
|---|---|
| 5 | Excellent |
| 4 | Good |
| 3 | Fair |
| 2 | Poor |
| 1 | Bad |
| 0 | Very Poor |

V. RESULT AND DISCUSSION

To increase the level of intelligibility and to test the quality of the synthesized speech, a prosodic test and an MOS test are performed. The prosodic test is used for extracting the prosodic features. The MOS test depends on human judgment of the speech uttered from the text.

3) Data acquisition for the text and the speech

Data acquisition is the initial step in the text-to-speech domain. It is divided into two broad parts: the text corpus and the speech library. As no standardized text corpus is available for the primary education domain, the text corpus of Hindi vowels was built manually from children's books. All text was collected in phonetic form. The size of the text corpus is 191 syllables. For the speech library, the speech samples were recorded with a microphone using the Sound Recorder of the Windows 7 OS. All sound samples were normalized with the Praat tool, which was created by Paul Boersma and David Weenink for speech analysis, labelling and segmentation of spoken signals [20]. The speech samples are stored in wave file format. The speech synthesizer needs only one speaker, male or female, for implementation; a male speaker is used for this system. The speaker must know the Hindi language. All vowels and their representations were pronounced by the male speaker in a noiseless environment. The speaker is between 25 and 30 years of age. The recordings were made at the School of Computer Sciences, North Maharashtra University, Jalgaon. The speech synthesizer needs clear sound for all listeners. The size of the speech library is 99 samples. All sounds are useful for uttering from text.

A. Prosodic test

In the experiment, the test extracts the prosodic features: the pitch value and the length of the sound signals. The length is the actual duration of a sound signal.

Fig. 5. Noise-Free Speech Waveform for Hindi Word (Mango) Pronounced by a MALE Speaker

The pitch detection features were summarized with two statistical measures: the mean and the standard deviation (SD). The mean is the sum of the fundamental frequency readings divided by the number of readings, where the readings lie in the range defined in the section on cepstral pitch detection. The standard deviation measures the spread of the values within a set of fundamental frequency readings.
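The two statistics can be computed from a list of frame-wise pitch readings as in the sketch below. The readings are made-up values for illustration, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the paper does not state whether the sample or the population form was used.

```python
from statistics import mean, stdev


def summarize_pitch(f0_readings_hz):
    """Return (mean, standard deviation) of a list of F0 readings in Hz."""
    if len(f0_readings_hz) < 2:
        raise ValueError("at least two pitch readings are needed")
    return mean(f0_readings_hz), stdev(f0_readings_hz)


# Hypothetical frame-wise readings within the 50 Hz to 500 Hz search range.
readings_hz = [290.0, 300.0, 310.0]
pitch_mean, pitch_sd = summarize_pitch(readings_hz)
```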

is very clear for intelligibility, that is, how many Hindi vowels are correctly recognized. The speech parameter was given a score that depends on the speaker. The reported score was used to predict the awareness of the speaker.

Fig. 6. Cepstral Pitch Tracking for Hindi Word (Mango) Pronounced by a MALE Speaker

TABLE III. PITCH DETECTION OF HINDI VOWELS USING A MALE VOICE

| Hindi Vowel with their Representation (Voice in English Format) | Length of Sound Signals in Seconds | Cepstral Pitch Detection Technique, Mean (Hz) | Cepstral Pitch Detection Technique, SD (Hz) |
|---|---|---|---|
| A Matlab Anar | 2.53 | 293.12 | 101.25 |
| Aa Matlab Aam | 2.53 | 316.88 | 117.40 |
| I Matlab Imali | 2.47 | 305.30 | 101.83 |
| Ee Matlab Edhan | 2.65 | 307.74 | 112.17 |
| U Matlab Ujala | 2.50 | 310.43 | 110.31 |
| Oo Matlab Oont | 2.23 | 293.37 | 111.41 |
| E Matlab Ek | 2.28 | 280.01 | 103.26 |
| Ai Matlab Ainak | 2.24 | 272.49 | 116.56 |
| O Matlab Oodhani | 2.38 | 314.26 | 117.89 |
| Au Matlab Aurat | 2.19 | 289.02 | 109.18 |
| An Matlab Angur | 2.19 | 296.88 | 103.80 |

TABLE IV. MEAN OPINION SCORE (MOS) VALUES OF LISTENING RATE (LR) FOR HINDI VOWEL USING A MALE GENERATED VOICE

| Hindi Vowel with their Representation (Voice in English Format) | Duration of Sound Signals | Average MOS Score of Listening Quality by 11 Listeners |
|---|---|---|
| A Matlab Anar | 2.53 sec. | 3.72 |
| Aa Matlab Aam | 2.53 sec. | 3.36 |
| I Matlab Imali | 2.47 sec. | 4.09 |
| Ee Matlab Edhan | 2.65 sec. | 4.45 |
| U Matlab Ujala | 2.50 sec. | 3.54 |
| Oo Matlab Oont | 2.23 sec. | 2.90 |
| E Matlab Ek | 2.28 sec. | 3.54 |
| Ai Matlab Ainak | 2.24 sec. | 3.09 |
| O Matlab Oodhani | 2.38 sec. | 4.54 |
| Au Matlab Aurat | 2.19 sec. | 3.27 |
| An Matlab Angur | 2.19 sec. | 2.72 |
| Average of MOS score | | 3.57 |
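The overall average in Table IV can be re-derived from the per-vowel averages, as the short sketch below shows. The dictionary keys and the function name `average_mos` are labels chosen for the example; the scores are the per-vowel values reported in Table IV.

```python
# Per-vowel average MOS scores as reported in Table IV.
mos_by_vowel = {
    "A": 3.72, "Aa": 3.36, "I": 4.09, "Ee": 4.45, "U": 3.54, "Oo": 2.90,
    "E": 3.54, "Ai": 3.09, "O": 4.54, "Au": 3.27, "An": 2.72,
}


def average_mos(scores):
    """Arithmetic mean of the per-vowel MOS scores."""
    values = list(scores.values())
    return sum(values) / len(values)


overall_mos = average_mos(mos_by_vowel)

# 11 listeners each rate 11 vowels, which gives the total number of opinions.
num_listeners = 11
total_opinions = num_listeners * len(mos_by_vowel)
```

Rounded to two decimals, the mean of the eleven per-vowel scores reproduces the reported overall MOS, and the listener count times the vowel count reproduces the reported number of opinions.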

While the fundamental frequency is being detected, the proposed system generates the analog sound signals and the time-domain waveform. Fig. 5 depicts the output in the time domain, and Fig. 6 shows the pitch tracking obtained with the cepstral method for the same pattern. The synthetically generated spoken signals are passed on for pitch detection, during which various undesired signals are ignored. The clean voice signals vary in sound level between low and high tone for each Hindi vowel and its representation, and the pitch readings are evaluated on them. The estimated pitch values of the speech signals are shown in Table III. The mean of the computed pitch lies between 270 Hz and 320 Hz, and the standard deviation lies between 101 Hz and 120 Hz. The table also gives the length of the sound signals in seconds.

B. MOS test

The test is used to assess the present system. Listeners judge whether the listening quality of the synthesized voice is good or bad. To examine the listening quality, a subjective question based on the 6 levels of the MOS value is put to all listeners. The scores obtained from the listeners are reported in Table IV. There were 6 females and 5 males. Each listener gave a score according to the MOS criteria after listening to the artificial voice. The average MOS score was 3.57 out of 5 for the male voice. The quality of listening

C. Comparison with Presented Work

Different authors have evaluated the generation of artificial voice on the basis of various SS models.

TABLE V. A COMPARISON OF THE SPEECH SYNTHESIZERS

| Author Name | Language | Domain Type | Prosodic Parameter | Evaluation of MOS |
|---|---|---|---|---|
| Bhuvana Narasimhan et al., 2004 | Hindi | Schwa-deletion | No | No |
| Pamela Chaudhur et al., 2010 | Telugu | Vowel Classification | No | Intelligibility: Good, Voice Quality: Fair |
| Arun Soman et al., 2011 | Malayalam | General Few Sentences | No | Listening Quality: Good, Intelligibility: Good |
| Lakshmi Sahu et al., 2012 | Hindi & Telugu | Limited Domain | No | Intelligibility: Good |
| Catalin Ungurean et al., 2013 | Romanian | Foreign Name | No | Intelligibility: 99.74% (Very Good) |
| Mukta Gahlawat et al., 2014 | English | Natural Sounding | No | Naturalness: Good, Intelligibility: Fair |
| Smita P. Kawachale et al., 2015 | Marathi | Syllabic based | Formants Detection: (1-3) | Naturalness: Acceptable |
| Saleh M. Abu-Soud, 2016 | Multilingual | Regional Varieties and Local Dialects | No | No |
| Proposed System | Hindi | Primary Education | Cepstral Pitch Detection | Listening Quality: Good |


The work of a few of the authors in the reference list has been compared with the present work. None of them has applied SS technology to the domain of primary education. No benchmark database for the primary education domain is available for the speech dictionary and text library, so a database was created for this work. Table V shows the comparison of the speech synthesizers. The proposed work focuses on the following points: the listening quality of the synthesized speech is rated good; prosodic analysis is performed; the synthesized sound is made more accurate by using a natural human voice; and, like each of the other speech synthesizers, the present work has its own domain, namely basic primary education in the form of Hindi vowels with their representation.

VI. CONCLUSION

A speech synthesizer for Hindi text has been proposed. The present work converts isolated Hindi vowels into synthetic spoken signals. For the implementation, Hindi vowels with their representation were pronounced by a male speaker. The text corpus contains 191 syllables and the speech library contains 99 samples of phones. In addition, 11 listeners gave a total of 121 opinions. The mean pitch of the synthetic sound signals was found to lie between 270 Hz and 320 Hz, and the standard deviation of the pitch readings lay between 101 Hz and 120 Hz. An MOS test was used to score the listening quality. To verify the artificially generated human voice, the average MOS score given by the 11 listeners was 3.57 out of 5, which lies between fair and good for naturalness. The present work therefore produces Hindi vowels that are close to the sound of the human voice. In the comparison, the proposed method performs well against the other techniques and gives good listening quality in the MOS evaluation. The proposed model would be a great help to students, whether visually impaired or not, and teachers can also take advantage of it while teaching in the classroom.

ACKNOWLEDGMENT

The authors would like to thank the Rajiv Gandhi Science and Technology Commission, North Maharashtra University Centre, Govt. of Maharashtra, India for funding the project (Code No. 7-II-DP on 2014) and G. H. Raisoni Doctoral fellowship, North Maharashtra University, Jalgaon (MH-India).

REFERENCES

[1] Bhuvana Narasimhan, Richard Sproat and George Kiraz, “Schwa-Deletion in Hindi Text-to-Speech Synthesis”, International Journal of Speech Technology, vol. 7, pp. 319-333, 2004.
[2] Diemo Schwarz, “Concatenative sound synthesis: The early years”, Journal of New Music Research, vol. 35, no. 1, pp. 3-22, 2006.
[3] Dominic Wyse and Usha Goswami, “Synthetic phonics and the teaching of reading”, British Educational Research Journal, vol. 34, no. 6, pp. 691-710, Dec-2008.
[4] Naim R. Tyson and Ila Nagar, “Prosodic rules for schwa-deletion in Hindi text-to-speech synthesis”, International Journal of Speech Technology, vol. 12, pp. 12-25, 2009.
[5] Pamela Chaudhur and K. Vinod Kumar, “Vowel classification based approach for Telugu Text-to-Speech System using symbol concatenation”, International Conference [ACCTA-2010], vol. 1, issue 2, pp. 183-187, Aug-2010.
[6] Arun Soman, Sachin Kumar S., Hemanth V. K., M. Sabarimalai Manikandan and K. P. Soman, “Corpus Driven Malayalam Text-to-Speech Synthesis for Interactive Voice Response System”, International Journal of Computer Applications (0975-8887), vol. 29, no. 4, Sept-2011.
[7] Madiha Jalil, Faran Awais Butt and Ahmed Malik, “A Survey of Different Speech Synthesis Techniques”, IEEE 7th International Workshop on Systems, Signal Processing and their Applications (WOSSPA), pp. 67-70, May-2011.
[8] Kwan Min Lee, Younbo Jung and Clifford Nass, “Can User Choice Alter Experimental Findings in Human-Computer Interaction?: Similarity Attraction Versus Cognitive Dissonance in Responses to Synthetic Speech”, International Journal of Human-Computer Interaction, vol. 27, no. 4, pp. 307-322, 2011.
[9] Lakshmi Sahu and Avinash Dhole, “Hindi & Telugu Text-to-Speech Synthesis (TTS) and inter-language text Conversion”, International Journal of Scientific and Research Publications, vol. 2, issue 4, pp. 1-5, Apr-2012.
[10] Catalin Ungurean, Dragos Burileanu and Mihai Surmei, “Statistically Augmented Pre-processing or Normalization Module for a Romanian Text-to-Speech System”, IEEE 7th Conference on Speech Technology and Human-Computer Dialogue (SpeD), pp. 1-6, Oct-2013.
[11] Vivek Hanumante, Rubi Debnath, Disha Bhattacharjee, Deepti Tripathi and Sahadev Roy, “English Text to Multilingual Speech Translator Using Android”, International Journal of Inventive Engineering and Sciences (IJIES), vol. 2, issue 5, pp. 4-9, Apr-2014.
[12] Mukta Gahlawat, Amita Malik and Poonam Bansal, “Natural Speech Synthesizer for Blind Persons using Hybrid Approach”, 5th Annual International Conference on Biologically Inspired Cognitive Architectures (BICA), vol. 41, pp. 83-88, 2014.
[13] Theophile K. Dagba and Charbel Boco, “A Text To Speech system for Fon language using Multisyn algorithm”, 18th International Conference on Knowledge-Based and Intelligent Information & Engineering – KES2014, vol. 35, pp. 447-455, 2014.
[14] Sunil S. Nimbhore, Ghanshyam D. Ramteke and Rakesh J. Ramteke, “Implementation of English-Text to Marathi-Speech (ETMS) Synthesizer”, IOSR Journal of Computer Engineering (IOSR-JCE), vol. 17, no. 1, pp. 34-43, 2015.
[15] Fiona S. Baker, “Emerging Realties of Text-to-Speech Software for Nonnative-English-Speaking Community College Students in the Freshman Year”, Community College Journal of Research and Practice, vol. 39, pp. 423-441, 2015.
[16] Soumya Priyadarsini Panda and Ajit Kumar Nayak, “An efficient model for text-to-speech synthesis in Indian languages”, International Journal of Speech Technology, vol. 18, issue 3, pp. 305-315, Jan-2015.
[17] Smita P. Kawachale and J. S. Chitode, “Position based syllabification and objective spectral analysis in Marathi text to speech for naturalness”, International Journal of Speech Technology, vol. 18, issue 3, pp. 367-386, Feb-2015.
[18] G. D. Ramteke and R. J. Ramteke, “Text-To-Speech Synthesis of Marathi Numerals”, International Journal of Engineering and Technical Research (IJETR), ISSN: 2321-0869 (O), vol. 3, issue 7, pp. 360-367, Jul-2015.
[19] Saleh M. Abu-Soud, “ILA Talk: A New Multilingual Text-to-Speech Synthesizer with Machine Learning”, International Journal of Speech Technology, vol. 19, no. 1, pp. 55-64, Nov-2016.
[20] Paul Boersma and David Weenink, “Praat: doing phonetics by computer”, Available link: http://www.fon.hum.uva.nl/praat/