DEVELOPMENT OF BILINGUAL TTS USING FESTVOX FRAMEWORK

By
M. S. SARANYA
Reg No. 312211412015
of
SRI SIVASUBRAMANIYA NADAR COLLEGE OF ENGINEERING

A PROJECT REPORT
Submitted to the
FACULTY OF INFORMATION TECHNOLOGY
In partial fulfilment of the requirements for the Phase-II Project of
MASTER OF ENGINEERING IN COMPUTER AND COMMUNICATION

ANNA UNIVERSITY CHENNAI 600 025
JULY 2013


BONAFIDE CERTIFICATE

Certified that this project report titled "Development of Bilingual TTS using Festvox Framework" is the bonafide work of M. S. SARANYA (Reg. No. 312211412015), who carried out the project under my supervision. Certified further that, to the best of my knowledge, the work reported herein does not form part of any other project report or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate.

SUPERVISOR
Dr. T. Nagarajan,
Head of the Department & Professor,
Department of Information Technology,
SSN College of Engineering,
Kalavakkam – 603 110

HEAD OF THE DEPARTMENT
Dr. T. Nagarajan,
Head of the Department & Professor,
Department of Information Technology,
SSN College of Engineering,
Kalavakkam – 603 110

Submitted for the project Viva-Voce examination held on ………………….

Internal Examiner

External Examiner


ABSTRACT

Given a text in one language, a Text-to-Speech (TTS) system is expected to produce a high quality speech signal, corresponding to the text provided, which is intelligible to a human listener. Numerous research works have been carried out in the past to build high quality speech synthesizers for a single language. In a country like India, where several languages co-exist, having a speech synthesizer for one language is not sufficient. As a specific case, if we consider Tamil Nadu, most people use two languages, namely Tamil and English, for communication, and the usage of English words in a Tamil sentence has become almost unavoidable. A TTS system developed for Tamil alone cannot handle such texts. To handle these issues, a bilingual TTS system has been developed in this project using the Festvox and HTS frameworks, specifically for two languages, namely Tamil and Indian-English (i.e., English in our accent). In a given text, the following two cases are to be handled: (i) text with words from only one of the considered languages and (ii) text with words from both languages. In the development of such a system, the following issues have been addressed:
• The text in both languages has been normalized.
• An appropriate sub-word unit has been selected for each language.
• The database has been pruned to reduce the memory requirement and computational complexity.
In the proposed work, apart from the development of the text-to-speech synthesis system, the main focus has been on reducing the footprint size of the system by optimally selecting the sub-word unit. Different bilingual TTS systems are developed by sharing acoustically similar sub-word units between the considered languages, and one system has been identified as the best, based on perceptual clarity and memory occupied.


திட்டப்பணிச்சுருக்கம் (PROJECT ABSTRACT IN TAMIL)

A system that converts a sentence given in one language into speech that a human listener can clearly understand is called a text-to-speech system. Over the past several years, considerable research has been carried out to build a good speech synthesizer. In countries like India, where many languages are in common use, a system that converts text of only one language into speech is not sufficient; a system that can convert sentences containing words of several languages into speech has become indispensable. In particular, Tamils in Tamil Nadu do not use Tamil alone: English words have become unavoidable in almost every Tamil sentence. Therefore, the primary aim of this project is to build a system that converts Tamil and English sentences from their written form into speech in the Tamil speakers' accent. The project has been implemented with the help of the Festival, Festvox and HTS frameworks. A sentence given as input to such a system may contain words of only one of the languages, or words of both languages, and the bilingual text-to-speech system developed here should handle both kinds of sentences well. As the second part of this project, the footprint required by the system that handles bilingual sentences should be small enough for it to be used even on hand-held devices with limited storage. To achieve this second goal, a suitable sub-word unit has to be chosen; then, by merging the acoustically similar sub-word units common to the two languages, the second objective can be attained.


ACKNOWLEDGEMENT

I thank Almighty God, who gave me the wisdom to complete this project. My sincere thanks to our beloved founder Mr. Shiv Nadar, Chairman, HCL Technologies. I also express my sincere thanks to our Principal, Dr. S. Salaivahanan, for all the help he has rendered during this course of study. My heartfelt gratitude goes to Dr. T. Nagarajan, Professor and Head of the Department, IT, for his words of advice and encouragement; I am deeply obliged and indebted for all the help and guidance he has provided me. I also express my hearty gratitude to the project coordinator, Dr. S. Srinivasan, Professor, IT, and the professors of the department for their scholarly guidance. I also thank all the faculty of the IT department for their kind advice, support and encouragement, and last but not least I thank my parents and my friends for their moral support and valuable help.


TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1. INTRODUCTION
1.1 SPEECH SYNTHESIS
1.1.1 UNIT SELECTION BASED SPEECH SYNTHESIS SYSTEM
1.1.2 STATISTICAL PARAMETRIC SPEECH SYNTHESIS APPROACH
1.2 TEXT TO SPEECH
1.3 OBJECTIVE AND MOTIVATION
1.4 OVERVIEW
1.5 ABOUT USED TOOLS AND FRAMEWORKS
1.5.1 FESTIVAL AND FESTVOX FRAMEWORK
1.5.2 HMM TOOL KIT (HTK)
1.5.3 HMM BASED SPEECH SYNTHESIS SYSTEM (HTS)
1.6 ADVANTAGE OF HTS OVER UNIT SELECTION TECHNIQUE
2. LITERATURE SURVEY
3. SYSTEM BUILDING
3.1 DATA COLLECTION
3.2 NORMALIZING TEXT DATA
3.3 SEGMENTATION AND LABELING
3.4 MANUAL CORRECTION
3.5 GENERATING MLF
3.6 CREATING PHONE LIST
3.7 PREPARING QUESTION SETS
3.8 GENERATING UTTERANCES
3.9 HTS TRAINING AND SYNTHESIS PHASE
3.9.1 TRAINING PHASE
3.9.2 DECISION TREE BASED CLUSTERING
3.9.3 SYNTHESIS PHASE
3.10 BUILDING BILINGUAL TTS FOR TAMIL AND ENGLISH
3.10.1 NEED FOR PHONEME MERGING
3.11 BUILDING BILINGUAL TTS WITH BLINDLY MERGED PHONEMES
3.11.1 NEED FOR FINDING ACOUSTICAL SIMILARITIES AMONG PHONEMES
4. MODEL TRAINING AND VERIFICATION
4.1 VERIFYING TRAINED MODELS
4.1.1 ISOLATED STYLE RECOGNITION
5. METRICS
5.1 CHERNOFF BOUND
5.2 BHATTACHARYYA BOUND
5.3 KULLBACK LEIBLER DISTANCE (KLD)
5.4 PRODUCT OF GAUSSIAN (POG)
5.5 JUSTIFICATION FOR USING POG
6. PHONEME ANALYSIS USING POG
6.1 ANALYSIS 1: LANGUAGE I EXAMPLES WITH TWO LANGUAGE MODELS
6.2 ANALYSIS 2: LANGUAGE II EXAMPLES WITH TWO LANGUAGE MODELS
7. CONCLUSION
8. FUTURE WORK
BIBLIOGRAPHY


LIST OF FIGURES

FIGURE 1 OVERVIEW OF GENERAL UNIT SELECTION SCHEME
FIGURE 2 HTK OVERVIEW
FIGURE 3 HMM BASED SPEECH SYNTHESIS SYSTEM
FIGURE 4 SCREEN SHOT: MANUAL CORRECTION
FIGURE 5 QUESTION SET
FIGURE 6 SCREEN SHOT: PROMPT
FIGURE 7 SCREEN SHOT: UTTERANCE
FIGURE 8 DECISION TREES FOR CONTEXT CLUSTERING
FIGURE 9 SCREEN SHOT: ESTIMATED MODEL
FIGURE 10 SCREEN SHOT: RE-ESTIMATED MODEL
FIGURE 11 SCREEN SHOT: INVENTORY OF "A"
FIGURE 12 SCREEN SHOT: WORD DICTIONARY
FIGURE 13 SCREEN SHOT: PHONEME DICTIONARY
FIGURE 14 SCREEN SHOT: NETWORK STRUCTURE
FIGURE 15 ISOLATED STYLE RECOGNITION PROCESS AND OUTPUT
FIGURE 16 GAUSSIAN OVERLAPS
FIGURE 17 GAUSSIAN CURVE CONSTRUCTION FROM EXAMPLES
FIGURE 18 GAUSSIAN OVERLAP
FIGURE 19 ANALYSIS 1: TAMIL EXAMPLES WITH LANGUAGE MODELS
FIGURE 20 TAMIL PHONEME CLASSIFICATION
FIGURE 21 BLOCK DIAGRAM OF ANALYSIS I
FIGURE 22 ANALYSIS 2: ENGLISH EXAMPLES WITH LANGUAGE MODELS
FIGURE 23 ENGLISH PHONEME CLASSIFICATION
FIGURE 24 BLOCK DIAGRAM OF ANALYSIS II
FIGURE 25 DEFAULT STATE TYING IN HTS
FIGURE 26 FORCIBLE STATE TYING


LIST OF TABLES

TABLE 1 RELATION BETWEEN UNIT SELECTION AND GENERATION APPROACHES
TABLE 2 LIST OF TAMIL AND ENGLISH PHONEMES
TABLE 3 REFERENCE VALUES OF ANALYSIS 1
TABLE 4 POG RESULT OF TAMIL STOP CONSONANTS /B/ AND /D/
TABLE 5 POG RESULT OF TAMIL PHONEMES
TABLE 6 REFERENCE VALUES OF ANALYSIS II
TABLE 7 POG RESULT OF ENGLISH SEMI-VOWELS
TABLE 8 POG RESULT OF ENGLISH PHONEMES
TABLE 9 COMMON PHONEMES IN RESULTS OF ANALYSIS I AND II
TABLE 10 PHONE LIST BASED ON ANALYSIS I & II
TABLE 11 MOS RESULTS OF ALL BILINGUAL SYSTEMS
TABLE 12 MEMORY OCCUPIED BY ALL FOUR SYSTEMS


LIST OF ABBREVIATIONS

TTS – Text To Speech
HMM – Hidden Markov Model
HTK – HMM Tool Kit
HTS – HMM based Speech Synthesis System
MFCC – Mel Frequency Cepstral Coefficients
MLF – Master Label File
MGCC – Mel Generalized Cepstral Coefficients
SLF – Standard Lattice Format
USS – Unit Selection based Speech Synthesis
IPA – International Phonetic Alphabet
POG – Product Of Gaussian
KLD – Kullback Leibler Divergence
QS – Question Set


CHAPTER 1

1. INTRODUCTION

In the last few decades we have witnessed tremendous progress in the spread of speech processing applications. Speech signals are composed of a sequence of sounds. Speech sounds are sound waves created in a moving stream of air. The air is expelled from the lungs and, while passing through the larynx, is chopped by the two vocal cords and proceeds upwards. This moving stream can pass through the nasal cavity and emerge through the nose, or it can pass through the oral cavity and come out through the mouth. The generation of such a sequence of sounds by humans is termed speech production (Lawrence R. Rabiner).

1.1 SPEECH SYNTHESIS

Speech synthesis is the process of applying engineering approaches to build "talking" machines that use a sequence of sub-word units. The use of such units provides the flexibility in vocabulary required by applications such as text to speech. Speech synthesis can be carried out in two different ways, namely (1) Unit Selection based Speech Synthesis (USS) and (2) the statistical parametric speech synthesis technique.

1.1.1 UNIT SELECTION BASED SPEECH SYNTHESIS SYSTEM

This is an approach to generate natural sounding synthesized speech waveforms by selecting and concatenating units from a large speech database (A. J. Black). The text is given as input to the USS system. First, it transforms this input into the string of phonemes required to synthesize the text, which is annotated with prosodic features such as pitch, duration, etc. (A. J. Black). The most important issue in making this concatenative synthesis approach work is the selection of appropriate units from the database. To get better performance, the database size may increase exponentially as the number of units to be concatenated increases. This method can be realised using the Festival framework.


FIGURE 1 OVERVIEW OF GENERAL UNIT SELECTION SCHEME

1.1.2 STATISTICAL PARAMETRIC SPEECH SYNTHESIS APPROACH

Another method for speech synthesis is the statistical parametric speech synthesis approach. This method is a direct contrast to USS. Statistical parametric speech synthesis might be most simply described as generating the average of some set of similarly sounding speech segments. This contrasts directly with the need in unit selection synthesis to retain natural unmodified speech units, but using parametric models offers other benefits. First, in a statistical parametric speech synthesis system, parametric representations of speech, including spectral and excitation parameters, are extracted. Then these features are modeled by a set of generative models (e.g., HMMs). Finally, a speech waveform is reconstructed from these representations of speech. Although any generative model can be used, HMMs have been widely used. Statistical parametric speech synthesis with HMMs is commonly known as HMM-based speech synthesis (Heigen Zen).

1.2 TEXT TO SPEECH

A Text-to-Speech (TTS) synthesis system is an interactive application of speech processing which converts the input, entered in the form of text, into a speech utterance. This synthesis process involves many steps (Ben Gold).

Figure 1 is cited from (Heigen Zen); solid lines represent target costs and dashed lines represent concatenation costs.

The following are a few basic components of this process.
1. Text preprocessing: this step translates the text string of characters into a new string with resolved ambiguities.
2. Text-to-phonetic translation: the processed text is parsed to determine its phonetic structure. A sequence of modifications to canonical pronunciations is typically encoded in a series of rewrite rules.
3. Signal generation: given a sequence of quantities such as pitch, duration and amplitude, along with the unit labels, the signal processing component generates speech.
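As a rough illustration of the first two components, the following toy sketch normalizes a text string and maps the resulting words to a phoneme sequence using simple table lookups. All rules, the mini-lexicon and the function names are invented for illustration and are not part of the actual system described in this report.

import re

# Toy normalization rules: expand a few non-standard words.
# These rules and the mini-lexicon below are illustrative only.
NUMBERS = {"1": "one", "2": "two", "3": "three"}
ABBREVIATIONS = {"dr.": "doctor", "kb": "kilobytes"}

# Tiny grapheme-to-phoneme lexicon (hypothetical entries).
LEXICON = {
    "doctor": ["d", "o", "k", "t", "er"],
    "one": ["w", "a", "n"],
    "speech": ["s", "p", "i", "ch"],
}

def normalize(text):
    """Step 1: resolve simple ambiguities (case, numbers, abbreviations)."""
    words = []
    for token in re.findall(r"[\w.]+", text.lower()):
        token = ABBREVIATIONS.get(token, token)
        token = NUMBERS.get(token, token)
        words.append(token)
    return words

def to_phonemes(words):
    """Step 2: look up each word in the lexicon; unknown words fall back to letters."""
    phones = []
    for w in words:
        phones.extend(LEXICON.get(w, list(w)))
    return phones

if __name__ == "__main__":
    words = normalize("Dr. 1 speech")
    print(words)               # ['doctor', 'one', 'speech']
    print(to_phonemes(words))  # ['d', 'o', 'k', 't', 'er', 'w', 'a', 'n', 's', 'p', 'i', 'ch']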

1.3 OBJECTIVE AND MOTIVATION

Although text-to-speech synthesis systems exist for English, they are built for American and British accents and not for the Indian accent. Also, so far, no TTS synthesis system exists for Tamil. Here, particularly in Tamil Nadu, both Tamil and English words are used very frequently by everyone in their daily life. With this scenario in mind, English and Tamil are the two languages chosen for the bilingual text-to-speech synthesis system. The main objective of the project is to develop a text-to-speech synthesis system which takes text of two different languages as input and synthesizes a spoken waveform as output, while keeping the database size as small as possible. The entered text may be completely in any one of the considered languages, or it may contain text from both languages simultaneously. TTS systems can be built with ease, for any language, in the FestVox framework in combination with Festival, either by using the unit selection based speech synthesis technique (Section 1.1.1) or by using the statistical parametric synthesis technique, wherein hidden Markov models are used as statistical models for the phonemes to be uttered (Section 1.1.2). Using these two, the main focus of this project is to build a bilingual speech synthesis system for the languages Tamil and Indian-English.


1.4 OVERVIEW

In order to extend the existing Tamil FestVox-based voice to include an English voice, first the required text data in English and the corresponding speech data have to be collected. After collecting the speech data, it has to be prepared to make it suitable for building the voice. Here, data preparation means segmenting the speech data into a sequence of sub-word units. Since, for the existing Tamil voice, the syllable is the sub-word unit, the English data also has to be segmented into syllables. This procedure is technically termed deriving a time-aligned phonetic transcription. After deriving this, linguistic features for all the unique sub-word units (here, syllables) have to be defined and the English synthesizer has to be trained.

The footprint size of the existing Tamil TTS system itself is very high, in the sense that the memory requirement is around 2 GB. If English data is also included in the system, the memory will increase further and, in general, the memory requirement for a multilingual speech synthesizer is equal to the number of languages multiplied by the memory requirement of an individual synthesizer. Our second objective here is to reduce the footprint size required for a bilingual synthesizer. Considering the syllable as the sub-word unit for English mandates twice the memory; if we consider a smaller sub-word unit, say the phoneme, the memory required is reduced significantly. Further, instead of using the FestVox framework for building the bilingual synthesizer, if we use a statistical parametric synthesizer, which is a hidden Markov model based technique, the memory requirement is expected to be considerably reduced (of the order of MB instead of GB) with a marginal degradation in the quality of the synthetic speech.

In this project, in order to achieve the first objective, the English voice will be built considering the phoneme as the sub-word unit to be concatenated. Further, to achieve the second objective, a hidden Markov model based speech synthesis system (HTS) will also be built for the languages Tamil and Indian-English. As a first step, we have decided to build HTS-based voices (for which the hidden Markov model toolkit is required) for Tamil and English separately, and then to make the system bilingual. Subsequently, after successful completion of this work, FestVox-based voices for these languages will be built.


1.5 ABOUT USED TOOLS AND FRAMEWORKS

In this project we used two frameworks and two toolkits, namely Festival, Festvox, HTK and HTS. As mentioned in Section 1.1.1, Festival and FestVox can be used for the concatenative speech synthesis technique.

1.5.1 FESTIVAL AND FESTVOX FRAMEWORK

Festival offers a general framework for building speech synthesis systems, as well as including examples of various modules. As a whole it offers full text to speech through a number of APIs: from the shell level, through a Scheme command interpreter, as a C++ library, and via an Emacs interface. Festvox, on the other hand, is a voice building framework which offers general tools for building unit selection voices in new languages. During synthesis, the appropriate unit is chosen from multiple instances of that unit based on the minimization of the target cost and the concatenation cost. Voices generated by this framework may be run in the Festival speech synthesis system (Sangal).

1.5.2 HMM TOOL KIT (HTK)

HTK is a toolkit for building Hidden Markov Models (HMMs). HMMs can be used to model any time series, and the core of HTK is similarly general-purpose. Initially HTK was designed for building HMM-based speech processing tools, in particular recognizers. As shown in Figure 2 below, there are two major processing stages involved. First, the HTK training tools are used to estimate the parameters of a set of HMMs using training utterances and their associated transcriptions. Second, unknown utterances are transcribed using the HTK recognition tools.

FIGURE 2 HTK OVERVIEW


Figure 2 is cited from the HTK book (Steve Young)


1.5.3 HMM BASED SPEECH SYNTHESIS SYSTEM (HTS)

HTS stands for H-Triple-S, which means HMM based Speech Synthesis System. As mentioned in Section 1.1.2, although many speech synthesis systems can synthesize high quality speech, they still cannot synthesize speech with various voice characteristics such as speaker individualities, speaking styles, emotions, etc. To obtain various voice characteristics in speech synthesis systems based on the selection and concatenation of acoustic units, a large amount of speech data is necessary; however, it is difficult to collect and store such speech data. In order to construct speech synthesis systems which can generate various voice characteristics with a minimal amount of speech data, the HMM-based speech synthesis system (HTS) is used.

FIGURE 3 HMM BASED SPEECH SYNTHESIS SYSTEM

The figure above shows the HTS system overview. In the training part, spectrum and excitation parameters are extracted from the speech database and modeled by context dependent HMMs. In the synthesis part, context dependent HMMs are concatenated according to the text to be synthesized. Then spectrum and excitation parameters are generated from the HMMs by using a speech parameter generation algorithm. Finally, the excitation generation module and the synthesis filter module synthesize the speech waveform using the generated excitation and spectrum parameters. The attraction of this approach is that the voice characteristics of synthesized speech can easily be changed by transforming the HMM parameters (Keiichi Tokuda).

Figure 3 is cited from (Keiichi Tokuda).

1.6 ADVANTAGE OF HTS OVER UNIT SELECTION TECHNIQUE

Even though the unit selection based speech synthesis system produces better quality than the buzzy output of the HTS system, HTS has many advantages over the USS system, such as smooth and stable output, support for various voices and, most importantly, small run-time data. USS can produce only a fixed voice and needs large run-time data; it also exhibits discontinuities due to concatenation in many cases (Keiichi Tokuda).

TABLE 1 RELATION BETWEEN UNIT SELECTION AND GENERATION APPROACHES

UNIT SELECTION                                   HTS
Clustering (possible use of HMM)                 Clustering (use of HMM)
Multi-template                                   Statistics
Single tree                                      Multiple trees (spectrum, F0, duration)
Advantage: high quality at waveform level        Advantage: smooth, stable, small run-time data, various voices
Disadvantage: discontinuity, hit or miss,        Disadvantage: vocoded speech (buzzy)
large run-time data, fixed voice

This table is cited from the paper (Keiichi Tokuda).


CHAPTER 2

2. LITERATURE SURVEY

Before 1980, research in speech synthesis was limited to the large laboratories that could afford to invest the time and money in hardware. By the mid-80s, more labs and universities started to join in as the cost of the hardware dropped. By the late eighties, purely software synthesizers became feasible; the speech quality was still decidedly inhuman (and largely still is), but it could be generated in near real time. Recently many researchers have shown interest in the field of speech processing and have carried out numerous laboratory experiments and studies to illuminate this field. Some of their findings and suggestions are reviewed here.

The Festival Speech Synthesis System, developed by the University of Edinburgh, is a free, multi-lingual speech synthesis workbench that runs on multiple platforms, offering black-box text to speech as well as an open architecture for research in speech synthesis. It is designed as a component of larger speech technology systems and hence provides a number of APIs. It can be used simply to synthesize text files, although users who find command line interfaces quaint may find that this is not what they are looking for. As mentioned by the developers of the Festival framework, it works based on the unit selection synthesis technique. After the demonstration of general unit selection synthesis in English in Rob Donovan's PhD work [donovan95] and ATR's CHATR system ([campbell96] and [hunt96]), by the end of the 90's unit selection had become a hot topic in speech synthesis research. However, despite examples of it working excellently, generalized unit selection is known for producing very bad quality synthesis from time to time. As the optimal search and selection algorithms used are not 100% reliable, both high and low quality synthesis is produced, and many difficulties still exist in turning general corpora into high-quality synthesizers as of this writing (Alan W. Black). We can now build synthesizers of (probably) any language that can produce recognizable speech, with a sufficient amount of work; but if we are to use speech to receive

information as easily when we are talking with computers as we do in everyday conversation, synthesized speech must be natural, controllable and efficient. The Festvox project aims to make the building of new synthetic voices more systematic and better documented, making it possible for anyone to build a new voice. It is distributed under a free software license similar to the MIT License. Festvox is a suite of tools for building synthetic voices for Festival. It includes a step-by-step tutorial with examples in a document called "Building Synthetic Voices".

The HMM-based Speech Synthesis System (HTS), developed by Keiichi Tokuda, Keiichiro Oura and others at the Nagoya Institute of Technology, has two phases, as mentioned in Section 1.5.3. An immense number of works have been carried out with HTS, and considerably good quality and promising results can be achieved with this technique. Apart from synthesis, HTS also plays a vital role in adaptation and recognition. Speaker adaptation work has been carried out by Masatsune Tamura, Takashi Masuko, Keiichi Tokuda and Takao Kobayashi, published as "Speaker Adaptation for HMM-based Speech Synthesis System Using MLLR".

The Bell Labs Approach is the first monograph-length description of the Bell Labs work on multilingual text-to-speech synthesis. Every important aspect of the system is described, including text analysis, segmental timing, intonation and synthesis. There is also a discussion of evaluation methodologies, as well as a chapter outlining some future areas of research. While the book focuses on the Bell Labs approach to the various problems of converting text into speech, other approaches are discussed and compared. Thus, this book serves both the function of providing a single reference to an important strand of research in multilingual synthesis and of providing a source of information on current trends in the field.

In a paper titled "From Multilingual to Polyglot Speech Synthesis" by Christof Traber, Karl Huber et al., a distinction between existing multilingual synthesis systems and mixed-lingual or polyglot synthesis systems is discussed. Polyglot people are not only able to switch between languages, they can do so in the midst of a sentence and can recognize when a language change is required.

In pure monolingual speech synthesis, any foreign word must be adapted to the monolingual phonetic alphabet. In simple multilingual speech synthesis, on the other hand, distinct algorithms and distinct databases are used for every language. Switching language means switching the synthesis program or process; this can only be done between sentences, and language switching is usually accompanied by voice switching. Polyglot speech synthesis, also known as mixed-lingual speech synthesis (with a preselected primary language), identifies the main language of a document as the primary language of the synthesizer (either automatically or by preselection). The synthesizer uses this as the default language until a foreign language word or expression is encountered, which is then pronounced appropriately. Some assimilation may take place, and the intonation will be more or less assimilated to the primary language. Foreign language words are recognized by lexical means, phonetic/compositional analysis and by using sub-grammars for recognizing foreign language constructs (Christof Traber).

S. Saraswathi and R. Vishalakshy proposed a TTS system for Tamil, Telugu and English in 2013 in a paper titled "Design of Multilingual Speech Synthesis System". They used concatenative speech synthesis, where segments of recorded speech are concatenated to produce the desired output. They also applied prosody modeling, which makes the synthesized speech sound more like human speech, and smoothing to smooth the transitions between segments in order to produce continuous output. The Optimal Coupling algorithm is enhanced to improve the performance of the speech synthesis system (S. Saraswathi).

A bilingual text-to-speech synthesis system for Hindi and Telugu voices was developed using Festvox by S. P. Kishore, Rajeev Sangal and M. Srinivas in the proceedings of ICON-2002, as part of a "Talking Tourist Aid" application for hand-held devices. Since the vocabulary needed there is not very large, they built it using the USS method with a limited recorded voice database.


CHAPTER 3

3. SYSTEM BUILDING

The first objective of this work is to build a bilingual text-to-speech synthesis system with Tamil and Indian-accent English as the primary languages. To achieve this first goal, HTS is used, since it gives promising results and has many advantages over the unit selection based speech synthesis technique, as summarised in Table 1 and Section 1.6. The system building involves a sequence of procedures, each of which is discussed below in an orderly fashion.

3.1 DATA COLLECTION

For developing a TTS system, we need recorded speech content, generally called a speech corpus, which serves as the database. The text-to-speech synthesis process needs two types of data, namely 1) text data and 2) speech data. The former is the textual form of the content which is to be recorded to obtain the latter. The data has to be collected for both Tamil and English, and initially one hour of speech data is collected for the text data of each language. The content for Tamil is collected from the Tamil historical novel "Ponniyin Selvan", and for English the CMU Arctic corpus text is used. The collected text data for both languages is then given to a professional polyglot female speaker and recorded in a studio environment at a sampling rate of 16 kHz with 16-bit quantization. This recorded speech serves as the speech corpus for our TTS system.

3.2 NORMALIZING TEXT DATA

After collecting the speech data, the text data was normalized. Normalizing the text data is needed to avoid ambiguity while converting the actual text into a sequence of phonemes. Phonemes are a subset of phones (all possible sounds used in all of the world's languages) designed to cover the set of sounds in one particular language. Any sound of a language can be represented as a phoneme, and a system called the International Phonetic Alphabet (IPA) has been developed by linguists and phoneticians for this purpose. The Tamil text has been transliterated into English. The normalized and transliterated texts of both languages are converted into a sequence of phonemes, because the phoneme is considered as the basic building unit of this bilingual TTS system.
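A minimal sketch of this conversion step is given below, assuming a greedy longest-match lookup from transliterated graphemes to phoneme symbols; the mapping shown is a small invented subset for illustration, not the project's actual transliteration table.

# Toy grapheme-to-phoneme mapping for transliterated text (illustrative subset only).
GRAPHEME_TO_PHONEME = {
    "th": "th", "dh": "dh", "ch": "ch", "sh": "sh",
    "a": "a", "aa": "A", "i": "i", "u": "u", "k": "k", "m": "m", "r": "r",
}

def phonemize(word):
    """Greedy longest-match conversion of a transliterated word to phonemes."""
    phones, i = [], 0
    keys = sorted(GRAPHEME_TO_PHONEME, key=len, reverse=True)
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                phones.append(GRAPHEME_TO_PHONEME[key])
                i += len(key)
                break
        else:
            i += 1  # skip characters with no mapping in this toy table
    return phones

print(phonemize("thamizh"))   # ['th', 'a', 'm', 'i']  ('zh' is not in this toy map)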

3.3 SEGMENTATION AND LABELING

The recorded speech is spliced into sentences; there are 1132 English sentences and 547 Tamil sentences, which constitute approximately an hour of data for each language separately. Each sentence waveform has then been segmented and labeled based on its spectrogram and the audio perception of each segment. Notations are given to each segmented sound in the sentence, resulting in a label file for each wave file. This can be achieved in two ways: either by a flat start or by using the labels of some other corpus that has the same sounds. This process has been carried out by forced Viterbi alignment, which is an adaptive learning algorithm.

3.4 MANUAL CORRECTION

Forced Viterbi alignment is used for segmentation and labeling in the first pass. Approximately five minutes of data are taken from the result of the above mentioned alignment and corrected manually by adjusting the start and end of the sound boundaries using visual (spectrogram) and audio perception of the signal. These five minutes of data should contain at least one sample of every possible sound in the language. Since the forced Viterbi alignment used is an adaptive algorithm, it learns the adjusted segment boundaries and applies them to the same sounds when they are encountered in other sentences. To obtain more appropriate boundaries for the sounds, this procedure has to be repeated a few times (say, five times).


FIGURE 4 SCREEN SHOT: MANUAL CORRECTION

3.5 GENERATING MLF

A master label file, abbreviated MLF, contains all label files in it. For each label file it contains the path of the file from the current working directory, followed by the list of phones in it; termination of each file is indicated by ".". Using a training tool with isolated word data may require the generation of hundreds or thousands of label files, each having just one label entry. Even where individual label files are appropriate, each label file must be stored in the same directory as the data file it transcribes, or all label files must be stored in one directory. MLFs can do two things. Firstly, they can contain embedded label definitions, so that many or all of the needed label definitions can be stored in the same file. Secondly, they can contain the names of sub-directories to search for label files; in effect, they allow multiple search paths to be defined. Both of these types of definition can be mixed in a single MLF.


FIGURE 5 MASTER LABEL FILE
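The following sketch illustrates how such an MLF with embedded label definitions could be assembled from individual label files. The directory and file names ("labels/", "all.mlf") are assumptions for illustration, not the project's actual paths.

import glob
import os

def build_mlf(label_dir="labels", out_path="all.mlf"):
    """Merge individual .lab files into one HTK-style master label file."""
    with open(out_path, "w") as mlf:
        mlf.write("#!MLF!#\n")                                  # MLF header line
        for lab in sorted(glob.glob(os.path.join(label_dir, "*.lab"))):
            mlf.write('"*/%s"\n' % os.path.basename(lab))       # embedded label definition
            with open(lab) as f:
                for line in f:
                    mlf.write(line.rstrip("\n") + "\n")
            mlf.write(".\n")                                    # "." terminates each entry

if __name__ == "__main__":
    build_mlf()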

3.6 CREATING PHONE LIST

As discussed in Section 3.2, the set of unique sound categories used by a language are called phonemes. These phonemes are collected from the label files created as discussed in Section 3.3, by gathering and sorting out the unique phonemes. This results in 39 phonemes for Tamil and 41 phonemes for English, including silence (SIL). The extracted phonemes are listed below.

TABLE 2 LIST OF TAMIL AND ENGLISH PHONEMES

Tamil phonemes (39, including /SIL/):
/a/ /A/ /i/ /I/ /u/ /U/ /e/ /E/ /o/ /O/ /AI/ /AU/ /k/ /g/ /ch/ /j/ /t/ /d/ /th/ /dh/ /p/ /b/ /N/ /n/ /nq/ /qn/ /qN/ /m/ /y/ /r/ /R/ /l/ /L/ /z/ /w/ /s/ /sh/ /h/ /SIL/

English phonemes (41, including /SIL/):
/a/ /A/ /ae/ /ah/ /AI/ /ao/ /AU/ /b/ /ch/ /d/ /dh/ /E/ /eh/ /er/ /f/ /g/ /h/ /i/ /I/ /j/ /k/ /l/ /m/ /nq/ /o/ /oy/ /p/ /qN/ /r/ /s/ /sh/ /t/ /th/ /u/ /U/ /v/ /w/ /y/ /Z/ /zh/ /SIL/
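A minimal sketch of this collection step is shown below; it assumes HTK-style label lines of the form "start end phone", and the directory and output file names are illustrative assumptions.

import glob

def collect_phones(label_dir="labels"):
    """Collect and sort the unique phoneme symbols occurring in the label files."""
    phones = set()
    for lab in glob.glob(label_dir + "/*.lab"):
        with open(lab) as f:
            for line in f:
                fields = line.split()
                if fields:
                    phones.add(fields[-1])   # phone symbol is the last field
    return sorted(phones)

if __name__ == "__main__":
    phone_list = collect_phones()
    print(len(phone_list), "unique phonemes")
    with open("monophones.list", "w") as out:   # assumed output file name
        out.write("\n".join(phone_list) + "\n")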


3.7 PREPARING QUESTION SETS

The HMM-based speech synthesis system uses a tree-based clustering technique to classify different phonemes and to group similar phonemes. Question sets (QS) are to be prepared manually by the user. It is necessary to include questions referring to both the right and left contexts of a phone. The questions should progress from wide, general classifications (such as consonant, vowel, nasal, diphthong, etc.) to specific instances of each phone. Ideally, the full set of questions loaded using the QS command would include every possible context which can influence the acoustic realization of a phone, and can include any linguistic or phonetic classification which may be relevant. There is no harm in creating extra, unnecessary questions, because those which are determined to be irrelevant to the data will be ignored.

FIGURE 5 QUESTION SET
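The sketch below shows the kind of question-set entries meant here, written in the HTK/HHEd QS style for left and right phone contexts; the class memberships are a small illustrative subset, not the project's full question set.

# Illustrative phoneme classes for generating QS entries (subset only).
CLASSES = {
    "Vowel": ["a", "A", "i", "I", "u", "U", "e", "E", "o", "O"],
    "Stop": ["p", "b", "t", "d", "k", "g"],
    "Nasal": ["m", "n", "N", "nq"],
}

def write_questions(path="questions.hed"):
    """Write left- and right-context QS lines for each phoneme class."""
    with open(path, "w") as out:
        for name, phones in CLASSES.items():
            left = ",".join("%s-*" % p for p in phones)    # preceding-phone patterns
            out.write('QS "L_%s" { %s }\n' % (name, left))
            right = ",".join("*+%s" % p for p in phones)   # following-phone patterns
            out.write('QS "R_%s" { %s }\n' % (name, right))

if __name__ == "__main__":
    write_questions()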

3.8 GENERATING UTTERANCES

Utterances have to be generated for each sentence in the speech data. Before generating the utterances (.utt), prompts are generated using the Festival framework. The normalized, phonetically transcribed data is given as an argument to generate the prompts, which also take a lexicon and a tokenizer as other arguments. A prompt acts as a template and splits the entire sentence into phonemes of equal duration. The utterances are then generated by comparing the prompts with the label files, which contain the actual duration of each phoneme in the sentence. If the sequences of phonemes in the prompts and utterances do not match each other, an alignment-mismatch error is thrown during full label generation in HTS.

FIGURE 6 SCREEN SHOT: PROMPT

FIGURE 7 SCREEN SHOT : UTTERANCE

3.9 HTS TRAINING AND SYNTHESIS PHASE

As discussed in Section 1.5.3, HTS has two phases, namely 1) the training phase and 2) the synthesis phase.

3.9.1 TRAINING PHASE

In the training phase, HTS takes the question set and the utterances generated in the Festival framework, along with the wave files of each sentence. The training phase extracts three basic feature streams from the waveforms, namely mgc, lf0 and cmp. The mel-generalized cepstral coefficients (mgc) are extracted by HTS from the waveforms stored in the data directory; in total it extracts 105 mgc coefficients, along with the log fundamental frequency (lf0) and the cmp files. After extracting these features, HTS starts to generate mono labels and full labels. Mono labels are similar to the labels created in Section 3.3, whereas full labels include much other information, such as the previous and next phonemes, their features, the position of the phoneme in the sentence and word, etc. The duration information must then be removed from the generated full label files, so that the phonemes of a dynamically given sentence are not influenced by the durations of the phonemes in the example sentences. A few files are kept aside for synthesis, and about eighty five percent of the sentences in the database are given to the training phase.
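A minimal sketch of the duration-removal step is given below; it assumes full-context label lines of the form "start end full-label", and the directory names are illustrative assumptions.

import glob
import os

def strip_durations(in_dir="labels/full", out_dir="labels/gen"):
    """Drop the start/end time stamps from full-context label files."""
    os.makedirs(out_dir, exist_ok=True)
    for path in glob.glob(os.path.join(in_dir, "*.lab")):
        out_path = os.path.join(out_dir, os.path.basename(path))
        with open(path) as src, open(out_path, "w") as dst:
            for line in src:
                fields = line.split()
                if fields:
                    dst.write(fields[-1] + "\n")   # keep only the full-context label

if __name__ == "__main__":
    strip_durations()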

3.9.2 DECISION TREE BASED CLUSTERING

In the training phase, HTS takes the question set, the extracted mgc, lf0 and cmp features, and the full and mono labels as input, and classifies and clusters the phonemes using tree-based clustering. The questions in the question set serve as the nodes of the tree, and based on these questions similar phonemes are clustered at the leaf level. The decision tree based clustering mechanism provides a similar quality of clustering while offering a solution to the unseen-triphone problem.

FIGURE 8 DECISION TREES FOR CONTEXT CLUSTERING


Figure 8 is cited from (Keiichi Tokuda).

Decision tree-based clustering is invoked by the command TB and uses a log-likelihood criterion. It only supports single-Gaussian continuous density output distributions. The TB command can also be used to cluster whole models. A phonetic decision tree is a binary tree in which a yes/no phonetic question is attached to each node (Steve Young). Depending on each answer, the pool of states is successively split, and this continues until the states have trickled down to the leaf nodes. All states in the same leaf node are then tied.

3.9.3 SYNTHESIS PHASE

In the synthesis part of HTS, an arbitrarily given text to be synthesized is first converted into a context-based label sequence. Second, according to the label sequence, a sentence HMM is constructed by concatenating context dependent HMMs. The state durations of the sentence HMM are determined so as to maximize the output probability of the state durations, and then a sequence of mel-cepstral coefficients and log F0 values, including voiced/unvoiced decisions, is determined in such a way that its output probability for the HMM is maximized, using the speech parameter generation algorithm. The main feature of the system is the use of dynamic features: by including dynamic coefficients in the feature vector, the speech parameter sequence generated at synthesis time is constrained to be realistic, as defined by the statistical parameters of the HMMs. Finally, the speech waveform is synthesized directly from the generated mel-cepstral coefficients and F0 values using the MLSA filter.

3.10 BUILDING BILINGUAL TTS FOR TAMIL AND ENGLISH

By following the procedures discussed so far, HTS based speech synthesis systems have been developed for Tamil and English separately. These systems cannot handle texts containing words of the other language. To achieve the first goal of our project, i.e., developing a bilingual system which takes an input sentence containing words from both languages, the following steps are followed.

1) A new phone list is created for this bilingual TTS by merging all the phonemes of both languages. To differentiate the phonemes of Tamil and English, a language tag is attached. The language tag is simply a letter used to identify the language of the phoneme; here "x" and "v" are used as the tags for English and Tamil phonemes respectively. The new phone list is as follows (a small sketch of this tagging step is given at the end of this section):
xa xA xae xah xAI xao xAU xer … va vA vAI vAU …

2) Separate vowel and CV (consonant-vowel pair) lists have to be formed.

3) New utterances are generated for this phone list using Festival, as discussed in Section 3.8.

4) These utterances are copied, and using them labels are generated in HTS as discussed in Section 3.9.1. Out of the total 1132 and 547 sentences in the English and Tamil corpora, 1112 and 527 sentences are given to the training phase; the remaining sentences are used for the synthesis phase.

5) A question set is prepared for the new tagged phone list as discussed in Section 3.7.

6) Two different systems are built with the same inputs, by including and excluding syllable information.

After the synthesis phase, out-of-domain sentences can also be synthesized by following the same procedure as discussed so far. Syllable information is additional information that describes further features of the phonemes, such as the number of phonemes in a word or a sentence, the position of a phoneme in a word (beginning, middle or end), etc. The sentences synthesized with syllable information have more naturalness and clarity than those synthesized without it. The system without syllable information occupies 3696 KB, whereas the system with syllable information occupies 3972 KB. Even though the inclusion of syllable information increases the memory size, the increase is negligible because the variation is only of the order of a few kilobytes.
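The sketch below illustrates step 1 above, the language tagging and merging of the two phone inventories. The two input lists are small illustrative subsets of the inventories in Table 2, and the handling of /SIL/ as an untagged, shared symbol is an assumption.

# Illustrative subsets of the Tamil and English phone inventories (see Table 2).
tamil_phones = ["a", "A", "AI", "AU", "k", "ch", "m"]
english_phones = ["a", "A", "ae", "ah", "AI", "ao", "AU", "er"]

def tag_and_merge(tamil, english):
    """Prefix Tamil phones with 'v' and English phones with 'x', then merge."""
    tagged = ["v" + p for p in tamil] + ["x" + p for p in english]
    # /SIL/ is kept untagged and shared between the languages (assumption).
    return sorted(set(tagged)) + ["SIL"]

print(tag_and_merge(tamil_phones, english_phones))
# ['vA', 'vAI', 'vAU', 'va', 'vch', 'vk', 'vm', 'xA', 'xAI', 'xAU', 'xa', 'xae', 'xah', 'xao', 'xer', 'SIL']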

3.10.1 NEED FOR PHONEME MERGING

The bilingual system developed in Section 3.10 occupies twice the memory of the TTS systems developed for Tamil and English separately. To achieve the second goal of the project, i.e., to reduce the footprint size, the phonemes of the two languages have to be merged wherever possible.


3.11 BUILDING BILINGUAL TTS WITH BLINDLY MERGED PHONEMES

As discussed in Section 3.10.1, to reduce the footprint size of the bilingual system, the phonemes of both languages have to be merged based on some criteria. First, the phonemes of the two considered languages (Tamil and English) are merged based on representational similarities; that is, some phonemes in both languages were given the same notation, based on perception, during segmentation and labeling (Section 3.3). Assuming these phonemes are similar to each other, a new bilingual system has been built by merging them and sorting out the unique phonemes across both languages. By doing so, a new phone list with 49 phonemes (30 common phonemes + 9 unique Tamil phonemes + 10 unique English phonemes) has been obtained. To develop this bilingual system, the same procedure as in Section 3.10 has to be followed. Here also the systems with and without syllable information show only a minor variation in the memory occupied. The phone list used for building the bilingual system with blindly merged phonemes is listed below.

a A ae ah AI ao AU b ch d dh e E eh er f g h i I j k l L m n N nq o O oy p qn qN r R s sh SIL t th u U v w y z Z zh

3.11.1 NEED FOR FINDING ACOUSTICAL SIMILARITIES AMONG PHONEMES

The bilingual system developed with blindly merged phonemes has some problems, such as clicks and glitches and the influence of one language on the other. For example, in the blindly merged bilingual TTS, the phoneme /AI/ of Tamil and English has been merged; this makes Tamil words containing the /AI/ phoneme be pronounced with the English /AI/ sound. This kind of language influence should not occur. To handle such issues, only the acoustically similar phonemes are to be merged, instead of merging phonemes based on their textual representation. To identify the acoustical similarities between the phonemes of two different languages, metrics are needed to quantify the similarities or dissimilarities between them. Such metrics and the analysis procedures are discussed in Chapters 5 and 6 respectively.

CHAPTER 4

4. MODEL TRAINING AND VERIFICATION

So far, the first goal of synthesizing bilingual speech has been achieved. To achieve the second goal of reducing the footprint size, we need to identify acoustically similar phonemes and merge them. The waveforms of the phoneme examples are not saved; instead, a model of each phoneme is saved. Model training is carried out with the help of HTK. Mel-frequency cepstral coefficients (MFCCs) are extracted from the waveforms; in total 39 MFCC values are extracted. The first thirteen are the actual MFCCs, the second set of thirteen values are the first derivatives of the first set, and the remaining thirteen values are the second derivatives of the first thirteen MFCCs. The first and second derivative values are needed for smoothing the differences between consecutive models used for concatenation during the USS synthesis method. Before model training, a prototype with the number of states and the number of mixture components has to be specified. The number of mixture components and the number of states are user defined, based on the number of examples of each phoneme in the database. Initially the models are estimated with zero mean and unit variance; later the same models are re-estimated, where the actual values of each model are obtained.
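A minimal sketch of assembling such 39-dimensional feature vectors is shown below; it uses a simple first-difference approximation for the derivatives, whereas HTK computes deltas with a regression window, so it is illustrative only.

import numpy as np

def add_deltas(mfcc):
    """mfcc: (num_frames, 13) static coefficients -> (num_frames, 39) features."""
    delta = np.diff(mfcc, axis=0, prepend=mfcc[:1])      # first difference
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])   # second difference
    return np.hstack([mfcc, delta, delta2])

if __name__ == "__main__":
    static = np.random.randn(100, 13)   # 100 frames of dummy 13-dim MFCCs
    features = add_deltas(static)
    print(features.shape)               # (100, 39)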


FIGURE 9 SCREEN SHOT: ESTIMATED MODEL

FIGURE 10 SCREEN SHOT: RE-ESTIMATED MODEL

4.1 VERIFYING TRAINED MODELS

After model training, the trained models are verified by giving all examples of each phoneme from the database to the trained model. This gives a likelihood for each example. Higher likelihoods show that the model has been trained properly, whereas lower likelihoods are evidence of improper models.

For the purpose of verification, the individual examples of each phoneme are passed to the trained models, and each example is recognized by a process called isolated style recognition.

4.1.1 ISOLATED STYLE RECOGNITION

As discussed in Section 4.1, the individual examples of each and every phoneme in the speech corpus are collected separately. The collection of all occurrences of a phoneme in the database is called the inventory of that phoneme. The inventory has the following structure.

FIGURE 11 SCREEN SHOT: INVENTORY OF "A"

Apart from inventories, isolated style recognition also needs a phoneme dictionary and a network. A dictionary literally gives an explanation or meaning for a word. When building a system with the word as the sub-word unit, we build a word dictionary which contains the equivalent sequence of phonemes for each word; only one space should be included between the actual sub-word unit and its definition. Here, however, we use the phoneme as the sub-word unit, and hence each phoneme serves as its own definition. The structures of the word dictionary and the phoneme dictionary are shown below.

FIGURE 12 SCREEN SHOT: WORD DICTIONARY


FIGURE 13 SCREEN SHOT : PHONEME DICTIONARY

As mentioned above, another requirement for isolated style recognition is the phoneme network. A network is defined using the HTK Standard Lattice Format (SLF). An SLF network is just a text file; it can be written directly with a text editor, or a tool can be used to build it. HTK provides two such tools, HBuild and HParse; both take a textual description as input and output an SLF file. Whichever method is chosen, SLF network generation is done off-line and is part of the system build process. An SLF file contains a list of nodes representing words and a list of arcs representing the transitions between words. The transitions can have probabilities attached to them, and these can be used to indicate preferences in a grammar network. They can also be used to represent bigram probabilities in a back-off bigram network, and HBuild can generate such a bigram network automatically. In addition to the SLF file, an HTK recognizer requires a dictionary to supply pronunciations for each sub-word unit in the network and a set of acoustic HMM phone models, which have already been created.

FIGURE 14 SCREEN SHOT: NETWORK STRUCTURE

With the generated inventory, phoneme dictionary and network, the isolated style recognition process is carried out. Isolated style recognition is the process of making the system recognize each and every phoneme exactly when it is given alone, without any other content before or after it. We first have to train the system to recognize every individual phoneme as such, so that it can identify it when it occurs in combination with other phonemes. It can be compared with teaching a person word by word and then making him speak a whole sentence, or recognize individual words when a whole sentence is uttered. This isolated style recognition is achieved with the help of an HTK command, by passing the configuration file, the mfc files, the network, the phoneme dictionary and the unique phone list as arguments. This recognition procedure has to be carried out for all phonemes of both languages separately, and it looks like this.

FIGURE 15 ISOLATED STYLE RECOGNITION PROCESS AND OUTPUT
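The sketch below shows the kind of HVite invocation meant here, wrapped in Python. All file names (configuration file, hmmdefs, wdnet, dictionary, phone list, script file) are assumptions for illustration, and the exact options used in the project may differ.

import subprocess

# Invoke HTK's HVite for isolated-style recognition of phoneme examples.
# Every file name here is an assumed placeholder.
cmd = [
    "HVite",
    "-C", "config",            # configuration file
    "-H", "hmmdefs",           # trained phoneme HMM definitions
    "-S", "test.scp",          # list of .mfc files (one phoneme example per file)
    "-i", "recout.mlf",        # recognition output (recognized phoneme labels)
    "-w", "wdnet",             # SLF network built with HBuild/HParse
    "dict",                    # phoneme dictionary (each phone maps to itself)
    "monophones.list",         # unique phone list
]
subprocess.run(cmd, check=True)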

CHAPTER 5

5. METRICS

To achieve the second goal of reducing the memory size, the acoustically similar phonemes of the different languages have to be identified and merged. In order to identify the acoustically similar phonemes, the similarities and dissimilarities among phonemes have to be measured. Four different metrics were studied for measuring the similarities among the sub-word units (here, phonemes). The most popular and well known metrics used for classifying phonemes based on their features are:
1. Chernoff Bound
2. Bhattacharyya Bound
3. Kullback Leibler Distance
4. Product of Gaussian (POG)

5.1 CHERNOFF BOUND

The simplest case of the Chernoff bound is used to bound the success probability of majority agreement for n independent, equally likely events. Here it is used in the process of finding maximally similar phonemes, even though they are independent. The bound is based on Bayes' theorem and the inequality

$\min(a,b) \le a^{\beta} b^{1-\beta}$,  for $a, b \ge 0$ and $0 \le \beta \le 1$.

Assuming $a \ge b$, we have $b \le a^{\beta} b^{1-\beta} \le a$, so the inequality is always valid. The upper bound on the probability of the minimum error is then

$P(\mathrm{error}) \le P^{\beta}(\omega_1)\, P^{1-\beta}(\omega_2) \int p^{\beta}(x|\omega_1)\, p^{1-\beta}(x|\omega_2)\, dx$.

The integration limits are left open because the integral is taken over the entire feature space, and thus there is no need for boundaries. If we write

$\int p^{\beta}(x|\omega_1)\, p^{1-\beta}(x|\omega_2)\, dx = e^{-k(\beta)}$,

then

$k(\beta) = \frac{\beta(1-\beta)}{2} (\mu_2-\mu_1)^{T} [\beta\Sigma_1 + (1-\beta)\Sigma_2]^{-1} (\mu_2-\mu_1) + \frac{1}{2} \ln \frac{|\beta\Sigma_1 + (1-\beta)\Sigma_2|}{|\Sigma_1|^{\beta} |\Sigma_2|^{1-\beta}}$.

The advantage of the Chernoff bound is that the optimization is done in the one-dimensional $\beta$ space, and it gives exponentially decreasing bounds on the tail distribution of a sum of independent random variables.

5.2 BHATTACHARYYA BOUND

The Bhattacharyya bound is a special case of the Chernoff bound. Since the Chernoff bound depends on the value of $\beta$, the bound is loose for extreme values of $\beta$ ($\beta = 0$ and $\beta = 1$) and tight for intermediate values. The optimality of the bound depends on 1) the parameters of the distributions and 2) the prior probabilities. A slightly looser bound can be obtained with simpler computations by setting $\beta = 1/2$; the Chernoff bound with $\beta = 1/2$ is called the Bhattacharyya bound:

$P(\mathrm{error}) \le \sqrt{P(\omega_1) P(\omega_2)} \int \sqrt{p(x|\omega_1)\, p(x|\omega_2)}\, dx = \sqrt{P(\omega_1) P(\omega_2)}\, e^{-k(1/2)}$,

where

$k(1/2) = \frac{1}{8} (\mu_2-\mu_1)^{T} \left[\frac{\Sigma_1+\Sigma_2}{2}\right]^{-1} (\mu_2-\mu_1) + \frac{1}{2} \ln \frac{\left|\frac{\Sigma_1+\Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}$.

The first three metrics are studied and cited from (Richard O. Duda); the fourth metric is cited from (T. Nagarajan).
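A small numerical sketch of this bound for two multivariate Gaussians is given below; chernoff_k follows the expression for k(β) above, and with β = 1/2 it gives the Bhattacharyya distance. The example values are arbitrary.

import numpy as np

def chernoff_k(mu1, S1, mu2, S2, beta=0.5):
    """k(beta) for two Gaussians N(mu1, S1), N(mu2, S2); beta=0.5 gives Bhattacharyya."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    S = beta * S1 + (1.0 - beta) * S2
    diff = mu2 - mu1
    term1 = 0.5 * beta * (1.0 - beta) * diff @ np.linalg.inv(S) @ diff
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         (np.linalg.det(S1) ** beta * np.linalg.det(S2) ** (1.0 - beta)))
    return term1 + term2

# Example: two 2-D Gaussians; the error bound is P(err) <= sqrt(P1*P2) * exp(-k).
k = chernoff_k([0.0, 0.0], np.eye(2), [1.0, 1.0], 2.0 * np.eye(2))
print(k, np.sqrt(0.5 * 0.5) * np.exp(-k))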

5.3 KULLBACK LEIBLER DISTANCE (KLD)

This measure is based on a distance metric and hence is also called the Kullback-Leibler distance or divergence. It considers the visible units under consideration as the training set: the desired probabilities of the visible units are taken as $Q(\alpha)$ and the actual probabilities of those units as $P(\alpha)$. The Kullback-Leibler divergence is then defined as

$D_{KL}(Q(\alpha), P(\alpha)) = \sum_{\alpha} Q(\alpha) \log \frac{Q(\alpha)}{P(\alpha)}$.

The value of $D_{KL}$ is always non-negative, and it is zero only if $Q(\alpha) = P(\alpha)$. This measure can be used to compute the distance between the phonemes of different languages, and phonemes with a minimal distance can be merged into a single phoneme.
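A minimal sketch of this computation for two discrete distributions is given below; the smoothing constant eps is an assumption added to avoid division by zero.

import numpy as np

def kl_divergence(Q, P, eps=1e-12):
    """Discrete D_KL(Q, P) = sum_a Q(a) * log(Q(a) / P(a))."""
    Q = np.asarray(Q, float) + eps
    P = np.asarray(P, float) + eps
    Q, P = Q / Q.sum(), P / P.sum()          # make sure both are distributions
    return float(np.sum(Q * np.log(Q / P)))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0 when Q == P
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # > 0 otherwise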

5.4 PRODUCT OF GAUSSIAN (POG)

This technique was proposed for bias removal in the log-likelihood space in (T. Nagarajan). There, log-likelihoods are adjusted by modifying the topology of the corresponding model in such a way that the bias between a pair of HMMs is reduced. As discussed in (T. Nagarajan), the bias of any of these models may be visualized in the likelihood space as an overlap between the Gaussian likelihood distributions of different models (classes). Phonemes of two different classes share some features, and the confusion error cannot be avoided.

Let us consider the feature vectors of two different classes $C_i$ and $C_j$ as $x_k^i$ and $x_k^j$. Let $\lambda_i$ and $\lambda_j$ be the models of the classes $C_i$ and $C_j$, respectively, and let the likelihoods of the feature vectors of class $C_i$ for the given models $\lambda_i$ and $\lambda_j$ be $p(x_k^i|\lambda_i)$ and $p(x_k^i|\lambda_j)$ respectively. We can assume that these likelihoods are distributed normally in the likelihood space with suitable parameters. Let these two Gaussians be $N_{ii}(\mu_{ii}, \sigma_{ii}^2)$ and $N_{ji}(\mu_{ji}, \sigma_{ji}^2)$. Similarly, for the feature vectors of the other class $C_j$, the likelihood Gaussians will be $N_{jj}(\mu_{jj}, \sigma_{jj}^2)$ and $N_{ij}(\mu_{ij}, \sigma_{ij}^2)$. The product of Gaussians is

$N_k(\mu_k, \sigma_k^2) = N_{ii}(\mu_{ii}, \sigma_{ii}^2) \cdot N_{ji}(\mu_{ji}, \sigma_{ji}^2)$,

where

$N_{ii} = f\big(p(x_k^i|\lambda_i)\big) = \frac{1}{\sqrt{2\pi}\,\sigma_{ii}}\, e^{-\frac{(p(x_k^i|\lambda_i) - \mu_{ii})^2}{2\sigma_{ii}^2}}$,

$N_{ji} = f\big(p(x_k^i|\lambda_j)\big) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}}\, e^{-\frac{(p(x_k^i|\lambda_j) - \mu_{ji})^2}{2\sigma_{ji}^2}}$.

The mean $\mu_k$ and variance $\sigma_k^2$ of the resultant Gaussian curve are

$\mu_k = \frac{\sigma_{ji}^2 \mu_{ii} + \sigma_{ii}^2 \mu_{ji}}{\sigma_{ii}^2 + \sigma_{ji}^2}$,  $\qquad \sigma_k^2 = \frac{\sigma_{ii}^2 \sigma_{ji}^2}{\sigma_{ii}^2 + \sigma_{ji}^2}$.

To quantify the amount of overlap between the two Gaussians, we define the ratio

$O_{ij} = \frac{\max\big(N_{ii}(\mu_{ii},\sigma_{ii}^2)\cdot N_{ji}(\mu_{ji},\sigma_{ji}^2)\big)}{\max\big(N_{ji}(\mu_{ji},\sigma_{ji}^2)\cdot N_{ji}(\mu_{ji},\sigma_{ji}^2)\big)}$.

If $\mu_{ii} = \mu_{ji}$, the above equation reduces to

$O_{ij} = \frac{\sigma_{ji}}{\sigma_{ii}}$.

We expect the overlap to be equal to 1 when the two phonemes under consideration are exactly the same. To achieve this, $O_{ij}$ has to be normalized, and the normalized form is

$O_{ij}^{N} = O_{ij}\, e^{-\left[ \frac{(\mu_k - \mu_{ii})^2}{2\sigma_{ii}^2} + \frac{(\mu_k - \mu_{ji})^2}{2\sigma_{ji}^2} \right]}$.

This resultant $O_{ij}^{N}$ is used as a measure to estimate the amount of overlap between two Gaussians.

FIGURE 16 GAUSSIAN OVERLAPS
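A small sketch of this overlap computation is given below; it follows the equations reconstructed above, fitting one-dimensional Gaussians to the two sets of log-likelihoods and using the simplified ratio σji/σii together with the exponential normalization, so it is illustrative rather than a definitive implementation. The data in the example are synthetic.

import numpy as np

def pog_overlap(ll_own, ll_other):
    """Normalized POG overlap (in %) between two sets of log-likelihoods.

    ll_own  : log-likelihoods of class C_i examples under their own model (lambda_i)
    ll_other: log-likelihoods of the same examples under the other model  (lambda_j)
    """
    mu_ii, s_ii = np.mean(ll_own), np.std(ll_own)
    mu_ji, s_ji = np.mean(ll_other), np.std(ll_other)
    # Mean of the product Gaussian N_k
    mu_k = (s_ji**2 * mu_ii + s_ii**2 * mu_ji) / (s_ii**2 + s_ji**2)
    # Simplified overlap ratio, normalized by the separation of the means
    o_ij = s_ji / s_ii
    o_ij_n = o_ij * np.exp(-((mu_k - mu_ii)**2 / (2 * s_ii**2) +
                             (mu_k - mu_ji)**2 / (2 * s_ji**2)))
    return 100.0 * o_ij_n

rng = np.random.default_rng(0)
same = rng.normal(-50.0, 2.0, 500)
print(pog_overlap(same, same))                          # ~100: identical likelihood distributions
print(pog_overlap(same, rng.normal(-80.0, 2.0, 500)))   # much smaller: well-separated distributions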


5.5 JUSTIFICATION FOR USING POG

In an existing bilingual system for the English and Mandarin languages, the mean vector of each phoneme model and the Kullback-Leibler distance are used to identify the (dis)similarities among the sub-word units of the corresponding languages (Hui Lang. Yao Qian). In the proposed system we use POG. If the models are trained properly, isolated style recognition will give the maximum likelihood for all the examples; thus we can check the correctness of the trained models. From the likelihoods of each class of examples, we can calculate the mean and variance of that class, and a Gaussian curve can be formed from those values. The overlap between two such Gaussian curves is calculated by the POG method to identify acoustically similar phonemes between the two languages. When the examples of a phoneme are given to a phoneme model, the mean and variance of the resulting likelihoods are calculated and a Gaussian is constructed from those values, as follows.

FIGURE 17 GAUSSIAN CURVE CONSTRUCTION FROM EXAMPLES

If the two Gaussian curves overlap to a great extent, the two classes constituting those curves are most likely similar; otherwise they are different.

FIGURE 18 GAUSSIAN OVERLAP

CHAPTER 6

6. PHONEME ANALYSIS USING POG Based on the particular phoneme’s position in the Gaussian overlap, the relation between the models and that phoneme can be classified as follows. M1 – Mean 1 SD1 – Standard Deviation 1 M2 – Mean 2 SD2 – Standard Deviation 2 -- Log-likelihood of Phoneme Example Case(i) Log-likelihood

In case (i) both models give lower likelihood for the phoneme example under consideration. There is no chaos in this and they are completely different.

Case(ii) Log-likelihood

In case (ii) both models give higher likelihood for the phoneme example under consideration. Here also there is no ambiguity and clearly it says there is no acoustical similarity in that phoneme. Then comes the third case where one model gives higher likelihood and another model gives lower likelihood for the same phoneme example as shown in case (iii) diagram. Here these kind of examples are to be analyzed.


Case (iii) (log-likelihood diagram)

The overlap region generally contains the higher-valued region of one Gaussian curve and the lower-valued region of the other. This region may be termed the "dangerous zone", since it may contain acoustically similar phonemes which are to be merged. This is how POG helps in identifying acoustical similarities between the phonemes of two different classes (here, two different languages).

6.1 ANALYSIS 1: LANGUAGE I EXAMPLES WITH TWO LANGUAGE MODELS

As discussed in Chapter 5.5 JUSTIFICATION FOR USING POG, this analysis is carried out with the phoneme examples of the Tamil language against the Tamil and English models.

FIGURE 19 ANALYSIS 1: TAMIL EXAMPLES WITH LANGUAGE MODELS

Every example of a Tamil phoneme is given to both the Tamil and the English models and their likelihoods are calculated. From the likelihoods, the mean and variance are calculated, from which a Gaussian can be assumed, and the amount of overlap between them can be calculated as discussed in the beginning of this chapter. This POG analysis only gives the amount of similarity between the phonemes; to merge the phonemes of two different languages, we need some criterion or threshold. Here, the condition for merging is set as: "No overlap value of a phone with other phones in the same language should be greater than the overlap of the same phone of the other language." The overlaps obtained by giving Tamil phoneme examples from its inventory to the Tamil and English models are considered as "Reference Values", as shown below. TABLE 3 REFERENCE VALUES OF ANALYSIS 1

A similar analysis should also be carried out among all the phonemes within a language, because if a phoneme has a larger overlap with another phoneme of the same language than its reference value, there is no use in merging that phoneme with a phoneme of the other language. At the same time, one can clearly say there won't be any similarity between a vowel and a stop consonant. Hence, before carrying out the phoneme analysis within a language, it is preferable to classify the phonemes into five classes, namely stop consonants, vowels, fricatives, semi-vowels, and nasals, as follows.


FIGURE 20 TAMIL PHONEME CLASSIFICATION

The POG analysis has been carried out among the phonemes of the same class as mentioned below, and then the condition for merging has been applied.

FIGURE 21 BLOCK DIAGRAM OF ANALYSIS I

However, the examples of a phoneme in a language, given to its own model, will generate a hundred percent overlap. That case is not the major problem. The question to be focused on is how much overlap is obtained when a phoneme example is given to the other phoneme models of the same class and language. The result of the above-mentioned analysis is shown below. TABLE 4 POG RESULT OF TAMIL STOP CONSONANTS /b/ AND /d/

Reference values (Tamil examples against the Tamil and English models):

    Tamil /b/ – English /b/ : 97.517784
    Tamil /d/ – English /d/ : 92.282608

Within-language overlaps (Tamil – Tamil):

    /b/ – /p/ : 99.520767
    /d/ – /p/ : 79.684105
    /d/ – /t/ : 77.252327
    /d/ – /k/ : 77.880081
    /d/ – /g/ : 84.142883
    (the remaining overlaps reported, for /b/ – /t/, /b/ – /d/, /b/ – /k/, /b/ – /g/ and /d/ – /b/, are
    94.066856, 76.261444, 86.823174, 85.329987 and 91.464981)

Decision:

    /b/ : reference value 97.517784 < 99.520767 (/b/ – /p/)  →  CANNOT MERGE /b/ of Tamil and English
    /d/ : reference value 88.240143 > 84.142883 (/d/ – /g/)  →  MERGE /d/ of Tamil and English
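The merging condition stated above can be expressed as a small check. A sketch (Python; the helper name is illustrative), applied to the /b/ and /d/ figures from TABLE 4:

def can_merge(reference_overlap, within_language_overlaps):
    """Merge only if no within-language overlap exceeds the cross-language
    reference overlap of the same phone."""
    return all(o <= reference_overlap for o in within_language_overlaps)

# /b/: the within-Tamil overlap with /p/ (99.520767) exceeds the reference 97.517784
print(can_merge(97.517784, [99.520767]))                                    # False
# /d/: the largest within-Tamil overlap, with /g/ (84.142883), is below the reference 88.240143
print(can_merge(88.240143, [79.684105, 77.252327, 77.880081, 84.142883]))   # True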

Similarly, the phonemes of the other categories are analyzed in the same way, and in total the 10 Tamil phonemes listed below have been identified as acoustically similar phonemes between Tamil and English. TABLE 5 POG RESULT OF TAMIL PHONEMES

/a/ /ch/ /d/ /dh/ /i/ /j/ /m/ /nq/ /s/ /y/

Based on this result, a bilingual system has been developed by following steps 1 to 8, as discussed in Chapter 3.10 BUILDING BILINGUAL TTS FOR TAMIL AND ENGLISH. The phone list used to build this system is given below.

6.2 ANALYSIS 2: LANGUAGE II EXAMPLES WITH TWO LANGUAGE MODELS

Carrying out the comparison in only one direction and taking a decision on that alone may give biased results. So a similar experiment to the one discussed in Chapter 6.1 ANALYSIS 1: LANGUAGE I EXAMPLES WITH TWO LANGUAGE MODELS is carried out again with a small change in parameters. Here, examples of English phonemes are given to both the Tamil and English models and the experiment is carried out as shown in FIGURE 22.

FIGURE 22 ANALYSIS 2: ENGLISH EXAMPLES WITH LANGUAGE MODELS

Examples of each English phoneme are passed from their corresponding inventories to both the Tamil and English models and the likelihoods are calculated. From these likelihoods, the mean and variance are calculated, from which in turn a Gaussian curve can be formed. The likelihoods obtained by giving examples of one language to the phoneme models of the other language may be termed cross likelihoods, to differentiate them from the own-language likelihoods. The Product of Gaussians method is used to calculate the amount of overlap between these Gaussian curves obtained for the same phoneme with its models in the two different languages. As in the previous experimental setup, here also these values are considered as "Reference Values". The reference values obtained by the above-discussed procedure are listed in TABLE 6 REFERENCE VALUES OF ANALYSIS II.

TABLE 6 REFERENCE VALUES OF ANALYSIS II

The same condition is used here to carry out the merging: "No overlap value of a phone with other phones in the same language should be greater than the overlap of the same phone of the other language." Instead of comparing each and every value with the reference value, the maximum overlap of a phoneme model with the others in the same category can be compared with the reference value. If this maximum overlap is itself less than the reference value, that phoneme of the two different languages can be merged; otherwise, that phoneme cannot be merged and should be prepended with a language tag, as explained in Chapter 3.10 BUILDING BILINGUAL TTS FOR TAMIL AND ENGLISH. In total, 40 English phonemes, excluding silence (/SIL/), are categorized into five classes, namely stop consonants, fricatives, vowels, semi-vowels and nasals, as shown below in FIGURE 23 ENGLISH PHONEME CLASSIFICATION.


FIGURE 23 ENGLISH PHONEME CLASSIFICATION

The above-mentioned analysis is carried out among the phonemes of the same category, and the maximum overlap value of a phoneme in a class is compared with its reference value. Here the overlap calculation is explained for the phoneme /w/, which belongs to the semi-vowel category, and the procedure is shown step by step in the following block diagram.

FIGURE 24 BLOCK DIAGRAM OF ANALYSIS II

The output of the analysis for the phoneme /w/ of the semi-vowel category, as described in FIGURE 24 BLOCK DIAGRAM OF ANALYSIS II, is listed below. TABLE 7 POG RESULT OF ENGLISH SEMI-VOWELS

Reference values (English examples against the Tamil and English models):

    Tamil /w/ – English /w/ : 99.874603
    Tamil /y/ – English /y/ : 89.993439

Within-language overlaps (Tamil – Tamil):

    /w/ – /l/ : 99.838074
    /w/ – /r/ : 99.543419
    /w/ – /y/ : 95.659401
    /y/ – /l/ : 70.418198
    /y/ – /r/ : 60.270432
    /y/ – /w/ : 34.439503

Decision:

    /w/ : reference value 99.946846 > 99.838074 (/w/ – /l/)  →  MERGE /w/ of English and Tamil
    /y/ : reference value 85.256615 > 70.418198 (/y/ – /l/)  →  MERGE /y/ of English and Tamil
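The shortcut described above, comparing only the maximum within-language overlap with the reference value, gives the same decision. A sketch using the /w/ figures from TABLE 7 (the helper name is illustrative):

def can_merge_fast(reference_overlap, within_language_overlaps):
    """Equivalent shortcut: only the maximum within-language overlap matters."""
    return max(within_language_overlaps) <= reference_overlap

# /w/: the maximum within-language overlap, 99.838074 (/w/ - /l/), is below the reference 99.946846
print(can_merge_fast(99.946846, [99.838074, 99.543419, 95.659401]))   # True -> merge /w/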

The same procedure is carried out for the rest of the English phonemes, and eight English phonemes are identified as having acoustical similarity with the same phonemes in Tamil. So the list of phonemes that can be merged as per this analysis is as follows. TABLE 8 POG RESULT OF ENGLISH PHONEMES

/E/ /l/ /m/ /nq/ /r/ /w/ /U/ /y/

Analysis 1 lists 10 acoustically similar phonemes, whereas Analysis 2 lists 8 phonemes as having acoustical similarities. Systems cannot keep on being built for every experimental result; to finalize the better system, some conclusion has to be drawn. If a system were built with only the phonemes common to both analysis results, there would not be any appreciable change in memory, since only 4 phonemes, including silence (/SIL/), are in common. TABLE 9 COMMON PHONEMES IN RESULTS OF ANALYSIS I AND II

/m/ /nq/ /y/ /SIL/

To get a considerable change in the footprint size of the system, the results of both the analyses, discussed in Chapter 6.1 ANALYSIS 1: LANGUAGE I EXAMPLES WITH TWO LANGUAGE MODELS and Chapter 6.2 ANALYSIS 2: LANGUAGE II EXAMPLES WITH TWO LANGUAGE MODELS, are merged. TABLE 10 PHONE LIST BASED ON ANALYSIS I & II

/a/ /ch/ /d/ /dh/ /i/ /j/ /m/ /nq/ /s/ /y/ /E/ /l/ /r/ /w/ /U/ /SIL/

Except for these sixteen phonemes, the rest of the phonemes in both languages are tagged with the corresponding language tags. The phone list generated is as follows:

Untagged (shared): a ch d dh i I j l m nq r s w y E U SIL

v-tagged: vA vAI vAU vb ve vg vh vk vL vn vN vo vO vp vqn vqN vR vsh vt vth vu vz

x-tagged: xA xae xah xAI xao xAU xb xeh xer xf xg xh xk xL xo xoy xp xqN xsh xt xth xu xv xZ xzh
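The tagging itself can be done mechanically. A minimal sketch, assuming the sixteen shared phones stay untagged and that "v" and "x" are the Tamil and English tags used in the list above (the inventories shown are abbreviated and illustrative):

SHARED = {"a", "ch", "d", "dh", "i", "j", "m", "nq", "s", "y",
          "E", "l", "r", "w", "U", "SIL"}

def tag_phones(phones, tag):
    """Prepend the language tag to every phone that is not in the shared set."""
    return [p if p in SHARED else tag + p for p in phones]

# Abbreviated, illustrative inventories
tamil_phones   = ["a", "A", "d", "th", "sh", "SIL"]
english_phones = ["a", "ae", "d", "th", "zh", "SIL"]

phone_list = sorted(set(tag_phones(tamil_phones, "v") + tag_phones(english_phones, "x")))
print(phone_list)   # -> ['SIL', 'a', 'd', 'vA', 'vsh', 'vth', 'xae', 'xth', 'xzh']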

With this phone list, another bilingual system has been built, both with and without syllable information, by following steps 1 to 8 as discussed in Chapter 3.10 BUILDING BILINGUAL TTS FOR TAMIL AND ENGLISH. The memory occupied by the system is 3624 KB without syllable information and 3708 KB with syllable information.

CHAPTER 7

7. CONCLUSION

So far, four different bilingual systems have been built: the all-tagged system (Chapter 3.10 BUILDING BILINGUAL TTS FOR TAMIL AND ENGLISH), the blindly merged bilingual system (Chapter 3.11 BUILDING BILINGUAL TTS WITH BLINDLY MERGED PHONEMES), the system based on the Analysis I result (Chapter 6.1 ANALYSIS 1: LANGUAGE I EXAMPLES WITH TWO LANGUAGE MODELS), and the bilingual system based on the results of both analyses (Chapter 6.2 ANALYSIS 2: LANGUAGE II EXAMPLES WITH TWO LANGUAGE MODELS). To conclude the work, a performance evaluation is carried out on the outputs of all four systems using the Mean Opinion Score (MOS). The MOS in TABLE 11 MOS RESULTS OF ALL BILINGUAL SYSTEMS is based on quality, intelligibility, the presence of language switching and the influence of one language on the other in a synthesized bilingual sentence. In a bilingual system, the sentence given to the TTS system may contain words from more than one language. Switching between the words of different languages in the same sentence should not be perceptible in the synthesized waveform; if it is, the sentence is said to exhibit language switching, which is not a desirable feature. Similarly, in bilingual sentences, the influence of one language should not be present on the words of the other language (refer Chapter 3.11.1 NEED FOR FINDING ACOUSTICAL SIMILARITIES AMONG PHONEMES). TABLE 11 MOS RESULTS OF ALL BILINGUAL SYSTEMS
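For reference, the MOS of a system is simply the arithmetic mean of the listeners' opinion scores; a minimal sketch with hypothetical ratings:

def mean_opinion_score(ratings):
    """MOS: arithmetic mean of listener opinion scores (typically on a 1-5 scale)."""
    return sum(ratings) / len(ratings)

# Hypothetical ratings from five listeners for one synthesized bilingual sentence
print(mean_opinion_score([4, 3, 4, 5, 3]))   # -> 3.8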

Another analysis is carried out among all four bilingual TTS systems developed so far, based on the memory occupied. The memory occupied by all four systems is listed below. TABLE 12 MEMORY OCCUPIED BY ALL FOUR SYSTEMS

    System                         Without syllable information (KB)    With syllable information (KB)
    System 1 (All tagged)                        3432                                3789
    System 2 (Blindly merged)                    3696                                3972
    System 3 (10 phonemes)                       3640                                3768
    System 4 (Uncommon)                          3624                                3708

Based on the performance evaluation (TABLE 11 MOS RESULTS OF ALL BILINGUAL SYSTEMS) and the memory occupied (TABLE 12 MEMORY OCCUPIED BY ALL FOUR SYSTEMS), considering both the quality and the footprint size, the HTS system built with the 10 acoustically similar phonemes identified by the experiment discussed in Chapter 6.1 ANALYSIS 1: LANGUAGE I EXAMPLES WITH TWO LANGUAGE MODELS is considered the best bilingual TTS system.


CHAPTER 8

8. FUTURE WORK

Though we conclude that one among the four systems is the better one, the performance of the synthesized voice can still be improved. To get better performance, we need a larger number of examples for each phoneme. If we increase the speech corpus size to obtain more examples, the TTS system may become infeasible to use on portable devices. So the number of examples has to be increased without increasing the database size. One possible way to achieve this is "forcible state tying". Each trained phoneme model has some n number of states (here, 5 states). State tying happens in HTS after tree-based clustering; in general, HTS ties all the states of the phonemes clustered under the same leaf node at the end of tree-based clustering (Chapter 3.9.2 DECISION TREE BASED CLUSTERING).

FIGURE 25 DEFAULT STATE TYING IN HTS

Forcible state tying is the process of tying particular states of different phonemes which are similar. It is achieved by using the "TI" command and specifying the states which need to be tied.

Specific states are forcibly tied

FIGURE 26 FORCIBLE STATE TYING

For example, consider two phonemes /a/ and /AU/. If the collected speech corpus has only a few examples for the phoneme /a/, then the number of examples for /a/ can be increased by forcibly tying the states of /a/ with the particular states of /AU/ that are responsible for the sound /a/.

The 2nd of the 5 states of /a/ and the 2nd of the 5 states of /AU/ can be tied.
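One way such a tie could be expressed is with the model-editing tool HHEd of the HTK/HTS framework, whose TI command ties the listed items into a single shared macro. The sketch below (Python driving HHEd) is illustrative: the file names, macro name and state index are assumptions, and HHEd must be available on the system.

import subprocess

# Forcible state tying sketch: tie one chosen emitting state of /a/ and /AU/
# into a single shared state. In HTK model definitions state 1 is non-emitting,
# so state[3] here refers to the 2nd emitting state.
ed_script = "TI ST_a_AU_2 {(a,AU).state[3]}\n"
with open("tie_a_AU.hed", "w") as f:
    f.write(ed_script)

# Apply the edit script to the trained models and write the tied models out.
subprocess.run(["HHEd", "-H", "hmmdefs", "-M", "tied_models",
                "tie_a_AU.hed", "phone_list"], check=True)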

With the help of forcible state tying, the number of examples of the phonemes can be increased, which in turn increases the quality of synthesized speech considerably.


BIBLIOGRAPHY

[1] Alan W. Black, Kevin A. Lenzo. Festvox Documentation. 1999–2007.
[2] Ben Gold, Nelson Morgan. Speech and Audio Signal Processing. John Wiley & Sons, 2006.
[3] Andrew J. Hunt, Alan W. Black. "Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database." Proc. ICASSP, IEEE, 1996, pp. 373–376.
[4] S. P. Kishore, Alan W. Black. "Unit Size in Unit Selection Speech Synthesis." Proc. Eurospeech, 2003, pp. 1317–1320.
[5] Christof Traber, Karl Huber, Karim Nedir, Beat Pfister, Eric Keller, Brigitte Zellner. "From Multilingual to Polyglot Speech Synthesizer." (n.d.).
[6] Heiga Zen, Keiichi Tokuda, Alan W. Black. "Statistical Parametric Speech Synthesis." Speech Communication, Elsevier, November 2009, pp. 1039–1064.
[7] Hui Liang, Yao Qian, Frank K. Soong, Gongshen Liu. "A Cross-Language State Mapping Approach to Bilingual (Mandarin-English) TTS." Proc. ICASSP, IEEE, 2008, pp. 4641–4644.
[8] Keiichi Tokuda, Heiga Zen, Alan W. Black. "An HMM-Based Speech Synthesis System Applied to English." IEEE Proceedings, 2002, pp. 227–230. Lawrence R. Rabiner, Ronald W. Schafer. Digital Processing of Speech Signals. Pearson Education, 1993.
[9] Richard O. Duda, Peter E. Hart, David G. Stork. Pattern Classification. Wiley, 2001.
[10] S. Saraswathi, R. Vishalakshy. "Design of Multilingual Speech Synthesis System." Intelligent Information Management, January 2010, pp. 58–64.
[11] Sebsibe H/Mariam, S. P. Kishore, Alan W. Black, Rohit Kumar, Rajeev Sangal. "Unit Selection Voice for Amharic Using Festvox." 5th ISCA Speech Synthesis Workshop, Pittsburgh, pp. 103–107.
[12] Steve Young, Gunnar Evermann, et al. The HTK Book. Cambridge University Engineering Department, 2006.
[13] T. Nagarajan, D. O'Shaughnessy. "Bias Estimation and Correction in a Classifier using Product of Likelihood Gaussians." Proc. ICASSP, 2007, pp. 1061–1064.
[14] Yao Qian, Hui Liang, Frank K. Soong. "A Cross-Language State Sharing and Mapping Approach to Bilingual (Mandarin-English) TTS." IEEE Transactions on Audio, Speech and Language Processing, Vol. 17, No. 6, August 2009.
[15] Keiichiro Oura, Heiga Zen, et al. "Tying Covariance Matrices to Reduce the Footprint of HMM-based Speech Synthesis Systems." Proc. Interspeech, 2009.
[16] S. J. Young, J. J. Odell, P. C. Woodland. "Tree-Based State Tying for High Accuracy Acoustic Modelling." pp. 307–312.
