Building a Tamil Text-to-Speech Synthesizer using

1 downloads 0 Views 1MB Size Report
As a requirement, we have developed a complete Tamil Text to Speech System ..... from a book. ..... The core functions and modules of Festival are written in C++ . ...... the figure are the actual boundaries of the phonemes marked by manual ...
Building a Tamil Text-to-Speech Synthesizer using Festival

A Project Report Submitted in Partial Fulllment of the Requirements for the Degree of

Master of Engineering in Signal Processing

by

Sreekanth Majji

Department of Electrical Engineering Indian Institute of Science Bangalore - 560 012 JUNE 2007

Acknowledgments

To begin with the most inuencing factor of this piece of work, I would like to express my deep sense of reverence and gratitude to my guide and advisor, Prof. A. G. Ramakrishnan, for initiating, motivating and leading me through troughs and crests of this work.

I am

indebted to him for his kind cooperation and encouragement during the course of the work. I will remain indebted to all the faculty members of the department especially to Prof. K. R. Ramakrishnan, Prof. Manojkumar Saranathan, and Prof. K. Rajgopal for their kind help and constant encouragement during my course work.

I would like to express my heartful

thanks to Shanker and Laxmi Narayana for their informal discussions during my project work, which gave me some technical insights. My sincere thanks to all my classmate friends especially Krusheel, Viswanath, Arun for their help during my course work. My special thanks goes to Raj Kirubhakar who has given me the initiative to get into the area of TTS. His technical and moral support during the initial stages of my project was invaluable. Finally, I express my deep sense of gratitude to my parents, for their care, encouragement and support.

Abstract Ever since machines have assisted humans in their activities, the idea of developing Talking Machines gained high signicance. Over the recent years, very good quality laboratory and commercially available speech synthesis systems have become available.

Current research

interests in speech synthesis are oriented towards making synthesized speech sound more natural and intelligible. A good prosody model plays a major role in determining the naturalness and intelligibility. Though no such attempt to develop a prosody model has been made in this work, others techniques to improve the speech quality have been looked into. As a requirement, we have developed a complete Tamil Text to Speech System (TTS) in Festival. Festival is a workbench developed by CMU, which provides an environment for both investigation and development of speech synthesis techniques.

Festival supports dierent

methods of concatenative synthesis, and it has been used as a platform for building the unit selection synthesizer. We discuss practical implementation details and we have created and made use of an unit selection database using a phonetically rich Tamil speech corpus with 1027 sentences. The project gives comprehensive details about the various modules in the TTS developed using Festival. In other words, we deal with how to add a new language to the existing set of languages in Festival, keeping the mathematical detail at minimum. The prosody models supplied by Festival have been incorporated into TTS. An optimal coupling technique for joining the units has been used for concatenation and waveform generation. One major problem in the synthesized speech is with the discontinuities in the pitch contour of the utterance, and we address the issue by doing pitch smoothing at the junction of the units. For this, pitch detection in conjuction with a pitch modication algorithm, has been used. The algorithms were rst tested on phonemic units and are later incorporated into the TTS as a post processing technique.

Contents

1

Introduction to TTS 1.1 1.2

Aim of the Project

Steps involved in Text to Speech 1.2.1

1.2.2

2

. . . . . . . . . . . . . . . . . . . . . . . .

2 2

1.2.1.1

Text Analysis . . . . . . . . . . . . . . . . . . . . . . . . . .

2

1.2.1.2

Linguistic Analysis . . . . . . . . . . . . . . . . . . . . . . .

2

1.2.1.3

Prosody Modeling

. . . . . . . . . . . . . . . . . . . . . . .

3

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4

DSP Component

Waveform Generation

4 5

7

2.1

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

2.2

Using Festival . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

2.4

System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

2.3.1

Utterance Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

2.3.2

Utterance Types

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

11

2.3.3

Utterance access and use . . . . . . . . . . . . . . . . . . . . . . . . .

12

Voice in a new language

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

12

Unit Selection in Festival

14

3.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

14

3.2

4

. . . . . . . . . . . . . . . . . . . . .

Practical Issues in TTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Festival Speech Synthesis System

2.3

3

1

NLP component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.2.2.1 1.3

1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

14

3.2.1

Clustering Units

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

3.2.2

Selecting Units

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

16

Building a Tamil Voice - Training

18

4.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

18

4.2

Developing a Tamil Voice in Festival

. . . . . . . . . . . . . . . . . . . . . .

18

4.3

The Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

19

4.3.1

System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . .

19

4.3.2

Data Description

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

19

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

19

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

20

Utterance Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

20

4.4

Building a Phoneset

4.5

Text Analysis

4.6 4.7

Making Cepstrum Parameter Files

. . . . . . . . . . . . . . . . . . . . . . .

21

4.8

Building the Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

21

CONTENTS 5

Synthesizing an Utterance - Testing 5.1

Grapheme-to-Phoneme Conversion

. . . . . . . . . . . . . . . . . . . . . . .

23

5.2

Building Prosodic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

23

5.2.1

Duration Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

23

5.2.2

Intonation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

5.3

Creating Utterance Structure

. . . . . . . . . . . . . . . . . . . . . . . . . .

25

5.4

Waveform Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

5.4.1

Unit Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

26

5.5

Units in Festival . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

27

5.6

Joining the Units

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

29

5.7

Post Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

30

5.8

Reducing Pitch Discontinuities . . . . . . . . . . . . . . . . . . . . . . . . . .

30

5.8.1

Pitch Estimation

31

5.8.2

Autocorrelation Method

. . . . . . . . . . . . . . . . . . . . . . . . .

31

5.8.3

Harmonic Product Spectrum Method . . . . . . . . . . . . . . . . . .

32

5.9

6

23

Pitch Modication

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

35

Results, Conclusions, and Future Work

39

6.1

Subjective Evaluation Tests

. . . . . . . . . . . . . . . . . . . . . . . . . . .

39

6.2

Future Scope

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

40

A Sample Files in Festival

42

Bibliography

55

List of Figures

3.1 3.2 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10

Partitioning of a class of units to minimise cluster impurity . . . . Viterbi search path through the best sequence of candidate units

. . . . . . . .

16

. . . . . . . .

17

A simple CART tree with clusters at its leaf nodes . . . . . . . . . . . . . . . . Waveform for unit /a/ (the selected portion) as seen in Praat . . . . . . . . . . Waveform for unit /L/ (the selected portion) as seen in Praat . . . . . . . . . . Synthesized Speech: Portion of the waveform and its pitch contour. Encircled are the abrupt pitch discontinuities . . . . . . . . . . . . . . . . . . . . . . . . . . (a) Waveform of a segment of the vowel /a/ (b) The autocorrelation function of the signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Implementation of HPS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . Spectrum of the signal shown in gure 5.5 (a) and its downsampled versions. The topmost curve is the sum of all these spectra . . . . . . . . . . . . . . . . . . . Portion of a pitch contour where smoothening has been done by curve-tting . . Illustration of pitch modication for PSF>1 : (a) Original frame (b) Pitch scaled version (c) Time scaled version of (b) . . . . . . . . . . . . . . . . . . . . . . . Illustration of pitch modication for PSF Festival oers what is called a read-eval-print loop, because it reads an s-expression (atom or list), evaluates it and prints the result.

festival> (car '(a d)) a festival> (+ 15 20) 35 In addition to the standard Scheme commands, a number of commands specic to speech synthesis are included. Although there are simple methods for getting Festival to speak, there are some basic underlying explicit functions used in synthesizing an utterance.

2.2 Using Festival

8

Utterances can be of various types (see section 2.3.1), but the simplest form is plain text. We can create an utterance and store it in a variable.

festival> (set! utt1 (Utterance Text Hello World)) # festival> The (hex) number in the return value is the print form of the utterance.

The internal

structure can be very large, so only a token form is printed. Although this creates an utterance, it doesn't do anything else. To get a waveform, we must synthesize it.

festival> (utt.synth utt1) # festival> This calls various modules, including tokenization, duration, and intonation. Which modules are called is dened with respect to the type of the utterance, in this case is possible to individually call the modules by hand, but

utt.synth

Text.

It

integrates all the

modules into one. Then,

festival> (utt.play utt1) # festival> will send the synthesized waveform to the audio device. The speech is in one of the builtin voices supplied along with Festival, and one of these can be set as the current voice. To make this all easier, a small function doing these three steps exists.

SayText

simply

takes a string of text, synthesizes it, and sends it to the audio device.

festival> (SayText Good morning, welcome to Festival) # festival> Festival may also synthesize from les rather than simply text.

festival> (tts myfile nil) nil festival> Rather than starting from the command interpreter, Festival may synthesize les specied on the command line.

$ festival tts myfile Sometimes a simple waveform is required from text that is to be kept and played at some later time. The simplest way to do this with Festival is by using the `text2wave' program. This is a Festival script that will take a le (or text from standard input) and produce a single waveform. An example use is

$ text2wave myfile.txt -o myfile.wav

2.3 System architecture

9

In the command mode, we can run a batch of commands using -b option. script can be run in festival environment using

A scheme

script

festival -b commandfiles.scm festival --script make_utts The text mode is selected by tts option. For invoking the text mode of synthesis:

festival --tts example.txt This command synthesizes the text contained in le example.txt. We can combine unix commands with Festival text mode.

$ echo Hello World | festival --tts

2.3 System architecture Festival integrates various processes and modules in a structured way. The motive is to use the outcome of one or more modules in another module in a systematic fashion. To achieve this, Festival introduces the concept of an

utterance structure [3].

Each text to be synthesized

is treated as an individual utterance structure. The various processes and modules in Festival act on the utterance structure to synthesize the text into waveform.

2.3.1

Utterance Structure

The utterance structure lies at the heart of Festival.

Festival's basic object for synthesis

is utterance, which represents some chunk of text that is to be rendered as speech.

In

general, the process of text to speech is to take an utterance, which contains a simple string of characters, and convert it step by step, lling out the utterance structure with more information until a waveform is built that says what the text contains. The processes involved in the conversion are, in general, as follows.



Tokenization



Token identication



Token to word



Part of Speech



Prosodic phrasing



Lexical lookup



Intonational accents



Assign duration



Generate F0 contour

2.3 System architecture •

10

Render waveform

More details on what occurs in each of these steps can be found in [1]. The number of steps and what actually happens may vary and is dependent on the particular voice selected and the utterance type. Each of these steps in Festival is achieved by a information to the utterance structure. An utterance structure consists of a set of

relations.

module,

items,

which will typically add new

which may be part of one or more

Items represent things like words and phones, though they may also be used to

represent less concrete objects like noun phrases. An item consists a set of features (name and value). the

Word

Relations are typically a simple list of items or trees of items.

For example,

relation is a simple list of items, each of which represent a word in the utterance.

Those words will also be in other relations, such as

SylStructure

relation, where the word

will be the top of a tree structure containing its syllables and segments. Depending on the language and the amount of information one can get from the database, we can use a subset of the total relations dened in Festival. For the Tamil language, with the database we have, only the following set of relations are used.

Text relations:

Word

The list of words in the input string.

Syllable the

A list of syllables in the text with positional index. Each member will also be in

SylStructure

relation. In that relation, its parent will be the word it is in and

its daughters will be the segments that are in it. Syllables are also in the

Intonation

relation giving links to their related intonational events.

Segment

A list of segments in the text. Each member will be a leaf node in the

relation. They may also be in the

SylStructure

Target

SylStructure

relation linking them to the F0 points

The tree linking the segment with the syllables and words relation. Each

Word

is the root of a tree, whose immediate daughters are its syllables and their daughters in turn are its segments.

Prosodic relations:

Intonation F0-Target Duration

A tree list with root as Intonation event and leaves in the syllable relation. A list of trees relating the segments with the F0 values.

A list of durations for the segments.

Wave relations:

Wave

A relation linking the utterance structure with the audio waveform.

For a sample le of utterance structure, see Appendix A.

2.3 System architecture 2.3.2

11

Utterance Types

The primary purpose of types is to dene which modules are to be applied to an utterance. The function

defUttType denes which modules are to be applied to an utterance of that utt.synth when invoked, applies this list of modules to an utterance

type. The function

before waveform synthesis is called.

Segment type utterance is synthesized, it needs to only have its Segment relation and a Target relation. This is dened as follows

For example, when a values loaded into a

(defUttType Segments (Initialize utt) (Wave_Synth utt)) A more complex type is

Text

type utterance, which requires many more modules to be

called before a waveform can be synthesized.

(defUttType Segments (Initialize utt) (Text utt) (Token utt) (POS utt) (Phrasify utt) (Word utt) (Intonation utt) (Duration utt) (Int_Targets utt) (Wave_Synth utt) ) These types are used when calling the function

utt.synth,

as shown in section 2.2. In

our project, we avoided the use of this in built function, as this requires the use of all the relations specied in Festival. Since we use only a subset of relations, we call the individual modules explicitly by hand, as required. A number of utterance types are currently supported in Festival. The prominent of them are dened below:

text

Raw text as a string

(Utterance Text This is an example) Words

A list of words

(Utterance Words (This is an example)) Segments

This allows specication of segments, durations, and F0 target values.

(Utterance Segments ((# 0.19) (h 0.055 (0 155)) (a 0.037 (0.018 136)) (l 0.064)

2.4 Voice in a new language

)

12

(o 0.208 (0.0 134) (0.1 135) (0.208 123)) (# 0.19))

The format of each segment entry is segment name, duration in seconds, and list of target values.

Each target value consists of a pair of points into the segment (in

seconds) and F0 value in Hz.

Phones

This allows a simple specication of phones. Synthesis species xed durations, and

monotone intonation.

2.3.3

Utterance access and use

The utterance structure is the key object in festival speech synthesis system. Easy access methods are available as inbuilt scheme functions.

;; load utterance object festival>(set! utt(utt.load nil "/home/festival/utts/sample_01.utt")) ;; Relations can be saved in an

Xlabel

format and loaded at ease.

;; to save Word relation in Xlabel festival>(utt.save.relation utt 'Word "utt_word.Word") ;; DSP procedures (EST tools) can be accessed by the utterance structure via simple functions. Utterance structure can also associate a standard audio waveform to the text, using access functions.

;; to load wave audio festival> (utt.relation.load utt 'Wave "work/sample1.wav") ;; ;; The easiness of access functions simplies the TTS development task.

2.4 Voice in a new language In Festival, a new language TTS requires a new

voice model

denition. Voice model of a

language describes the language phoneset, basic lexical rules and speaker specic details. Festival version 1.4.3 has voice models dened for British-English, Spanish and AmericanEnglish languages. To dene a voice model for Tamil, we require the following:



Phone set for the language



Token processing rules

2.4 Voice in a new language •

Utterance creation



Prosodic modeling



Unit selection



Waveform synthesis

13

The quality of the voice is determined by the amount of time and eort put in each of the modules. The steps involved in voice creation are elaborated in chapter 4.

Chapter 3 Unit Selection in Festival

3.1 Introduction Speech synthesis by concatenation of sub-word units (e.g. diphones) has become basic technology. It produces reliable clear speech and is the basis for a number of commercial systems. However, with simple diphones, although the speech is clear, it does not have the naturalness of real speech. In attempts to improve naturalness, a variety of techniques have been recently reported which expand the inventory of units used in concatenation from the basic diphone schema. There are a number of directions in which this has been done, both in changing the size of the units, the classication of the units themselves, and the number of occurrences of each unit. A convenient term for these approaches is selection based synthesis. In general, there is a large database of speech with a variable number of units of each particular class. The goal of these algorithms is to select the best sequence of units from all the possibilities in the database, and concatenate them to produce the nal speech. The higher level (linguistic) components of the system produce a target specication, which is a sequence of target units, each of which is associated with a set of features. In the algorithm described here, the database units are phones, but they can be diphones or other sized units. In the work of Sagisaka et al. [12], units are of variable length, giving rise to the term non-uniform unit synthesis. In that sense, the units used here are uniform. The features include both phonetic and prosodic context, for instance the duration of the unit, or its position in a syllable. The selection algorithm has two jobs: (1) to nd units in the database which best match this target specication and (2) to nd units which join together smoothly.

3.2 Clustering Algorithm The basic approach [10] is to cluster units within a unit type (i.e. a particular phone) based on questions concerning prosodic and phonetic context. Specically, these questions relate to information that can be produced by the linguistic component. Example questions are: is the unit phrase-nal, or is the unit in a stressed syllable.

Thus for each phone in the

database, a decision tree is constructed whose leaves are a list of database units that are best identied by the questions which lead to that leaf. At synthesis time, for each target in the target specication, the appropriate decision tree is used to nd the best cluster of

3.2 Clustering Algorithm

15

candidate units. A search is then made to nd the best path through the candidate units that takes into account the distance of a candidate unit from its cluster center and the cost of joining two adjacent units. This section deals with two topics - clustering, which is a part of TTS training, and selecting, which comes as a part of TTS testing. Both training and testing are discussed in detail in chapters 4 and 5, respectively.

3.2.1

Clustering Units

To cluster the units, we rst dene an acoustic measure to quantify the distance between two units of the same phone type. We use an acoustic vector which comprises Mel frequency cepstrum coecients, F0, power, and delta cepstrum. The acoustic distance between two units is simply the average distance for the vectors of all the frames in the units plus

x%

of the frames in the previous units, which helps ensure that close units will have similar preceding contexts. More formally, we use a weighted mahalanobis distance metric to dene the acoustic distance

Adist (U, V )

between two units

U

and

V

of the same phoneme class as:

if |V | > |U | |U |

n

Wd ∗ |U | X X Wj (abs(Fij (U ) − F(i∗|V |/|U |)j (V ))) Adist (U, V ) = ∗ |V | SDj ∗ n ∗ |U | i=1 j=1

(3.1)

where,

|V | = No. of frames in V |U | = No. of frames in U Fij (U ) = Parameter j of frame i for unit U Wj = Weight factor for parameter j SDj = Standard deviation for parameter j Wd = Duration penalty weigthing the dierence n = No. of parameters

between the lengths of the two units

This measure gives the mean weighted distance between units, with the shorter unit linearly interpolated to the longer unit. This acoustic measure is used to dene the impurity of a cluster of units as the mean acoustic distance between all members. The object is to split clusters based on questions to produce a better classication of the units. A CART method [11] is used to build a decision tree, whose questions best minimise the impurity of the sub-clusters at that point in the tree. A standard greedy algorithm is used for building the tree. This technique may not be globally optimal but a full global search would be prohibitively computationally expensive. A minimum cluster size is specied (typically between 10-20).

a

Figure 3.1 shows how a bag of database units of the phoneme class / / are clustered. Depending on the features the units possess, they are put into dierent clusters. Eectively, the whole feature space of the units is divided into certain regions, and the units falling in the same region are made as one cluster. Once the clusters are formed, the cluster centers are determined. Although the available questions are the same for each phone type, the tree building algorithm will select only the questions that are signicant in partitioning that particular type. The features used for CART questions include only those features that are available

3.2 Clustering Algorithm

Figure 3.1:

16

Partitioning of a class of units to minimise cluster impurity

for target phones during synthesis. In our experiments, these were: previous and following phonetic context (both phonetic identity and phonetic features), prosodic context (pitch and duration including those of previous and next units), position in syllable, and position in phrase. Full list of features used can be found in appendix A. Dierent features are signicant for dierent phones by their very nature, for example, a feature representing pitch is rarely used in unvoiced consonants. The CART building algorithm implicitly deals with sparseness of units in that it will only split a cluster if there are sucient examples and signicant dierence to warrant it.

3.2.2

Selecting Units

At synthesis time, we have a stream of target segments that we wish to synthesize. For each target, we use the CART for that unit type, and ask the questions to nd the appropriate cluster which provides a set of candidate units. distance of a unit

U

of the optimal coupling point between a candidate unit

Ui−1

Tdist (U ) is dened as the Jdist (Ui , Ui−1 ) as the join cost

The function

to its cluster center, and the function

Ui

and the previous candidate unit

it is to be joined to. We then use a Viterbi search to nd the optimal path through the

candidate units that minimizes the following expression:

N X

Tdist (Ui ) + W ∗ Jdist (Ui , Ui−1 )

(3.2)

i=1

W

allows a weight to be set optimizing join cost over target cost. Given that clusters typically

contain units that are very close, the join cost is usually the more important measure and

3.2 Clustering Algorithm

17

hence is weighted accordingly.

Figure 3.2:

Viterbi search path through the best sequence of candidate units

A pictorial representation of the above expression is given in gure 3.2. For each target unit, we have more than one matching candidate unit from the database. This match is not exact, and hence the amount of mismatch shows up as target cost. Given a set of units for each target unit, Viterbi algorithm uses the join costs (between adjacent units) and target costs (between target and the candidate units) and selects that sequence of units for which the combined cost is minimum. The Viterbi algorithm is a dynamic programming technique which recursively nds the best path i.e., the sequence of units which have a minimum M cumulative cost. Theoretically, there exist N paths, where N is the number of candidate units for each unit in the input string (for simplicity, we assume here that all the target units have the same number of candidate units), and

M

is the number of target units. These

paths are shown as dashed lines in the gure. But Viterbi algorithm selects the best paths in each stage, and ultimately selects the best path out of them by back tracking. This optimal path can be seen as a thick solid line in the gure.

Chapter 4 Building a Tamil Voice - Training

4.1 Introduction This chapter describes one of the main stages in the Tamil TTS - training. Training refers to the steps through which one uses a labelled database to make a system capable of producing a speech output. Training results in arrangement of the units extracted from the database in a format, from which the units can be selected meaningfully and with ease. As soon as the database is recorded and properly segmented and labelled, training is carried once for all. In Festival, this process is termed as

building a voice.

4.2 Developing a Tamil Voice in Festival Festival framework oers an environment to create and add new voices and languages to the system.

Building a voice of high quality for a specic language is a challenging and

dicult task. It is an open research problem. Much of the voice quality is determined by the waveform generation methods. The application domain also inuences the voice quality. Tailor-made applications like limited domain synthesis can have voice models of very high quality. In this section, we discuss the development of a unit selection voice model for unrestricted Tamil text. Simple rules for prosodic modeling, and waveform generation by simple concatenation are implemented. The basic processes involved in building a waveform synthesizer for the clustering algorithm are as follows.



Collecting the database of general speech.



Building utterance structures for the database



Building coecients for acoustic distances: typically, some form of cepstrum plus F0.



Building distances tables, precalculating the acoustic distance between each unit of the same phone type.



Providing each unit type with selection features (phone context, prosodic, positional)



Building cluster trees with the features and acoustic distances dumped by the previous two stages

4.3 The Setup •

19

Building the voice description itself

4.3 The Setup 4.3.1

System Description

The following are the system requirements:



The platform is Linux, version 2.4.21



Festival version is 1.4.3



EST tools version is 1.2.3



C++ compiler, version 3.2.2

Additional requirements for the design of TTS using Festival is given in chapter 6 of [3].

4.3.2

Data Description

For building a voice model, we use a single speaker data corpus of 1027 sentences.

The

sentences are recorded in noise free conditions, and sampled at 16 kHz. All the sentences are declarative and the speaker has a clear voice. For the purpose of speech synthesis, the corpus is manually segmented and labeled.

4.4 Building a Phoneset A new language requires a unique phoneset.

The phoneset should contain all the vowel

and consonant descriptions of the language.

The notion of phonesets is important to a

number of dierent sub-systems within Festival. The lexicons, letter to sound rules, waveform synthesizers, etc. all require the denition of a phoneset before they will operate. After text normalization, the text is phonetically transcribed using the phoneset. A phoneset is a set of symbols, which are dened in terms of features, such as vowel/consonant, place of articulation for consonants, and type of vowel. The set of features and their values must be dened with the phoneset. The denition is used to ensure compatibility between sub-systems as well as allowing groups of phones in various prediction systems (e.g. duration). A phoneset denition has the form

(defPhoneset NAME FEATUREDEFS PHONEDEFS ) The

NAME is any unique symbol used. FEATUREDEFS is a list of denitions, each consisting

of a feature name and its possible values. For example,

4.5 Text Analysis ( )

20

(vc + -) ;;vowel consonant vlength short long dipthong schwa 0) ;;vowel length ...

These descriptions are employed in syllabication and tree building in the unit selection scheme. The third section is a list of phone denitions themselves. Each phone denition consists of a phone name and the values for each feature in the order the features were dened in the

FEATUREDEFS

section. The phoneset le for Tamil language can be seen in appendix A

4.5 Text Analysis Text analysis comprises

text normalization

and

phonetic transcription.

In our work, the

input text given to Festival is normalized and phonetically transcribed. We carefully select text such that it needs no

text normalization.

phonetic transcription of input text.

We use two external tools for the task of

The end user inputs text in native language format

(Tamil), using iLEAP, which is an Indian language wordprocessor tool, developed by CDAC. iLEAP provides native language font support, and an user friendly GUI. The input text is exported into the ISCII format.

A grapheme to phoneme tool (G2P) is used to convert

the ISCII format data into its phonetic transcription. Hence our input to Festival speech synthesizer is a normalized, and phonetically transcribed text.

4.6 Utterance Creation The following information is essential for creating an utterance structure (see section 2.3.1): 1. Segment information with correct boundaries 2. Syllable information with boundaries 3. Word information with boundaries 4. Intonation events aligned to the syllable 5. Target F0 for the segments 6. Phrase information

Segment Information

For our Tamil TTS, we have the manually segmented and labelled

speech corpus. Segmental boundary is obtained directly from this information.

Syllable Information

We syllabify the segmented labels into VC, CV, CVC, CCVC,

CVCC combinations using a scheme script. At every instance of pause (#) character, we begin or break the text into a new syllable. We actually require a linguistic syllabication module.

4.7 Making Cepstrum Parameter Files Word Information

21

The database used has word boundaries marked with the word limiter

(#) in the text sentences.

By this, the words in each of the sentences, along with their

timings, have been extracted.

Intonation Events

Intonation events describe the fundamental frequency (F0) contour

peak F0, positive tilt

in a speech waveform. We quantize the intonation in terms of towards the peak), and

Target-F0

negative tilt

(slope from the peak) features.

Target-F0 describes the F0 values for a segment (phoneme). We calculate F0

values at the start, mid and end point of the phoneme unit.

The pitch is detected using

the built-in function (pda) in EST library [5]. We use the maximum

Target-F0

(slope

feature. Other alternative is to take

Phrase Information

average F0

peak F0

value as the

value as the value of

Target-F0.

We do not use phrase information in our utterance creation. The

complete sentence is treated as a single phrase. A lot of linguistic analysis is required to nd the phrase breaks in the sentence. Once all the above information are collected, we use the script Festival, to create a basic

utterance structure.

make_utts,

available in

The utterance structure will be used and

modied in the various stages of TTS development.

4.7 Making Cepstrum Parameter Files In order to cluster similar units in the database, we build an acoustic representation of them. This also is still a research issue but in the current work, we used Mel cepstrum. Interestingly, we do not generate these at xed intervals, but at pitch marks. Thus we have a parametric spectral representation of each pitch period. We have found this a better method, though it does require that pitch marks are reasonably identied. The script le

make_mcep takes wave

les as input and gives the MFC Coecients of the signal. It assumes that the pitch marks (obtained using

make_pm_wave)

are already obtained for that corresponding waveform, for

which MFCC's are being calculated.

4.8 Building the Clusters Festival provides a set of features (for the in-built languages) to build a cluster model for a database that has utterance structures and acoustic parameters. Those features can be as well used for a new language, with some additions/modications. The function

build_clunits is used for building the unit clusters, and the function runs

through the following stages:



The rst stage is to load all the utterances in the database, sort them into segment type and name them with individual names.



The next stage is to load the acoustic parameters and build the distance tables. The acoustic distance between each segment of the same type is calculated and saved in the distance table. Precalculating this saves a lot of time as the value is required many times.

4.8 Building the Clusters •

22

The next stage is to feed the features that are be used to index the clusters. function

dump_features

in the database.

The

does this job. It calculates the feature values for each unit

Appendix A shows one such le, where the features of all the

/I/

units in the database are put. The clusters are dened with respect to the acoustic distance between each unit in the cluster, but they are indexed by these features. These features then become available at synthesis time when no acoustic information is available. Thus they include features like phonetic and prosodic context, rather than spectral information. The full list of features is given in Appendix A.



Now that we have the acoustic distances and the feature descriptions (see Appendix A) of each unit, the next stage is to nd a relationship between the features and the acoustic distances. This we accomplish using the CART tree builder

wagon.

It nds out

questions about which features best minimize the acoustic distance between the units in that class.

wagon has many options many of which are opposite to this task though

it is interesting that this learning task is closed. That is, we are trying to classify all the units in the database; there is no test set as such. However, in synthesis, there will be desired units whose feature vector didn't exist in the training set. During tree building,

wagon uses two important parameters.

The rst one is

prune_reduce.

This gives the number of elements in each cluster to be removed in pruning. This removes the units in the cluster that are furthest from the center. This helps in pruning the tree without sacricing the performance of clustering, and at the same time saves time. The next parameter is

cluster_prune_limit, which denes the maximum num-

ber of units present in a cluster at a tree leaf. This parameter is actually used in post wagon build operation on the generated trees.

This is useful when there are large

number of a particular unit type, which cannot be dierentiated.



The nal stage in building a cluster model is to collect the generated trees into a single le and loading the unit catalogue, i.e. the list of unit names and their les and position in them (see Appendix A).

Chapter 5 Synthesizing an Utterance - Testing

Once the training is over, we come to the most important stage of the TTS - Testing. In testing phase, the system is given a text input, and a waveform is synthesized using the voice already built during training. This chapter explains the processes involved in the implementation of the waveform synthesizer, and concludes with a post-processing technique to improve the speech quality.

5.1 Grapheme-to-Phoneme Conversion The input to the Tamil TTS is a Tamil text sentence in ISCII format.

The phonemic

transcription of the sentence is obtained using a program called G2P. It is a module developed in our lab, and it uses a set of Letter-to-Sound rules, and a mapping table to get the nal output as a sequence of phonemes in ASCII format. These phonemes are the ones dened in the Tamil phoneset (see Appendix A). The G2P inserts # symbols between words which can later be used to delineate the words in the input sentence.

5.2 Building Prosodic Model Before the utterance structure of a sentence can be created, we need to predict the prosodic parameters of the sentence from the given text. We have used the prosody models provided by Festival. Prosody model involves the determination of two things: duration and F0. The models for these two parameters are separately discussed below.

5.2.1

Duration Model

Festival provides dierent duration models of increasing complexity and sophistication.

Simple Models

These include assigning xed duration to the phones, assigning average

durations to the respective phones, assigning duration to the phone depending on the position in the phrase by multiplying the average duration by a constant.

Tree Duration Model

A better way to build a duration model is to train from data

rather than arbitrarily selecting modication factors. Given a reasonable sized database, we can measure durations and features for each segment in the database. Then we can train

5.2 Building Prosodic Model

24

a model using those measurements. The advantage is that we can include the contextual information in predicting the duration.

The features used for building the duration tree

reect the phonetic context. Once we have the features and their descriptions, we use the program

wagon to build the tree, which is later used for predicting the duration.

The CART

tree implicitly contains the contextual information, because of the way the splitting occurs. It has been found that a better representation for duration modeling is

zscores,

i.e.

number of standard deviations from the mean. In order to train a zscore model we need to convert the absolute segment durations. To do that we compute the means and standard deviations for each unit type in the phoneset. For each phoneme in the input text, the features are extracted, and the phoneme traverses through the tree depending on the questions (about the features) at the nodes.

When it

reaches a leaf node, the zscore values of the units at the node are taken, their average is computed, and is taken as the zscore of the phoneme under consideration. To get the absolute value of duration from zscore, the following relation is used:

dur = zscore*stddev + mean where,

mean

and

stddev

correspond to the phoneme class to which the current phoneme

belongs.

5.2.2

Intonation Model

During training, two dierent trees are built - intonation tree and targetF0 tree. The intona-

postilt (positive slope) and negtilt (negative slope) as the features, and its nodes contain the weights and corresponding peakF0 values of all the syllables in the

tion tree is built using

training set, classied based on the two features. The targetF0 tree is not a tree, but it contains a set of intercepts and coecients obtained by Linear Regression (LR). For every syllable in the database, we calculate the start, mid and end F0 values, in addition to

postilt

and

negtilt.

The following assumptions have

been made:

peakF0



startF0 : inuenced by



midF0 : inuenced by

postilt



endF0 : inuenced by

peakF0

and the and

and

postilt

of the syllable unit

negtilt

negtilt

of the syllable unit

Based on these, we form three equations, which relate the F0 values to the corresponding variables. These equations are of the form

y = a0 + a1 x1 + a2 x2 By using LR technique, we nd the intercepts and coecients present in these equations. Hence we have three dierent sets of coecients, one for each of the three dierent types of F0. Once the trees are built, they can be used during run-time for generating the F0 values of the text to be synthesized. The F0 generation during testing can be summarized in the following steps:



For each of the syllables in the input text, nd the best matching syllable from the database based on context. Let us suppose that there are

n

such syllables.

5.3 Creating Utterance Structure •

From the intonation tree generated during training, we get the

n

peakF0

values of those

selected syllables. The one having the highest weight is taken as the best matching

syllable, and its



25

peakF0

is taken.

Now that we have F0 values for each of the syllables in the input text, we obtain an intonation contour by interpolation.



The next step would be to t a hat model to the intonation contour of each segment. The LR coecients are obtained for the startF0, midF0, and endF0 using the database, as described earlier.



These coecients/weights are then used to nd the F0 values of the segments in the input text using the equation given below.

F = F0 +

2 X

wi fi

i=1 where,

F0 wi fi

= Intercept, = weights generated by LR

= Feature values on which

F

depends

The whole idea is to nd the appropriate F0 target value for each syllable based on available features (postilt and

negtilt), by training from data.

A set of features is collected

for each syllable and a linear regression model is used to model three points on each syllable - start, mid, and end [3]. The predicted F0 values can be used to compute the target costs used in the unit selection module.

5.3 Creating Utterance Structure Once all the relations (phonetic and prosodic) required to create an utterance structure are obtained, we use the script

make_utts to build the utterance structure for the input sentence.

This is similar to creating utterance structure for a sentence in a database (see section 4.6), except that the prosodic relations can be readily calculated for the latter, whereas, the prosodic relations for the test sentence have to be predicted from the models used.

The

utterance structure contains all the information that can be obtained from the sentence; more the information we have, better the synthesis could be.

5.4 Waveform Synthesis Waveform generation is the last stage in TTS. The methods discussed in section 1.2.2 namely,

formant synthesis,

and

articulatory synthesis

have, so far, not produced speech of natural

quality. Both methods try to produce speech from a knowledge set of rules and parameters. Speech is not so simple a signal that can be completely described through a set of rules and formulas. Also synthesized speech must contain the essential co-articulation eects to make it sound natural. The solution is to concatenate waveform units selected based on matched context. This forms the crux of concatenative synthesis.

5.4 Waveform Synthesis 5.4.1

26

Unit Selection

The utterance structure created for the sentence gives the necessary information for the unit selection module to pick up the right sequence of units to concatenate. The cluster selection tree built during training is a CART tree, which arranges the total available units in the database according to the context in which they occur. The decision tree contains a binary question (yes/no answer) about some feature at each node in the tree (section 3.2.1). The basic syntax of a CART tree is as follows:

TREE ::= LEAF | QUESTION-NODE QUESTION-NODE ::= ( QUESTION YES-NODE NO-NODE ) YES-NODE ::= TREE NO-NODE ::= TREE QUESTION ::= ( FEATURENAME is VALUE ) | ( FEATURENAME = FLOAT ) | ( FEATURENAME < FLOAT ) | ( FEATURENAME > FLOAT ) | ( FEATURENAME matches REGEX ) | ( FEATURENAME in ( VALUE0 VALUE1 ...

)

LEAF ::= ( STDDEV MEAN ) | ( ( VALUE0 PROB0 ) ( VALUE1 PROB1 ) ...

) MOSTPROBVAL ) ) |

any other lisp s-expression The leaves of the tree contain the best prediction based on the training data. This means that the segments of the same class, with similar features are clustered and put in the leaf node of the tree. Figure 5.1 shows a simple CART tree with just two levels.

At each question node, a

question (Qi ) is asked, and the node is split into two. This continues until the leaf nodes. The leaf nodes contain the clustered units. Each phoneme in the utterance structure starts from the root node of the tree. At every node, a question about a particular feature is asked. to the

YES-NODE

or the

NO-NODE.

Since the answer is binary, it goes

Each phoneme class has its own decision tree, and the

phoneme in the input sentence traverses the corresponding tree. Ultimately, the path ends up in a leaf node, which contains a cluster of units having the same phonetic contexts but dierent prosodic and acoustic parameters. This introduces some impurity and spread for the cluster. All the units in the cluster are taken as the possible candidates for the target unit. The target cost for a particular candidate unit is the cost of selecting that unit, and it is quantied as the distance of that unit from the center of the cluster in which it is located. In other words, the target cost gives a measure of how close the candidate unit is to the actual ideal unit as required by the target unit. Hence lower the cost, higher the resemblance of the candidate unit to the target unit. But closeness to the target unit is not an sucient criterion to select it for concatenation. The selected unit should also t in between the other units, so as to minimize the pitch and spectral discontinuities at the joins. Hence the concept of join cost is used for selecting the sequence of units. Join cost species the cost of joining two adjacent units. Lower the join cost, lesser the discontinuity at the junction. The join cost and the target cost are together used to pick the right unit. Details of how units are selected can be found in section 3.2.2.

5.5 Units in Festival

Figure 5.1:

27

A simple CART tree with clusters at its leaf nodes

5.5 Units in Festival Though Festival has a unit selection synthesizer, the scope of it is limited. This is because the size of units is uniform - it is a phone.

Hence the performance of this system would

be less as compared to the one using non-uniform units.

Even though there are multiple

occurrences of the phonemes in dierent contexts throughout the database, ultimately we will be concatenating phonemes. Choosing phones as the units has the obvious disadvantage that the number of concatenation points would be equal to the number of phonemes in the sentence. The other drawback is that the number of candidate units satisfying a given criterion would be very large. So it aects the time consumed in nding the best sequence of units using Viterbi search. Even it increases the time to build the cluster selection trees, because the number of distance calculations are expensively high. One more serious problem is that the co-articulation will not be eectively captured when phonemes are joined. However, this has been partially overcome in Festival by cleverly choosing the unit. Though the basic unit is a phoneme, it does not always pick up from the start of the phoneme to the end of the phoneme. In fact, it is from the mid of the previous phoneme to the end of the current phoneme. In verbose mode, the unit selection module prints output, which looks something similar to this: ... ...

unit: a_9468 leid sample_114 start 44.952999 end 45.018501 unit: L_1024 leid sample_140 start 3.987875 end 4.074260 ...

5.5 Units in Festival

28

The output shows the unique unit id (one of the candidate units in the cluster), the sample number of the wave le from which it has been extracted, and the start and end times (in seconds) of the unit in the le. The waveforms of the two units in the above example are shown in gures 5.2 and 5.3. The unit

/a/ can be seen in the Praat window as the highlighted portion in gure 5.2.

This

happens to be from the beginning to the end. The blue vertical lines at the bottom part of the gure are the actual boundaries of the phonemes marked by manual segmentation. In

/L/, shown in gure 5.3, the unit extends from the middle of the previous phoneme (which is /a/ in this case) to the end of the present phoneme /L/. Hence, the actual phoneme /L/ is just a part of the selected unit, which has a portion of /a/ plus the complete /L/. The unit so selected is similar to a diphone, except that it is not from middle the case of unit

of one phoneme to middle of the next phoneme. Also, the number of occurrences is large as compared to the single occurrence of a diphone in a database.

/a/ and /L/. Since /L/ has some beginning /a/, the continuity from the rst unit to the next is assumed to be maintained. Of course, this depends on how similar the two /a/'s are. The same phoneme can be spectrally In the example shown, we have to concatenate

portion as

dierent from the one with a dierent context.

In that case, we cannot expect a smooth

transition from one unit to the other. It all depends on what units the unit selection module picks up for synthesis.

Figure 5.2:

Waveform for unit /a/ (the selected portion) as seen in Praat

5.6 Joining the Units

Figure 5.3:

29

Waveform for unit /L/ (the selected portion) as seen in Praat

5.6 Joining the Units To join the consecutive candidate units selected from clusters given by the decision trees, we use an optimal coupling technique [9] to measure the concatenation costs between two units. We take the last three frames of the previous unit and the rst three frames of the current unit, and then nd the acoustic distance between these frames taken two at a time. This gives a total of 9 distances for a pair of adjacent units.

The pair of frames (called

join frames), which give the minimum distance, are taken and they are overlap added to concatenate. The frames in the rst unit to the right of the join frame are removed before concatenation.

Similarly, the frames in the second unit to the left of the join frame are

removed. This technique oers two results: the cost of a join and a position for the join. Allowing the join point to move is particularly important when our units are phones:

initial unit

boundaries are on phone-phone boundaries which probably are the least stable part of the signal. Optimal coupling allows us to select more stable positions towards the center of the phone. If the previous phone in the database is of the same type as the selected phone, we use a search region that extends 50% into the previous phone, otherwise the search region is dened to be within the phone boundaries of the current phone. In the example used for illustrating units in Festival, given in section 5.5, the previous phoneme was

/a/, and it has

/L/. The previous phone to /L/, in the database, is /a/, and hence the search for the join point extends to some portion of this previous phone to be concatenated to the current phone

5.7 Post Processing

30

(see gure 5.3). The actual measure of join cost is a frame based Euclidean distance. The frame information includes F0, Mel frequency cepstrum coecients (MFCC), and power. Although this uses the same parameters as used in the acoustic measure used in clustering, now it is necessary to weigh the F0 parameter to deter discontinuity of local F0, which can be particularly distracting in synthesized examples.

5.7 Post Processing Theoretically, the process of synthesizing a waveform for a given input text is complete with joining the units obtained by unit selection, as discussed in section 5.6. But this does not always guarantee a good quality speech output. In our project, where we use phones as unit, this is more so. Since the number of joining points is high owing to the use of phonemes, so will be the number of discontinuities. As pointed out earlier, there could be spectral and pitch discontinuities. In the literature, there are many studies which mainly concentrated on smoothing out dierent types of discontinuities that arise in synthesized speech. Karli [13] suggests a new method called the Prototype Waveform Interpolation (PWI) to reduce the mismatches at the concatenation points, and applies it for Turkish language where he used diphone database. Similar attempts have been made to reduce the spectral discontinuities and the paper by Phuay Hui Low

et al.

is one of them. In their work [6], they have done

a pole analysis procedure to estimate the formant frequencies, bandwidths, spectrum shape and dynamics. The obtained formant tracks are then used for formant smoothing purposes in TTS synthesis. In this project, we have attempted pitch smoothing on the synthetic speech as a post processing technique. For this, pitch has been found out and then pitch modication algorithm has been applied to smooth out the pitch contour. The pitch smoothening technique has been presented as a separate section below, where we rst deal with pitch detection, and then move on to pitch modication and smoothening, in that order.

5.8 Reducing Pitch Discontinuities

Listening experiments on sentences synthesized using the developed TTS showed audible pitch discontinuities spread across the whole waveform. A look at the pitch contour of the speech output (see figure 5.4) revealed the correspondence between the audible pitch discontinuities and the visible jumps in the contour. From the pitch contour, there appear to be at least two ways in which the smoothing can be done. One approach is to smooth the pitch contour around the point of discontinuity by curve-fitting. The other is to shift the whole pitch contour of a unit to match the pitch of the unit to its left. The second method may require severe pitch modification by a large scale factor on every unit. In addition, it requires more computation, because the pitch of almost every unit has to be modified to obtain a continuous pitch contour. In view of the above, we adopted the first approach: pitch smoothing by curve-fitting. Below, we discuss how the pitch is estimated and how the pitch smoothening is done.


Figure 5.4: Synthesized speech: portion of the waveform and its pitch contour. Encircled are the abrupt pitch discontinuities

5.8.1 Pitch Estimation

Pitch estimation is an age-old problem that researchers have long been trying to solve. A wide variety of algorithms has been proposed, both in the time and spectral domains, but none can claim to be accurate in every sense, in all cases. Nevertheless, there are a few methods that are simple yet give good enough performance in finding the pitch of clean speech. We discuss two such algorithms, one in the time domain and one in the frequency domain, in the next two sections.

5.8.2 Autocorrelation Method

This is perhaps the simplest and most basic method of pitch detection, and it works very well for highly periodic signals. Hence this method is best suited for finding the pitch of vowels. The goal of the autocorrelation routine is to find the similarity between the signal and a shifted version of itself. The mathematical definition of the autocorrelation function is

$$y(n) = \sum_{k=1}^{M} u(k)\, u(k+n)$$

If the signal is periodic, the autocorrelation function $y(n)$ also will be, and if the signal is harmonic, the autocorrelation function will have peaks at lags that are multiples of the fundamental period. This technique is most efficient at mid to low frequencies, and thus it has been popular in speech recognition applications where the pitch range is limited.

Figure 5.5: (a) Waveform of a segment of the vowel /a/ (b) The autocorrelation function of the signal

The pitch period of a signal is the distance between successive peaks. Instead of measuring the distance between the peaks in the time domain, we measure it in the autocorrelation domain. The autocorrelation function has peaks that correspond to the peaks in the signal. A peak-picking algorithm is used to pick the major peaks, and it should ensure that it does not stop at the spurious peaks that may occur between two major peaks. This phenomenon can be seen in figure 5.5, where we take the autocorrelation of a segment of the vowel /a/. The peak-picking algorithm searches for a peak in a predefined region after the highest peak at the zero-lag position. If the search region is too small, we may end up spotting the spurious peaks, which are local maxima in that region. If we have prior knowledge of the speaker's pitch range, we can put a threshold on the minimum lag region in which the search should occur. If this problem is not taken care of, it leads to pitch doubling: the estimated pitch will be double the original pitch.
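As a concrete illustration of the above, the short Python sketch below estimates the pitch of a voiced frame by autocorrelation, restricting the peak search to a lag range derived from an assumed pitch range (60-400 Hz here); these numbers are example values, not values fixed by this report.

import numpy as np

def autocorr_pitch(frame, fs, f0_min=60.0, f0_max=400.0):
    # Estimate F0 by picking the highest autocorrelation peak inside a
    # restricted lag range.  Restricting the search to lags corresponding
    # to [f0_min, f0_max] is what guards against the spurious peaks and
    # the pitch doubling discussed above.
    frame = frame - np.mean(frame)
    acf = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    if acf[0] <= 0:
        return None                        # silent frame, no pitch
    acf = acf / acf[0]                     # normalize so acf[0] == 1
    lag_min = int(fs / f0_max)             # shortest plausible period
    lag_max = min(int(fs / f0_min), len(acf) - 1)
    if lag_max <= lag_min:
        return None                        # frame too short for this range
    peak_lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return fs / peak_lag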

5.8.3 Harmonic Product Spectrum Method

This method belongs to the class of frequency-domain pitch detection algorithms. The typical analysis consists of breaking the signal into small frames, multiplying each frame by a window, and taking its short-time Fourier transform (STFT). If the signal is periodic, the Fourier transform shows peaks at multiples of the fundamental frequency, and the present algorithm [14] describes a way of finding which peak corresponds to the fundamental. One problem with this approach is that the STFT divides the audio bandwidth into a set of equally spaced frequency channels, or bins, where each channel is a fixed number of Hz apart from its neighbors. Since human pitch perception is essentially logarithmic, this means that low pitches may be tracked less accurately than high pitches.


The HPS algorithm measures the maximum coincidence of harmonics, according to equation 5.1, for each spectral frame $X(\omega)$:

$$Y(\omega) = \log\!\left(\prod_{r=1}^{R} |X(r\omega)|\right) \quad \text{or} \quad Y(\omega) = \sum_{r=1}^{R} \log|X(r\omega)| \qquad (5.1)$$

$$\hat{Y} = \max_{\omega_i} Y(\omega_i) \qquad (5.2)$$

where $R$ is the number of harmonics to be considered ($R = 5$, for example), and the frequency $\omega_i$ is in the range of possible fundamental frequencies. The resulting periodic correlation array, $Y(\omega_i)$, is searched for its maximum value, $\hat{Y}$, as shown in equation 5.2.

Octave errors are a common problem in pitch measurements from HPS. Almost always in these error cases, the pitch is detected one octave too high. To correct this error, post-processing should be done with the following rule: IF there is a second peak at approximately 1/2 of the initially chosen pitch, AND the ratio of the two peak amplitudes is above a threshold (e.g., 0.2 for 5 harmonics), THEN select the lower-octave peak as the pitch for the current frame.

Figure 5.6: Implementation of the HPS algorithm

Figure 5.6 demonstrates the HPS algorithm graphically. We take a windowed frame of the speech signal and compute its FFT. To enhance the low-frequency resolution, we take a 2048-point FFT of the frame by zero padding. Then we take 4 downsampled versions of the spectrum and add their log-absolute values to get $Y(\omega_i)$; by taking the log, we convert the multiplications into additions. The function $Y(\omega_i)$ is shown on the far right, with the maximum value, $\hat{Y}$, indicating the most likely pitch for the analysis frame. Some nice features of the method are that it is inexpensive, immune to additive and multiplicative noise, and adjustable to different kinds of inputs (by changing the number of harmonics to consider, or by replacing multiplication with addition).
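A minimal Python sketch of this computation is given below. The 2048-point zero-padded FFT and R = 5 harmonics follow the text; the Hamming window, the search range and the form of the octave check (a simplified version of the rule given earlier, with the quoted 0.2 threshold) are assumptions of the sketch.

import numpy as np

def hps_pitch(frame, fs, nfft=2048, R=5, f0_min=60.0, f0_max=400.0):
    # Harmonic Product Spectrum estimate for one frame: the log-magnitude
    # spectrum is summed with its downsampled copies (decimation by 2..R),
    # so harmonic energy piles up at the bin of the fundamental.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=nfft))
    logspec = np.log(spec + 1e-12)

    Y = logspec.copy()
    for r in range(2, R + 1):              # add the downsampled log spectra
        Y[:len(logspec[::r])] += logspec[::r]

    k_min = int(f0_min * nfft / fs)        # restrict the peak search to the
    k_max = int(f0_max * nfft / fs)        # range of plausible fundamentals
    k_hat = k_min + np.argmax(Y[k_min:k_max + 1])

    # simplified octave-error check: prefer the bin near half the chosen
    # frequency if it is comparably strong (ratio taken in the product
    # domain, i.e. exp of the log difference; 0.2 is the quoted threshold)
    k_half = k_hat // 2
    if k_half >= k_min and np.exp(Y[k_half] - Y[k_hat]) > 0.2:
        k_hat = k_half

    return k_hat * fs / nfft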


Figure 5.7: Spectrum of the signal shown in figure 5.5 (a) and its downsampled versions. The topmost curve is the sum of all these spectra

An equivalent implementation of the HPS algorithm uses a frequency histogram. In this approach, the fundamental frequency is determined by measuring the frequencies of its higher harmonic components and computing the greatest common divisor (GCD) of these harmonic frequencies. The GCD can be determined by making an entry in a frequency histogram for each harmonic frequency and at integer divisions of that harmonic frequency. The frequency at the peak of the histogram represents the greatest common divisor, and hence the fundamental frequency. A minimal sketch of this variant is given below.
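The sketch below is purely illustrative of one way such a histogram could be built; the peak-picking rule, the bin width and the number of integer divisions per harmonic are all assumed values, not values taken from this report.

import numpy as np

def histogram_pitch(frame, fs, nfft=2048, n_div=5,
                    f0_min=60.0, f0_max=400.0, bin_hz=5.0):
    # Estimate F0 as the approximate GCD of the harmonic frequencies:
    # each strong spectral peak at frequency f votes at f, f/2, ..., f/n_div
    # in a coarse frequency histogram, and the histogram maximum inside
    # [f0_min, f0_max] is taken as the fundamental.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=nfft))
    freqs = np.arange(len(spec)) * fs / nfft

    # crude peak picking: local maxima above 10% of the global maximum
    peaks = [freqs[k] for k in range(1, len(spec) - 1)
             if spec[k] > spec[k - 1] and spec[k] > spec[k + 1]
             and spec[k] > 0.1 * spec.max()]

    hist = np.zeros(int((f0_max - f0_min) / bin_hz) + 1)
    for f in peaks:
        for d in range(1, n_div + 1):
            cand = f / d
            if f0_min <= cand < f0_max:
                hist[int((cand - f0_min) // bin_hz)] += 1

    if hist.max() == 0:
        return None
    return f0_min + (np.argmax(hist) + 0.5) * bin_hz   # center of winning bin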

For the purpose of comparison, we have taken the same segment of the vowel /a/ shown in figure 5.5. The spectrum of the segment, obtained using a 2048-point FFT, is shown in figure 5.7. The figure also shows the 4 downsampled versions of the spectrum in dashed lines. The topmost curve in the figure represents the function $Y(\omega_i)$. In this case, the peak occurs at $k = 15$. This corresponds to $kF_s/N = 15 \times 16 \times 10^3 / 2048 \approx 117.2$ Hz, where $F_s$ is the sampling frequency and $N$ is the size of the FFT. The value of the pitch obtained from the autocorrelation method is around 111 Hz. Applying the above two algorithms to different types of voiced frames, we have calculated and compared the pitch values. The pitch values given by Praat have been taken as the benchmark, and the values obtained from the two methods were cross-checked against them. It was found that the autocorrelation method does fairly well for vowels, with some margin of error for other voiced frames, whereas the HPS algorithm matches the Praat pitch values more closely even for non-vowels. So the HPS algorithm has been adopted for finding the pitch during pitch modification.


5.9 Pitch Modification

As already pointed out in section 5.8, smoothing can be done in at least two ways. We follow the first method, in which the pitch contour is smoothed by adjusting the pitch values at any junction with an unacceptable pitch discontinuity.

Figure 5.8: Portion of a pitch contour where smoothening has been done by curve-fitting

The list of units to be concatenated is given by the unit selection module, as mentioned in section 5.5. The units are taken one at a time and divided into frames, and the F0 values of the voiced frames are found using the HPS algorithm. Although the label for each unit is available to us, we cannot use it to decide whether a frame is voiced or not. This is because a unit may contain a phoneme and a portion of the previous phoneme (see section 5.5), so a unit may have some voiced and some unvoiced frames. Hence, to decide the voicing of a frame, we use a threshold on the energy: any frame with energy higher than the threshold is declared voiced; otherwise, it is unvoiced. For deciding whether to do a pitch modification at a join, we use the following criterion: if the difference between the pitch values of the current frame and the next frame is more than some fixed percentage of the current frame's pitch, then we proceed with pitch modification. We also ensure that the two frames do not belong to the same unit, or else we would be modifying the natural pitch contour.

Let the $k^{th}$ frame and the $(k+1)^{th}$ frame have a pitch discontinuity between them. Then we take the pitch values of the four consecutive frames $(k-1)^{th}$, $k^{th}$, $(k+1)^{th}$ and $(k+2)^{th}$, and fit a polynomial of degree 2 to the four points obtained. The curve has an equation of the form

$$y = a_0 + a_1 x + a_2 x^2$$

For $x$ = 1, 2, 3 and 4, we know the values of $y$ (F0). By curve-fitting, we find the coefficients $a_i$. To get the new pitch values on this smooth curve, we substitute the values of $x$ in the above equation. Figure 5.8 shows the original pitch contour and the smoothened curve that best fits the four original pitch values. We now have four new pitch values corresponding to the frames taken at the beginning. The next step is to apply the pitch modification algorithm to these frames to change their pitch values to the respective new ones.

Coming to the method of pitch modification, two cases arise: the original pitch may have to be increased or decreased, as needed. Here we define a term called the pitch scale factor (PSF), which is the ratio of the original F0 to the modified (new) F0. Depending on whether the PSF is greater or less than 1, the algorithm takes two different paths. The two cases are discussed below.
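Before turning to the two cases, the join-detection criterion and the quadratic fit described above can be put together in a short Python sketch; the 15% discontinuity threshold, the voicing convention (F0 = 0 for unvoiced frames) and the variable names are assumptions, not values prescribed by this report.

import numpy as np

def smooth_pitch_at_joins(f0, unit_id, rel_threshold=0.15):
    # f0      : per-frame pitch values in Hz (0 for unvoiced frames)
    # unit_id : same length; identifies the unit each frame came from
    # A join between frames k and k+1 is smoothed when the surrounding
    # frames are voiced, the two frames come from different units, and the
    # pitch jump exceeds rel_threshold (assumed 15%) of the current pitch.
    f0 = np.asarray(f0, dtype=float).copy()
    for k in range(1, len(f0) - 2):
        if unit_id[k] == unit_id[k + 1]:
            continue                       # same unit: keep natural contour
        window = f0[k - 1:k + 3]           # frames (k-1), k, (k+1), (k+2)
        if np.any(window <= 0):
            continue                       # need four voiced frames
        if abs(f0[k + 1] - f0[k]) <= rel_threshold * f0[k]:
            continue                       # jump small enough to leave alone
        x = np.array([1.0, 2.0, 3.0, 4.0])
        a2, a1, a0 = np.polyfit(x, window, deg=2)   # least-squares fit of
        f0[k - 1:k + 3] = a0 + a1 * x + a2 * x**2   # y = a0 + a1*x + a2*x^2
    return f0

The four smoothed values returned here are the targets to which the pitch modification algorithm of the following subsections brings the corresponding frames.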

PSF is greater than 1

Since PSF > 1, the pitch has to be reduced from its original value. Theoretically, pitch is the number of pitch periods per second; if the pitch needs to be reduced, the number of periods per second should decrease. Suppose the frame has $m$ periods in a duration of $N$ samples. If the modified frame is such that it has the same $m$ periods in a larger time duration, then the effective number of periods per second will have reduced, i.e., its pitch will have reduced. To keep the length of the modified frame the same as the original, we take only the first $N$ samples.

Figure 5.9: Illustration of pitch modification for PSF > 1: (a) Original frame (b) Pitch-scaled version (c) Time-scaled version of (b)


Figure 5.10: Illustration of pitch modification for PSF < 1

To increase the frame duration while keeping the number of pitch periods the same, we interpolate the frame by inserting zeros at regular intervals. The number of zeros to be inserted is determined by the PSF: the new frame length $N_m$ is equal to the product of PSF and $N$. Since PSF > 1 in this case, $N_m$ is greater than $N$, and the number of zeros to be added is the difference between $N_m$ and $N$. By simple linear interpolation, we can find the sample values at the indices where zeros have been inserted. Once interpolation is done, we bring the length of the frame back to its original size by taking the first $N$ samples. The modified frame thus has a pitch equal to the original pitch divided by the PSF, and a length the same as the original. Figure 5.9 shows how the original frame is modified to finally transform into a pitch-modified one, for PSF > 1. Figure 5.9 (b) shows the pitch-scaled waveform, but with a larger duration. By taking only the first $N$ samples, we get the time-scaled version with duration equal to the original, as shown in figure 5.9 (c). The original frame considered in this example had a pitch of 117 Hz (calculated using the HPS method). The new pitch value specified is 80 Hz, hence the PSF is 117/80 = 1.46. The pitch calculated for the pitch-modified signal turned out to be 77.5 Hz, which is quite close to the desired value.
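A minimal Python sketch of this procedure for PSF > 1 is given below. It uses numpy's linear interpolation to realise the stretch (which is equivalent to inserting zeros at regular intervals and interpolating their values); the report itself gives no code, so the function name and interface are assumptions.

import numpy as np

def lower_pitch(frame, psf):
    # Pitch modification for PSF > 1 (pitch lowering).  The frame is
    # stretched from N to Nm = round(PSF * N) samples by linear
    # interpolation, and the first N samples are kept so that the frame
    # length stays unchanged while the pitch drops to (original / PSF).
    n = len(frame)
    nm = int(round(psf * n))                  # stretched length Nm
    old_idx = np.arange(n)
    new_idx = np.linspace(0, n - 1, nm)       # Nm positions across the frame
    stretched = np.interp(new_idx, old_idx, frame)
    return stretched[:n]                      # keep the original length N

For the example quoted above (original pitch about 117 Hz, target 80 Hz), one would call lower_pitch(frame, 117/80.0), and a pitch measurement on the output with the HPS routine should fall near the target, as observed in the text.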

PSF is less than 1

For values of PSF less than 1, the pitch has to be increased relative to its original value. By an argument similar to the one given above, if the modified frame has the same $m$ periods in a smaller time duration, then the effective number of periods per second will have increased, i.e., its pitch will have increased. But to keep the length of the modified frame the same as the original, we extract a pitch period from the frame and replicate it. To decrease the frame duration while keeping the number of pitch periods the same, we decimate the frame by removing samples at regular intervals. The number of samples to be removed is determined by the PSF: the new frame length $N_m$ is equal to the product of PSF and $N$. Since PSF < 1 in this case, $N_m$ is less than $N$, and the number of samples to be removed is the difference between $N$ and $N_m$. Once decimation is done, we bring the length