Survey of Data-Driven Approaches to Speech Synthesis

Kenney Ng
Spoken Language Systems Group
Laboratory for Computer Science
Massachusetts Institute of Technology
October 13, 1998

Contents

1 Introduction
2 Speech Synthesis System Overview
3 Natural Language Processing Subsystem
  3.1 Text to Phonetic Units
    3.1.1 Text Normalization
    3.1.2 Word Pronunciation
  3.2 Text to Prosodic Parameters
    3.2.1 Accenting
    3.2.2 Phrasing
    3.2.3 Duration
    3.2.4 Intonation
    3.2.5 Energy
    3.2.6 Glottal Source
4 Speech Signal Processing Subsystem
  4.1 Source-Filter Model of Speech Production
  4.2 Articulatory Synthesis
  4.3 Formant Synthesis
  4.4 Concatenative Synthesis
5 Data-driven Approaches to Concatenative Speech Synthesis
  5.1 Components of a Concatenative Speech Synthesizer
  5.2 Comparison of Approaches to Concatenative Synthesis
    5.2.1 Speech Corpus
    5.2.2 Speech Segmentation and Labeling
    5.2.3 Unit Inventory Design
    5.2.4 Speech Coding and Storage
    5.2.5 Unit Selection
    5.2.6 Unit Concatenation
    5.2.7 Prosodic Signal Processing
6 Discussion and Conclusions
References

1 Introduction

Speech synthesis can be viewed as the automatic transformation of arbitrary or unrestricted natural language sentences from their text form into their spoken form. It is the machine analog to what a human does when he reads aloud. There are many useful applications of automatic text-to-speech conversion, including talking aids for the vocally handicapped, reading aids for the visually impaired, and computer-to-human communication over the telephone or in "eyes-busy" situations.

In order to perform speech synthesis, many different speech, signal, and natural language processing problems need to be addressed. When we consider what a human does when he reads aloud, it becomes clear how complex and difficult the task of automatic speech synthesis really is. In addition to correctly pronouncing all of the words, the reader must also use the appropriate phrasing, intonation, accenting, and rhythm for his speech to sound natural. Since much of this information is not explicitly specified in the written form of a language, the reader must reconstruct it by using his grammatical and world knowledge to understand the content of the text he is reading. Since computers are still far from being able to achieve this level of understanding, only approximate and incomplete algorithms are available to model this complex process. As a result, despite much progress in the field, completely natural speech synthesis is still an elusive goal.

Research in speech synthesis has had a long history. Comprehensive reviews can be found in Klatt's 1987 JASA article [15], which covers work up to the late 1980s, and in Dutoit's 1997 book [6], which includes more recent work done in the last decade. Many of the earlier approaches to speech synthesis were based on knowledge-engineered rules derived from linguistic theories and acoustic analyses. These systems were developed and improved by iteratively analyzing the characteristics of natural speech and then using that knowledge to develop more precise rules to control the behavior of the synthesizer. The results were complex expert systems that were able to synthesize speech well but were not conducive to technology or knowledge transfer and, as a result, were difficult for other researchers to replicate in alternate systems.

More recently, in the last decade or so, there has been a resurgence of work in speech synthesis with an emphasis on using data-driven and machine-learning methods. In these approaches, speech corpora annotated with linguistic information such as acoustic, phonetic, prosodic, and syntactic labels are used as the basis for statistical modeling. The computational models attempt to capture the relationship between the annotated class labels and the acoustic, spectral, and prosodic features derived directly from the speech signal. Model parameters are estimated automatically from the training data by optimizing some well-defined objective function, and performance is quantified by evaluating the trained models on new test data. Because these techniques have well-specified training and testing procedures, they are easier to describe and are more amenable to technology transfer. Their use has been adopted by many different research groups and has facilitated the development of synthesis systems.

There are several reasons for this trend toward using data-driven and machine-learning methods. One is the dramatic increase in storage capacity and computing power, which has enabled the use of more data- and compute-intensive methods that were not previously feasible.
Another is the development and widespread availability of large standardized text and speech corpora. Much larger amounts of data can now be examined than was previously possible by hand, which has encouraged the use of statistical modeling approaches. In addition, common training and test corpora allow more objective comparisons of different approaches and systems. A third reason is the successful development and application of data-driven and machine-learning methods in related fields such as continuous speech recognition. It is only natural to try to adapt and make use of some of these techniques in speech synthesis. There are also pragmatic reasons for this trend. From an engineering point of view, the use of data-driven algorithms has reduced the time needed to build a speech synthesizer, to improve its performance, and to port it to new databases and languages.

Although data-driven machine-learning methods are very important and useful, we need to temper our optimism about them when we consider that even the large size of current text and speech corpora (billions of words and hundreds of hours of speech, respectively) is minuscule in comparison to the combinatorial complexity of a natural language; no amount of data can provide adequate coverage for all the possible phenomena in a language. A good compromise, therefore, is to use knowledge-based approaches to constrain the structure of the models and then to use data-driven and machine-learning methods to refine the models and to estimate their parameters from large training corpora.

In this paper, we examine the use of data-driven and machine-learning approaches for speech synthesis. We begin with an overview of a complete speech synthesis system, which can be divided into two main subsystems: a natural language processing subsystem and a speech signal processing subsystem. We then describe, in turn, the major components within the two subsystems and examine some of the different approaches that have been explored. In the main portion of the paper, we focus on the speech signal processing subsystem. Here, we describe, compare, and analyze the use of different data-driven and machine-learning approaches as exemplified by appropriate components of the following three speech synthesis systems: Bell Labs' synthesizer, Microsoft's Whistler, and ATR's Chatr. The Bell Labs system can be characterized as a more traditional diphone-based approach to concatenative synthesis, while the Whistler and Chatr systems are representative of the more recent trend toward increasing use of data-driven and machine-learning methods. We finally conclude with some discussion and observations. Many references are used throughout the paper, but the primary references used for the three speech synthesis systems are the following:

Bell Labs

• J. Olive, J. van Santen, B. Mobius, and C. Shih, "Chapter 7 - Synthesis," in Multilingual Text-to-Speech Synthesis: The Bell Labs Approach (R. Sproat, ed.), pp. 191-228, Kluwer Academic Publishers, 1998.

Microsoft Whistler

• H. Hon, A. Acero, X. Huang, J. Liu, and M. Plumpe, "Automatic generation of synthesis units for trainable text-to-speech systems," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Seattle, WA, vol. 1, pp. 293-296, May 1998.

• X. Huang, A. Acero, H. Hon, Y. Ju, J. Liu, S. Meredith, and M. Plumpe, "Recent improvements on Microsoft's trainable text-to-speech system - Whistler," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, pp. 959-962, May 1997.

ATR Chatr

• N. Campbell and A. Black, "Prosody and the selection of source units for concatenative synthesis," in Progress in Speech Synthesis (J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, eds.), pp. 279-292, New York: Springer, 1997.

Figure 1: The major subsystems of a speech synthesis system. (Blocks: Text, or alternatively Meaning, is converted by Natural Language Processing into Linguistic Information, which Speech Signal Processing converts into Speech; a parallel Image Processing block can produce accompanying Images.)

2 Speech Synthesis System Overview

A typical speech synthesis system can be divided into two major subsystems as illustrated in Figure 1. The first subsystem is primarily a natural language processing module and involves the conversion of the input text into a linguistic representation which includes both phonetic and prosodic information. The phonetic units specify what sounds need to be produced while the prosodic parameters specify how they are to be produced. The second subsystem is mainly a speech signal processing module and involves the generation of the output speech waveforms using as input the linguistic information generated by the first subsystem.

Although it is beyond the scope of this paper, it is interesting to note that the basic speech synthesis problem can be generalized by extending it at both the input and output ends. Instead of starting from written text, one can begin with a meaning representation [31]. This is analogous to a human verbally expressing a concept or thought instead of reading aloud. This scenario is relevant in applications where the computer is responsible for creating the information that it wants to communicate and therefore, in some sense, "understands" what it wants to say. Examples include human-machine dialog, information access, and machine translation systems. By starting from a meaning representation, the phonetic and prosodic information can be created directly using spoken language generation techniques, and many of the natural language processing steps can be bypassed since we no longer have to derive the information from text. On the output side, the synthesizer can generate, in addition to speech waveforms, the corresponding visual images in order to create a talking head [7,17]. This additional mode of output can help improve a listener's understanding of the synthesized speech and is useful in certain multimedia applications. To do this, another module that parallels the speech signal processing module is needed that takes as input the phonetic and prosodic information and outputs the relevant visual image sequence.

In the following two sections, brief descriptions of the different components within the two main subsystems, the natural language processing module and the speech signal processing module, will be presented. The goal is to point out the many different interdisciplinary problems that have to be addressed in order to perform the complex task of speech synthesis and to mention the different approaches that have been taken to address these problems, including the use of automatic data-driven methods.

Figure 2: Block diagram of the components in the natural language processing subsystem. (Text to Phonetic Units: Text Normalization and Word Pronunciation produce the Phonetic Units. Text to Prosodic Parameters: Accenting, Phrasing, Duration, Intonation, Energy, and Source modules produce the Prosodic Parameters.)

3 Natural Language Processing Subsystem

As illustrated in Figure 2, the task of converting the input text into a linguistic representation can be further partitioned into two components: the transformation of text to phonetic units and the conversion of text to prosodic parameters. As mentioned before, the phonetic units specify what sounds need to be produced while the prosodic parameters specify how they are to be produced. In order to have the flexibility to synthesize speech from arbitrary text, including new words and sentences, the synthesis units must be subword-based since the number of possible words is virtually unlimited. Because there are over 10,000 syllables in English, much smaller units such as phones, dyads (phone pairs), and diphones (the transition between two phones) have typically been used. The process of converting the input text into a stream of phonetic units is still needed regardless of which phone-based representation is used for synthesis. The phonetic units serve as an intermediate representation between the orthography (words) and the particular synthesis units used. They facilitate the modularization of the synthesis system into natural language processing and speech signal processing components. The mapping from the phonetic units to the appropriate synthesis units is taken care of in the speech signal processing stage.

Prosody is a term used to describe the metrical structure of speech, which includes the perceived rhythm, stress, and pitch of the speech. The physical correlates are, respectively, the duration, energy, and fundamental frequency of the speech. The pattern or trajectory of these parameters as a function of time carries linguistically significant information and is the key to producing natural-sounding speech. As observed in [15], intensity contributes to the perceived stress and syllabic structure in speech; duration affects the rhythm, stress, emphasis, and syntactic structure of the utterance; and the fundamental frequency conveys information about the intonation, stress, emphasis, gender, and emotional state of the speaker. The task of the text to prosodic parameters component is to generate a time-varying trajectory of these prosodic parameters.

Although it is convenient to partition the subsystem into these two components, we should keep in mind that there exists some overlap between them. For example, the prosodic process of breaking up an input sentence into phrases also impacts the sequence of phonetic units because pauses need to be inserted at the phrase boundaries. Similarly, the prosodic process of accenting can affect the identity of the phonetic unit in some words: a vowel may be reduced to a schwa in an unstressed environment. We now examine the text to phonetic units conversion and the text to prosodic parameters conversion components in more detail.

3.1 Text to Phonetic Units

3.1.1 Text Normalization

The goal in the text normalization component is to transform the raw input text stream into a regularized format that can be processed by the rest of the system. This includes breaking up the input into sentences, tokenizing into words, expanding numbers (e.g., dates, times, years, telephone numbers, and currency), dealing with abbreviations (e.g., "St." can be either "Saint" or "Street"), and possibly tagging words with their part-of-speech labels to help with later pronunciation and prosody processing. Many approaches to text normalization are based on heuristics and rules [1]. For example, punctuation, capitalization, and white space can be used as cues for detecting sentence and word boundaries in many languages; numeral and abbreviation expansions can be disambiguated via a series of heuristics that examine the surrounding context. Data-driven methods have also been successfully used for various text normalization tasks. For example, in languages such as Chinese where words are not delimited by spaces, probabilistic models have been used to find the most likely sequence of words in the input given probability estimates of the individual words in all possible analyses of the input [27]; these word probabilities are derived from machine analysis of large amounts of labeled text training data. Automatic data-driven statistical techniques have also been applied to other tasks such as part-of-speech tagging [27].
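
To make the flavor of these heuristics concrete, the following sketch shows a toy rule-based normalizer in Python. The abbreviation table, the capitalization test used to disambiguate "St.", and the restriction to numbers below 100 are illustrative assumptions made for this survey, not the rules of any particular system.

    import re

    # Toy expansion tables (illustrative only).
    ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen"]
    TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
            "eighty", "ninety"]

    def number_to_words(n):
        """Spell out integers 0-99; larger numbers are left unchanged."""
        if n < 20:
            return ONES[n]
        if n < 100:
            return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])
        return str(n)

    def expand_token(token, next_token):
        # Disambiguate "St." using the right context: a following capitalized
        # word suggests "Saint" (e.g., "St. Louis"); otherwise assume "Street".
        if token == "St.":
            return "Saint" if next_token[:1].isupper() else "Street"
        if token == "Dr.":
            return "Doctor" if next_token[:1].isupper() else "Drive"
        if token.isdigit():
            return number_to_words(int(token))
        return token

    def normalize(text):
        tokens = re.findall(r"[A-Za-z]+\.?|\d+|[^\sA-Za-z\d]", text)
        out = []
        for i, tok in enumerate(tokens):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            out.append(expand_token(tok, nxt))
        return " ".join(out)

    print(normalize("Meet Dr. Smith at 12 Main St."))
    # -> "Meet Doctor Smith at twelve Main Street"

A real normalizer would also handle dates, currency, and sentence segmentation, and would typically combine such rules with statistical disambiguation as described above.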

3.1.2 Word Pronunciation

Once the sequence of words has been specified by the normalization procedure, the next step is to determine their pronunciations. The most successful approaches to date have been rule-based. The simplest approach is to use a set of letter-to-sound rules to map grapheme sequences to phonetic sequences [15]. In languages such as Spanish, where there is a strong correlation between the orthography and phonology, this approach works well. In languages like English, with more complex relationships between the spelling and pronunciation, the addition of pronunciation dictionaries containing entries for words which are exceptions to the rules has been effective [15]. Another approach is based on morphological analysis [1]. A word is first decomposed into its morphemes, the minimal meaningful units of a language such as prefixes, roots, and suffixes; next, morpheme pronunciations are determined using morpheme-to-sound rules and morpheme pronunciation dictionaries; finally, the pieces are combined using phonological rules to get the pronunciation of the whole word. Other methods include determining pronunciation by analogy to known words and disambiguating homographs by using part of speech and local/distant context information [15,27].

Data-driven approaches to word pronunciation have also been developed. For example, one approach uses a neural network classifier trained on a large pronunciation dictionary to model the transformation between letter sequences and phonetic sequences [26]; it takes as input a window of letters and predicts the phone corresponding to the middle letter. Another approach uses statistical decision trees trained on a large number of context-to-phone pairs derived from a pronunciation dictionary [18]. The idea is to model the phone distributions conditioned on context information such as the surrounding letters and the previously predicted phones. The resulting tree can then be used to predict the most likely pronunciation of a word given its spelling. In general, these and other data-driven methods have not been as successful as the rule-based ones. One reason is that many of these methods assume that all the information needed to pronounce a word is contained in its orthographic representation. The problem is that in many languages pronunciation is also dependent on lexical features (e.g., stress and part of speech) that are not specified in the spelling.
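
As a rough illustration of the windowed letter-to-phone idea, the sketch below builds a table from letter windows to their most frequent phone and backs off to the middle letter alone when a window is unseen. It assumes a toy lexicon in which letters and phones are already aligned one-to-one, which sidesteps the nontrivial alignment problem; the entries and phone symbols are invented.

    from collections import Counter, defaultdict

    # Toy lexicon with a one-to-one letter-to-phone alignment.
    # Real systems train on a large aligned pronunciation dictionary.
    LEXICON = [
        ("cat",  ["k", "ae", "t"]),
        ("cent", ["s", "eh", "n", "t"]),
        ("city", ["s", "ih", "t", "iy"]),
        ("cot",  ["k", "aa", "t"]),
    ]

    PAD = "#"  # padding symbol at word edges
    WIN = 1    # letters of context on each side

    def window(word, i, width):
        padded = PAD * width + word + PAD * width
        return padded[i:i + 2 * width + 1]   # window centered on letter i

    # Count which phone each letter window maps to.
    win_counts = defaultdict(Counter)
    letter_counts = defaultdict(Counter)
    for word, phones in LEXICON:
        for i, phone in enumerate(phones):
            win_counts[window(word, i, WIN)][phone] += 1
            letter_counts[word[i]][phone] += 1

    def predict(word):
        phones = []
        for i in range(len(word)):
            w = window(word, i, WIN)
            table = win_counts[w] if w in win_counts else letter_counts[word[i]]
            phones.append(table.most_common(1)[0][0])
        return phones

    print(predict("cit"))   # 'c' before 'i' -> "s", giving ["s", "ih", "t"]

A trained decision tree or neural network plays the same role as this lookup table, but generalizes to unseen windows instead of backing off to a single letter.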

3.2 Text to Prosodic Parameters

3.2.1 Accenting

In a spoken utterance, some of the words are accented while others are not. Accent assignment is generally based on the broad lexical category of the word and the intended meaning of the sentence. Content words (e.g., nouns, adjectives, and verbs) are typically accented while function words (e.g., prepositions and auxiliary verbs) are not. More complex problems occur in complex noun phrases or in situations where emphasis or contrast is desired. Accent information is used to help determine segment durations (see Section 3.2.3) and the intonation pattern of a sentence (see Section 3.2.4). Various approaches have been developed to perform word accenting. Most are based on first determining the part-of-speech category of the words and then using that information plus neighboring context to assign accents based on a set of rules. The rules can either be specified manually [1] or automatically discovered by analyzing a large amount of labeled text data. An example of the latter is the use of decision trees to predict accenting [27].

3.2.2 Phrasing

A long sentence is typically broken up into smaller phrasal units. These phrases are important for specifying prosodic properties since there are typically pauses at phrase boundaries and the fundamental frequency and energy contours are usually reset at the beginning of a new phrase. The task here is to automatically assign phrase boundaries based on analyses of the input text. The simplest approach is to use punctuation marks such as commas, semicolons, and periods as indicators of phrase boundary locations. However, a problem arises in long strings of words without punctuation. In these instances, other methods have been used, including keeping a list of words (function words and verbs) that are likely indicators of good places to break [15] and performing a syntactic parse of the sentence to discover the phrase and clause boundaries [1]. Data-driven methods have also been explored. In [27], a decision tree classifier is used to predict phrase boundary locations. For every word boundary, it takes as input features such as part-of-speech information around the boundary and the distance of the boundary from the edges of the sentence, and outputs a decision on whether or not there is a phrase boundary. The classifier is trained on a text corpus annotated with prosodic phrase-boundary information.

3.2.3 Duration

In addition to specifying which phones need to be produced, it is also necessary to determine how long to make each one. There are many factors that influence the duration of a phonetic segment. These include the identity of the phone itself, the characteristics of the neighboring phones, the accent status of the syllable containing the phone, its position within the phrase, and speaker characteristics such as speaking rate and dialect. Rule-based approaches that take into account some of these factors have been developed for predicting segment duration. In one approach, the duration of a segment is determined by successively applying a set of rules; each rule accounts for a particular factor and tries to change the segment duration by a percentage increase or decrease subject to a minimum duration constraint [1]. These rules are tuned by trial and error to match the observed characteristics of speech read by a single speaker. Many different rule systems have been developed, and even though each used slightly different approaches, all have been able to successfully predict the same phenomena [15].

Data-driven methods for predicting segment durations have also been developed. One approach uses tree classifiers to automatically cluster phone durations according to their context. Examples of context information include stress level, position in word, position in phrase, and broad class labels of neighboring phones [23]. Given a phone and its context information during synthesis, the tree can be used to estimate the phone's duration. Another approach uses a more constrained model that attempts to capture the interactions of various factors using a sum of a sequence of products of terms associated with each factor to compute an estimate of the duration [29]. Phones are clustered according to context, and different duration models are selected for different groups of phones by performing exploratory data analysis to discover the model whose predictions show a good fit to durations from a labeled speech corpus.
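
The sums-of-products idea can be made concrete with a small numerical sketch: a segment's duration is computed as a sum of products of factor-dependent terms. The factor names, parameter values, and the particular model form used here (a scaled intrinsic duration plus an additive lengthening term) are invented for illustration and are not the published Bell Labs parameters.

    # A toy sums-of-products duration model:
    #   duration = sum over products, where each product multiplies one
    #   parameter per factor level that is active for this segment.
    # All numbers below are made up, for illustration only.

    BASE_MS = {"aa": 120.0, "t": 60.0}          # intrinsic phone durations (ms)

    SCALE = {                                    # multiplicative terms, one per factor
        "accent":   {"accented": 1.25, "unaccented": 0.95},
        "position": {"phrase_final": 1.40, "medial": 1.00},
        "rate":     {"fast": 0.85, "normal": 1.00, "slow": 1.15},
    }

    LENGTHENING_MS = {"phrase_final": 15.0, "medial": 0.0}   # additive term

    def predict_duration(phone, accent, position, rate):
        product = BASE_MS[phone]
        for factor, level in (("accent", accent), ("position", position), ("rate", rate)):
            product *= SCALE[factor][level]
        # Sum of two products: the scaled intrinsic duration plus a
        # position-dependent lengthening term.
        return product + LENGTHENING_MS[position]

    print(predict_duration("aa", "accented", "phrase_final", "normal"))  # 225.0 ms

In the published approach, the model form and its parameters are chosen and estimated from a labeled speech corpus rather than set by hand as here.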

3.2.4 Intonation

The goal of this component is to generate a fundamental frequency (F0) contour for the sentence to be synthesized. Intonation is probably the most important prosodic factor since, as we have discussed before, it affects many of the perceived qualities in speech. This component uses the phonetic, accent, duration, and phrasing information created by the previous components. In many approaches, an intermediate prosodic structure (i.e., an intonation model) is used. The model is used to predict when F0 rises and falls and what level it will reach depending on syntactic structure, accenting, and sentence location. Based on this model, F0-time pairs are created and then smoothed to get the final trajectory.

Intonation models can be categorized into three types. One is production oriented and models the commands governing F0 generation. The commands are impulse and step functions and are associated with phrase and word accents, respectively. The F0 contour is the response of a smoothing filter to these commands [9,15]. The second type of model is perceptually motivated. An F0 contour is generated by concatenating a series of stylized natural F0 contours. Automatic procedures for determining the stylized contours based on perceptual criteria have been developed [19]. The third type of model is based on a sequence of pitch target values derived from a phonological representation and linked by interpolation functions [22].

More recently, data-driven statistical models such as neural networks and linear and regression trees have been developed to predict F0 contours based on input linguistic information without having to use an intermediate intonation model [25]. Hybrid approaches like the F0 contour generator in the Microsoft Whistler synthesizer have also been developed [13]. In this approach, a pair of vectors, S:P, is computed and stored for each phrase in the training corpora. The specification vector S contains, for each syllable in the phrase, an abstract description of the intonation computed using the intonation model in [22]. The pitch vector P contains, for the corresponding syllables, the actual F0 values. The sentences in the training corpora are selected to try to maximize coverage of the intonation patterns in natural speech. During synthesis, a specification vector, S(i), for the input sentence is first determined using the intonation model; then the most similar S vector from the training set is found using a dynamic programming search with an appropriate cost function; finally, the corresponding P vector is used to generate the F0 contour for the new sentence.
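
A heavily simplified sketch of this template-matching step is given below: stored phrases are kept as (specification, pitch) pairs, and synthesis picks the stored phrase whose specification best matches the input. For simplicity the comparison is restricted to phrases with the same number of syllables, so a plain per-syllable distance stands in for the dynamic programming alignment and cost function of the real system; all vectors and values are invented.

    import math

    # Each stored phrase: (specification vector S, pitch vector P),
    # one abstract feature tuple and one F0 value (Hz) per syllable.
    # The numbers here are fabricated for illustration.
    TRAINING = [
        ([(1, 0), (0, 0), (0, 1)], [180.0, 150.0, 110.0]),
        ([(1, 1), (0, 0), (1, 0)], [200.0, 160.0, 170.0]),
        ([(0, 0), (1, 0)],         [150.0, 120.0]),
    ]

    def spec_distance(s_a, s_b):
        """Sum of per-syllable Euclidean distances between equal-length specs."""
        return sum(math.dist(a, b) for a, b in zip(s_a, s_b))

    def select_pitch_contour(input_spec):
        # Consider only stored phrases with the same syllable count; a real
        # system would instead align phrases of different lengths.
        candidates = [(s, p) for s, p in TRAINING if len(s) == len(input_spec)]
        best_spec, best_pitch = min(
            candidates, key=lambda sp: spec_distance(sp[0], input_spec))
        return best_pitch

    # A new 3-syllable phrase with an accent on the first syllable.
    print(select_pitch_contour([(1, 0), (0, 0), (0, 0)]))  # -> [180.0, 150.0, 110.0]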

3.2.5 Energy

In addition to a fundamental frequency contour, a spoken utterance also has an energy contour. Certain phones are more intense than others, and the ends of phrases are weaker than the beginnings. However, it has been observed that simply using the normal segmental energies inherent in each phone combined with the fundamental frequency contour is sufficient to implicitly specify the energy contour [15]. The reason for this is that the perceived intensity of a sound increases as the fundamental frequency is raised; as a result, variation in intensity can be captured by changes in F0. In fact, if stressed vowel energies are explicitly increased, artificially strong vowels are produced [15].

Figure 3: Block diagram of the source-filter model of speech production. (A glottal pulse generator controlled by amplitude and F0, and a noise source, are combined under a voiced/unvoiced switch and passed through a filter specified by time-varying filter coefficients to produce speech.)

3.2.6 Glottal Source

In addition to the standard prosodic parameters of fundamental frequency, duration, and energy, other information related to the characteristics of the glottal source has also been found to be important in determining voice quality; these parameters include spectral tilt, open quotient, and aspiration noise [16]. The desire to improve the quality of synthesized speech has prompted work in developing more realistic models of the voice source. For example, a source model that can be dynamically controlled during synthesis using rules and whose parameters can be automatically estimated from natural speech has been developed and is used in the Bell Labs synthesizer [21].

4 Speech Signal Processing Subsystem

The speech signal processing subsystem takes as input the linguistic (i.e., phonetic and prosodic) information generated by the natural language processing subsystem, synthesizes the speech, and outputs the final speech waveforms. Approaches to this task can be categorized into two groups: those that attempt to model the speech production system and those that attempt to model the speech signal. Articulatory synthesis falls into the first category while formant synthesis and concatenative synthesis fall into the second. In this section, we will examine these three synthesis approaches. However, we first briefly describe the source-filter model of human speech production since it is the basis of many of the speech synthesis approaches, including articulatory and formant synthesis and some methods used in concatenative synthesis.

4.1 Source-Filter Model of Speech Production

According to the acoustic theory of speech production [8], speech can be viewed as the output of a linear filter excited by one or more sound sources as illustrated in Figure 3. The theory makes the simplifying assumption that the source is independent of the filter. There are typically two types of sound sources. One is voicing, which is caused by vibration of the vocal folds and is represented by a quasiperiodic train of glottal pulses. The other is turbulence noise, which is caused by a pressure difference across a constriction in the vocal tract and is represented by a randomly varying noise signal. The filter simulates the frequency response of the vocal tract and shapes the spectrum of the signal generated by the source.

This model has a variety of independent input controls. The noise source doesn't have a controlling input, but the glottal pulse generator is controlled by an F0 parameter which specifies the fundamental frequency of voicing. A mixer, controlled by voiced/unvoiced decisions, is then used to control the relative contributions from the pulse and noise sources. The signal is then scaled by an amplitude parameter to control the loudness and input to the vocal tract filter. A set of filter coefficients, which are varied slowly, typically updated every 5 to 10 milliseconds, is used to control the filter in order to shape the speech spectrum.
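
The control flow just described can be sketched as a short frame-by-frame synthesis loop: choose a pulse-train or noise excitation, scale it, and pass it through an all-pole vocal tract filter whose coefficients are updated every frame. The filter coefficients, F0 values, and frame length below are arbitrary placeholders rather than values from any system discussed in this survey.

    import numpy as np
    from scipy.signal import lfilter

    FS = 16000          # sample rate (Hz)
    FRAME = 80          # 5 ms frames at 16 kHz

    def excitation(voiced, f0, amp, n, phase):
        """Pulse train at f0 for voiced frames, white noise otherwise."""
        if voiced:
            x = np.zeros(n)
            period = int(FS / f0)
            while phase < n:
                x[phase] = 1.0
                phase += period
            return amp * x, phase - n        # carry pulse phase to next frame
        return amp * 0.1 * np.random.randn(n), 0

    def synthesize(frames):
        """frames: list of (voiced, f0, amp, a) with a = [1, a1, ..., aP].
        Assumes a fixed LPC order so the filter state shapes match across frames."""
        out, phase, state = [], 0, None
        for voiced, f0, amp, a in frames:
            x, phase = excitation(voiced, f0, amp, FRAME, phase)
            if state is None:
                state = np.zeros(len(a) - 1)          # start from silence
            # All-pole filter 1/A(z); carry filter state across frame boundaries.
            y, state = lfilter([1.0], a, x, zi=state)
            out.append(y)
        return np.concatenate(out)

    # Two made-up frames: a voiced frame followed by an unvoiced one.
    frames = [(True, 120.0, 1.0, [1.0, -0.9]), (False, 0.0, 1.0, [1.0, -0.5])]
    speech = synthesize(frames)
    print(speech.shape)   # (160,)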

A popular source-filter model of speech production is linear predictive coding (LPC). Linear prediction theory assumes that the current sample y[n] in a frame of speech can be predicted as a linear combination of the previous P samples plus a small error term e[n]:

    y[n] = -\sum_{i=1}^{P} a[i] \, y[n-i] + e[n]

where the a[i] are the linear prediction coefficients and P is the linear prediction order. The coefficients are automatically determined by minimizing the sum of the squared errors over the entire frame of speech under analysis. This is equivalent to matching the power spectrum of the all-pole filter defined by the LP coefficients to the spectrum of the speech signal. The filter therefore models the spectral envelope of the speech (i.e., the vocal tract), with the error signal capturing the harmonic structure and/or noise (i.e., the source). A simplified source model consisting of an impulse train for voiced speech and white noise for unvoiced speech is often used. More complex source models have been developed, including multi-pulse LPC, where the excitation signal consists of several pulses for each frame of speech analyzed; residual-excited LPC, where the error signal or residual is used as the excitation signal; and codebook-excited LPC, in which the excitation signal is selected from a finite set of quantized possibilities.

4.2 Articulatory Synthesis

Articulatory synthesizers generate speech by modeling the human speech production system. Typically, physical models of the articulators and vocal folds are used. These models are based on detailed descriptions of the physiology of speech production and the physics of sound generation in the vocal tract. Mathematical rules which take into account the dynamical constraints of the articulators are used to compute the trajectories of the model parameters, i.e., the positions and kinematics of the articulators, according to the sequence of sounds to be produced. Sound is generated from the models either by computing it using equations of physics or by converting the articulator models into a transfer function and the vocal fold model into an appropriate excitation signal. The latter technique is essentially a source-filter model of speech production.

Since articulatory synthesis directly models the way speech is produced by humans, it is, in principle, the correct way to perform speech synthesis. However, these approaches have not been as popular or as successful as those that model the speech signal, i.e., formant and concatenative synthesis. One reason is the difficulty in accurately modeling the complex articulator dynamics, due in part to the lack of sufficient data on articulator motions during speech production [15]. Another is that our understanding of the complex aerodynamics involved in speech production is incomplete. Finally, computer simulation of the physical models and the aerodynamic phenomena is computationally expensive and not practical in real-time speech synthesis systems.

4.3 Formant Synthesis

Formant synthesis is a source-filter model of speech production where the vocal tract and sound source models attempt to capture the main acoustic features of the speech signal. The vocal tract model tries to model speech spectra that are representative of the position and movements of the articulators. It is constructed from resonances that correspond to the formant frequencies of natural speech. Each formant is typically modeled using a two-pole resonator, which allows both the center frequency and bandwidth to be specified. The individual formants are then combined to create the vocal tract model. In the parallel formant synthesizer, the excitation signal is first scaled by an independently controlled gain parameter and then applied to each formant model in parallel to produce separate outputs which are then summed. In the cascade formant synthesizer, a scaled excitation signal is applied only to the first formant model in the sequence, with the output of each model then becoming the input to the next. The cascade model has been found to be better for non-nasal voiced sounds while the parallel model is more appropriate for nasals, fricatives, and stops. A hybrid approach using both parallel and cascade methods of combination has also been developed [15].

The source model has two components, one to model the glottal flow for voiced sounds and another to model the turbulence noise source for unvoiced sounds. Typical source model parameters include fundamental frequency and amplitude of voicing. Improvements to the approach include the addition of more resonances and anti-resonances to help with the synthesis of nasalized sounds and the development of a more complex source model that allows voice quality parameters such as spectral tilt, open quotient, and aspiration noise to be specified [15].

Formant synthesizers have the capability to produce very high quality speech that can be almost indistinguishable from human speech if the trajectories of the model parameters are correctly specified [10]. The challenging task, however, is to automatically determine the proper parameter trajectories. Both the source and filter models have typically been controlled by a set of phonetic rules that take into account prosodic and coarticulation effects. The rules are used to convert the input phonetic sequence into an allophone sequence and to specify exactly how the allophones and their transitions should be produced. Typically several hundred precisely crafted rules are needed to control a formant synthesizer [1]. Since the number of control parameters is generally large, there has been some work in trying to reduce the number of parameters that need to be explicitly specified. In the approach described in [2], only a small set of high-level acoustic and articulatory parameters need to be specified. A set of mapping relations and constraints is then used to transform these high-level parameters into the larger number of low-level parameters required by the underlying formant synthesizer.

Rule-based formant synthesis has been very successful and has been used as the underlying technology in many commercial speech synthesis systems [15]. It is compact in size since the rules don't take up much computer memory, and it has the flexibility to create new voices and voice qualities by simply changing the rules and parameters without having to modify the model.
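
The two-pole resonator mentioned above has a compact digital form; one standard discretization (the difference equation popularized by Klatt's formant synthesizer) is sketched below, together with a tiny cascade of three resonators driven by a pulse-train excitation. The formant frequencies, bandwidths, and F0 are placeholder values.

    import numpy as np
    from scipy.signal import lfilter

    FS = 16000  # sample rate (Hz)

    def resonator(freq, bw, x):
        """Two-pole digital resonator: y[n] = A*x[n] + B*y[n-1] + C*y[n-2],
        with center frequency `freq` and bandwidth `bw` given in Hz."""
        T = 1.0 / FS
        C = -np.exp(-2.0 * np.pi * bw * T)
        B = 2.0 * np.exp(-np.pi * bw * T) * np.cos(2.0 * np.pi * freq * T)
        A = 1.0 - B - C                       # unity gain at DC
        return lfilter([A], [1.0, -B, -C], x)

    def pulse_train(f0, n_samples):
        x = np.zeros(n_samples)
        x[::int(FS / f0)] = 1.0
        return x

    # Cascade three formant resonators (placeholder vowel-like values).
    excitation = pulse_train(f0=120.0, n_samples=FS // 2)
    speech = excitation
    for freq, bw in [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]:
        speech = resonator(freq, bw, speech)

    print(speech.shape)   # (8000,)

A parallel configuration would instead filter the excitation through each resonator separately, apply per-formant gains, and sum the outputs.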

4.4 Concatenative Synthesis

In concatenative synthesis, speech is produced by retrieving appropriate intervals of stored natural speech, stringing them together, and then performing some signal processing on the result to smooth out the segment transitions and to match the specified prosodic characteristics. Desirable properties for the set of speech segments include accounting for as many coarticulatory effects as possible, having minimal discontinuities at the concatenation points, and being as few in number as possible [6, 27]. Longer segments are able to capture more coarticulation and have fewer concatenation points than shorter ones; however, the number of different segments grows exponentially with the length of the segment. To minimize concatenation discontinuities, it is advantageous to store multiple instances of each segment to handle distinctions between different contexts; however, the size of the unit inventory will be dramatically increased. These conflicting objectives mean that there has to be some tradeoff in selecting the synthesis units. Many different types of synthesis units have been explored. These include linguistically motivated units such as words, syllables, demi-syllables, diphones, and phones, as well as automatically derived variable-length units [4, 25]; other units include sub-phonetic segments corresponding to the states in a trained hidden Markov model (HMM) of a phone [5,12].

Diphones have been the most popular synthesis units because they provide a reasonable tradeoff between capturing many coarticulation effects, minimizing concatenation discontinuities, and being relatively small in number. A diphone segment captures the transition between two phones by starting in the middle of the first phone and ending in the middle of the second one. The endpoints are chosen to be in the middle of phones because coarticulatory influences tend to be minimal at a phone's acoustic center, which should help in reducing the discontinuity at concatenation boundaries. With 40 or so phonemes in English, there are only 1600 possible diphones, and not all of them occur in natural speech. To improve coverage of highly coarticulated phone sequences spanning more than two phones, diphone inventories have typically been augmented to include some longer units that are three or four phones in length [28].

More recently there has been work with phone and sub-phone units, primarily due to the application of continuous speech recognition modeling techniques (e.g., HMMs) to the speech synthesis task [5, 12]. In the past, the use of phone-based units for concatenative synthesis was difficult because of the strong contextual variation in the acoustic realizations of each phone and the resulting problems of storing multiple variants of each phone, selecting the appropriate units during synthesis, and ensuring concatenation smoothness. The development of automatic selection algorithms to determine which segments to use in a particular context and the development of automatic segmentation and better concatenation algorithms to ensure smoother transitions have made it possible to explore these methods. The increased context sensitivity of these approaches should lead to the selection of more appropriate units and hopefully better quality synthetic speech.

There has recently been a trend towards data-driven concatenative synthesis and away from rule-based synthesis. In [27], the following factors are cited for this trend: 1) the dramatic increase in storage capacity, which reduced the importance of the compact size advantage of rule-based approaches; 2) the development of new algorithms that have significantly improved the smoothness of transitions between concatenated segments; and 3) the ability of concatenative approaches to produce difficult sounds by simply storing the corresponding waveforms without having to determine and model the details of how the sound is produced. In addition to these factors, other reasons include the increase in computing power, the development and availability of large text and speech corpora, and the development and successful use of data-driven and machine-learning methods in related fields such as continuous speech recognition.

5 Data-driven Approaches to Concatenative Speech Synthesis

In this section, we begin with a description of the structure of a generic concatenative synthesis system and its various processing components. Then, for each component, we examine specific data-driven and machine-learning approaches that have been developed and used for that task.

5.1 Components of a Concatenative Speech Synthesizer

Although there are many different approaches to concatenative speech synthesis, most can be described by a common structural block diagram. As illustrated in Figure 4, there are two major subsystems: one for training and the other for synthesis. In training, an inventory of synthesis units is created from a corpus of speech data. During synthesis, the phonetic and prosodic information specifying the utterance to be synthesized is taken as input. Appropriate units are then selected from the unit inventory, concatenated, and processed to create the output speech waveform. The training and synthesis subsystems are themselves composed of several processing components as shown in Figure 4. We now briefly describe these different components.

Figure 4: Block diagram of the components of a concatenative speech synthesizer. (Training components: speech corpus and linguistic parameters, segmentation and labeling, inventory design, speech coding, and the stored unit inventory. Synthesis components: unit selection, concatenation, and signal processing, producing the output speech.)

Speech Corpus The speech corpus serves as the source of synthesis units and generally consists of speech recorded from a single speaker. The corpus can be an existing speech database that has been developed for use in other speech applications such as speech recognition, or it can be custom designed and recorded just for speech synthesis purposes. In addition to the speech waveforms, the corresponding orthographic transcriptions of the utterances also need to be supplied.

Speech Segmentation and Labeling The first processing step is to segment and label the speech corpus so that we know which sounds occur where. In addition to phonetic labels, other information such as prosodic features can also be computed and associated with the speech segments. This information is later used to help with unit inventory design, speech segment extraction, and unit selection.

Unit Inventory Design The goal of this component is to come up with a set of speech segments that covers all the phonetic variations in the language and can be used to construct all the legal phone sequences. Other desirable properties for the synthesis unit inventory include accounting for as many coarticulatory effects as possible, having minimal discontinuities at the concatenation points, and being as few in number as possible. The design of the unit inventory is probably the most important factor in a concatenative synthesis system since it impacts almost every other processing component. The content of the unit inventory determines the type and size of the speech corpora needed, how the synthesis units are selected and concatenated, and what type of signal processing can and needs to be done.

Speech Coding and Storage The unit inventory needs to be indexed and stored so that it can be accessed during synthesis. Many systems excise the segmental units from the speech corpus and either store the raw waveforms or perform some speech analysis on the segments and store the resulting analysis parameters (e.g., LPC coefficients). In other systems, pointers to locations in the original speech corpus are stored instead. The primary concerns are storage space and flexibility of the parametric representation for later signal processing.

Unit Selection During synthesis, appropriate units have to be automatically selected and retrieved from the synthesis unit inventory based on the input phonetic and prosodic specifications. The goal in the selection process is to choose the best units in terms of phonetic identity to match the sound to be produced, phonetic context to capture the coarticulation effects, and prosodic characteristics to match the prosody as closely as possible. Other considerations include minimizing the discontinuity at segment concatenation boundaries.

Unit Concatenation After the appropriate units are selected, the next step is to concatenate the individual segments into a contiguous stream. The goal here is to make the segment transitions as smooth as possible by minimizing the discontinuity at the boundaries. This is typically done using signal processing techniques to explicitly smooth the transitions. The degree of modification can be reduced if units are selected during synthesis and/or designed during training to minimize boundary discontinuities. The unit selection procedure can take into account the discontinuity at segment boundaries by modifying the objective function used in the search to include both segment match and segment concatenation criteria (a minimal sketch of such a combined search is given after these component descriptions).

Prosodic Signal Processing In addition to the correct sequence of phones, the prosody also needs to be appropriate. To obtain the prosody specified in the input, many systems perform some type of signal processing to modify the duration, fundamental frequency, energy, and possibly the spectral characteristics of the speech. Since the signal processing modifies the original speech, it can also degrade the naturalness of the speech. Therefore, it is important to minimize its use. The amount of modification can be reduced if units are selected during synthesis to match as closely as possible the required prosodic characteristics. In the limit, no prosodic modification is needed if the unit inventory is rich enough and includes appropriate prosodic labeling to enable well-matched units to be selected during synthesis.
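
The interplay between target match and concatenation smoothness can be made concrete with the minimal dynamic programming sketch below: each target position has several candidate units, and the search minimizes the sum of a per-unit target cost and a pairwise concatenation cost. The unit representation (a single spectral value plus an F0 value) and the cost definitions are invented placeholders; real systems use richer features and carefully tuned weights.

    # Minimal Viterbi-style unit selection over a candidate lattice.
    # Each candidate unit is (spectral_value, f0); each target spec is (desired_f0,).
    # Costs are toy penalties for illustration only.

    def target_cost(unit, spec):
        # Penalize prosodic mismatch between the candidate and the target spec.
        return abs(unit[1] - spec[0])

    def concat_cost(prev_unit, unit):
        # Penalize spectral discontinuity at the join.
        return abs(prev_unit[0] - unit[0])

    def select_units(candidates, specs, w_concat=1.0):
        """candidates[t] is the list of candidate units for target position t."""
        # best[t][j] = (cumulative cost, index of best predecessor)
        best = [[(target_cost(u, specs[0]), -1) for u in candidates[0]]]
        for t in range(1, len(specs)):
            column = []
            for u in candidates[t]:
                tc = target_cost(u, specs[t])
                cost, back = min(
                    (best[t - 1][i][0] + w_concat * concat_cost(p, u) + tc, i)
                    for i, p in enumerate(candidates[t - 1]))
                column.append((cost, back))
            best.append(column)
        # Trace back the lowest-cost path.
        j = min(range(len(best[-1])), key=lambda i: best[-1][i][0])
        path = []
        for t in range(len(specs) - 1, -1, -1):
            path.append(candidates[t][j])
            j = best[t][j][1]
        return list(reversed(path))

    candidates = [
        [(0.2, 118.0), (0.9, 121.0)],        # two candidates for position 0
        [(0.3, 99.0), (0.8, 102.0)],         # two candidates for position 1
    ]
    specs = [(120.0,), (100.0,)]
    print(select_units(candidates, specs))   # -> [(0.9, 121.0), (0.3, 99.0)]

Note that the lowest-cost path trades a slightly worse prosodic match at one position for a smoother spectral join, which is exactly the balance the combined objective is meant to strike.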

5.2 Comparison of Approaches to Concatenative Synthesis

In this section, we describe, compare, and analyze several different data-driven and machine-learning approaches to concatenative synthesis. In particular, we examine the appropriate processing components of the following three speech synthesis systems: Bell Labs' synthesizer [27], Microsoft's Whistler [11,13], and ATR's Chatr [4].

The Bell Labs synthesizer takes a more traditional diphone-based approach to concatenative synthesis. An augmented diphone synthesis unit inventory is designed manually using expert linguistic knowledge. A relatively small speech corpus is then designed to cover all the needed units and recorded by a single speaker. One instance of each type of unit is automatically selected to minimize the discontinuity at the boundaries, excised from the corpus, parameterized using LPC analysis, and stored. During synthesis, appropriate units are first selected from the inventory with a bias for longer units. Next, they are concatenated by linearly interpolating the LPC parameters to adjust their durations. Finally, the fundamental frequency contour and glottal model parameters are specified and LPC synthesis is used to create the speech waveform.

The Microsoft Whistler synthesizer uses a more data-driven approach to concatenative synthesis. The synthesis unit inventory is obtained by automatically clustering context-dependent phones derived from the speech corpus and then selecting multiple instances for each type. The speech segments are then excised, parameterized using LPC analysis, and stored. During synthesis, an optimal sequence of units is dynamically selected from the inventory based on an objective function that takes into account phonetic match, concatenation distortion, and prosodic mismatch distortion. Concatenation and prosodic processing are done by modifying the LPC and source model parameters in a manner similar to that done in the Bell Labs system.

The ATR Chatr synthesizer represents a more extreme data-driven approach to concatenative synthesis. The synthesis unit inventory is made up of a large number of non-uniform-length units that are automatically derived from a large speech corpus. Multiple instances of each type are kept to improve coverage of segmental and prosodic variations. Instead of extracting the segmental units, only pointers to locations in the original speech waveforms are stored. During synthesis, a search procedure that takes into account unit and continuity distortion measures is used to find the best sequence of units. The corresponding waveforms are then retrieved and concatenated to create the output speech. No signal processing is performed to smooth the concatenation boundaries or to modify the prosodic characteristics. Instead, it is up to the unit selection procedure to select the units with minimal concatenation distortion and the best-matching prosody. To enable this, a large speech corpus is needed in order to cover as many of the different context variations as possible. The segmental units must also be labeled with acoustic and prosodic features in addition to the standard phonetic identity and context information to facilitate unit selection.

We now compare the approaches used by the three different synthesis systems across various analysis dimensions. These dimensions correspond to the set of system components described earlier. Table 1 summarizes the comparison, with detailed discussions in the following subsections.

Table 1: Comparison of the three speech synthesis systems across various dimensions.

Speech Corpora
  Bell Labs Synthesizer: Specially designed utterances to cover units in a controlled context. Small corpus of single-speaker speech data.
  Microsoft Whistler: Large corpus of single-speaker stereo speech data (microphone and laryngograph) with orthographic transcriptions.
  ATR Chatr: Large corpus of single-speaker speech data with orthographic transcriptions.

Segmentation and Labeling
  Bell Labs Synthesizer: Manually segment and label synthesis units.
  Microsoft Whistler: Automatic alignment using supervised HMM decoding. Pitch-synchronized segments.
  ATR Chatr: Automatic HMM alignment. Compute acoustic, phonetic, and prosodic features.

Unit Inventory Design
  Bell Labs Synthesizer: Augmented diphones manually designed using linguistic knowledge and acoustic analyses. Small number of units. Single instance for each type.
  Microsoft Whistler: Automatically derived decision-tree clustered context-dependent phones. Scalable number of units. Multiple instances for each type.
  ATR Chatr: Automatically derived variable-length phone sequences. Large number of units. Multiple instances for each type.

Speech Coding and Storage
  Bell Labs Synthesizer: Extract segmental units. Perform LPC analysis and store parameters. Normalize amplitude of units.
  Microsoft Whistler: Extract segmental units. Perform pitch-synchronous LPC analysis and store parameters.
  ATR Chatr: Store pointers to segment locations in original waveform and associated feature vectors.

Unit Selection
  Bell Labs Synthesizer: Rule-based unit selection with preference for longer units in ambiguous cases.
  Microsoft Whistler: Dynamic programming search to find best unit sequence. Minimize phonetic, prosodic, and concatenation distortions.
  ATR Chatr: Viterbi search to find best unit sequence. Minimize phonetic, prosodic, and concatenation distortions.

Unit Concatenation
  Bell Labs Synthesizer: Interpolate LPC parameters at segment boundaries.
  Microsoft Whistler: Interpolate LPC parameters at segment boundaries.
  ATR Chatr: Concatenate waveform segments with no extra processing.

Prosodic Signal Processing
  Bell Labs Synthesizer: Alter F0, duration, energy, LPC parameters, and glottal model parameters.
  Microsoft Whistler: Alter F0, duration, LPC parameters, and glottal model parameters.
  ATR Chatr: Little to no signal processing for prosodic modification. If done, TD-PSOLA is used.

5.2.1 Speech Corpus

In the Bell Labs synthesizer, the synthesis unit inventory is designed first. This means that the utterances in the speech corpus can be designed so that all the units in the inventory can be covered with a reasonably small number of sentences. Limiting the amount of speech that needs to be recorded can improve the consistency of the speech by allowing it to be collected in a single session and without fatiguing the speaker. One approach is to use a greedy algorithm that selects the minimal number of words needed to cover the given set of units. The drawbacks of this approach are that most units will have only one occurrence in the set of words and there is little control over the phonetic context in which the units will appear. An alternative approach is to embed each unit in the same carrier phrase so that the prosodic context and position in the phrase can be controlled. To avoid over-articulation and to ensure that all units have uniform stress, the syllable containing the unit is also constrained to have only secondary stress. To provide coverage of different phonetic contexts for candidate inventory units, the contextual place of articulation for each unit is systematically varied.
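
The greedy coverage idea mentioned above is essentially greedy set cover. A small sketch follows, in which the units are diphone-like pairs extracted from toy word pronunciations; the word list and pronunciations are invented, and a real prompt-design procedure would work over sentences and a much larger unit set.

    def diphones(phones):
        """Units as adjacent phone pairs, including word-boundary silence '#'."""
        seq = ["#"] + phones + ["#"]
        return {(a, b) for a, b in zip(seq, seq[1:])}

    # Toy word list with invented pronunciations.
    WORDS = {
        "cat":  ["k", "ae", "t"],
        "cab":  ["k", "ae", "b"],
        "tack": ["t", "ae", "k"],
        "bat":  ["b", "ae", "t"],
    }

    def greedy_cover(words, required):
        """Repeatedly pick the word covering the most not-yet-covered units."""
        chosen, remaining = [], set(required)
        while remaining:
            word = max(words, key=lambda w: len(diphones(words[w]) & remaining))
            gained = diphones(words[word]) & remaining
            if not gained:
                break                      # some required units are unreachable
            chosen.append(word)
            remaining -= gained
        return chosen, remaining

    required = set().union(*(diphones(p) for p in WORDS.values()))
    script, uncovered = greedy_cover(WORDS, required)
    print(script, uncovered)   # words chosen in coverage order; uncovered is empty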

A large corpus of read speech from a single speaker is used in both the Microsoft Whistler and ATR Chatr systems. Since Whistler performs a pitch-synchronous LPC analysis of the speech segments, reliable voicing and pitch period locations need to be estimated. To facilitate this, stereo recordings are made using a microphone and a laryngograph. The laryngograph measures vocal cord movement, and its signal can be used for estimating the voicing information. This additional processing is avoided in Chatr because it does not do LPC analysis; rather, it operates directly with the speech waveforms. One goal of Chatr is to enable the use of existing natural speech corpora for synthesis without having to collect new specialized data. Since the unit inventories in both these systems are derived automatically from the speech data, it is important to have as large a speech corpus as possible to ensure that there are many occurrences of different speech sounds in a variety of contexts. In addition to the speech waveforms, an orthographic transcription of the utterances and a pronunciation dictionary that maps words to phones also need to be supplied. These are used to help in the automatic alignment of the phonetic labels and the speech segments.

5.2.2 Speech Segmentation and Labeling

In the Bell Labs system, the specially recorded speech corpus is segmented and labeled manually to mark the location and identity of the candidate synthesis units. The particular instance of each unit to use and its precise segment boundaries are determined later using an automatic procedure.

In the Whistler and Chatr systems, the segmentation and labeling of the speech data is performed automatically using continuous speech recognition techniques. For each utterance, the corresponding orthographic transcription is first converted to a phone sequence using a pronunciation dictionary. Trained HMM models of the phones are then aligned to the speech waveform using the Viterbi search algorithm. This procedure is sometimes referred to as "forced alignment" or "supervised decoding" because the correct sequence of phones is known. The alignment provides a start and end time for each phone, which is then used for unit inventory design, unit extraction, and unit selection. Although automatic segmentation is not perfect, the consistency of the segmentation combined with appropriate context labeling of the segments can largely overcome this problem.

In the Whistler system, the HMM phone models were fine-tuned after error analysis revealed that vowel-stop transitions were being included with the stop segment rather than the vowel segment. To correct this, the number of states in the HMM models for stops was reduced from three to two and for fricatives from three to one. Reducing the number of states reduces the length and variability of the segment the HMM model can represent. To enable pitch-synchronous LPC analysis, an additional processing step is needed to pitch-synchronize the aligned segments. Since the segmentation derived from the HMM alignment is based on uniform time frames rather than pitch-synchronized frames, each segment boundary needs to be moved to an appropriate pitch-synchronized location. Different criteria can be used to decide which pitch-synchronized boundary is appropriate, including choosing the one nearest in time or closest in pitch value.

In addition to assigning phonetic labels to each of the speech segments in the corpus, the Chatr system also computes acoustic and prosodic features for each segment to enable further distinctions between otherwise identical segmental contexts. The acoustic features include cepstral coefficients and energy. The prosodic features include duration, fundamental frequency, waveform envelope amplitude, spectral energy at the fundamental, harmonic ratio, and degree of spectral tilt. These values are then normalized for each phone class so that the distribution has zero mean and unit variance. First differences of these features are also computed to indicate the direction and magnitude of the change. These additional features are used in the design of the unit inventory and in the unit selection process during synthesis.
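
The per-phone-class normalization of prosodic features described for Chatr is a straightforward z-scoring step; a minimal sketch is shown below. The segment records are invented and only duration and F0 are included, whereas the real system normalizes a longer feature list and also stores first differences.

    import numpy as np

    # Invented per-segment records: (phone label, duration in ms, mean F0 in Hz).
    segments = [
        ("aa", 110.0, 140.0), ("aa", 150.0, 120.0), ("aa", 130.0, 160.0),
        ("t",   55.0, 0.0),   ("t",   70.0, 0.0),   ("t",   62.0, 0.0),
    ]

    def normalize_per_phone(segments):
        """Z-score duration and F0 within each phone class (zero mean, unit variance)."""
        by_phone = {}
        for phone, dur, f0 in segments:
            by_phone.setdefault(phone, []).append((dur, f0))
        stats = {p: (np.mean(v, axis=0), np.std(v, axis=0)) for p, v in by_phone.items()}
        normalized = []
        for phone, dur, f0 in segments:
            mean, std = stats[phone]
            std = np.where(std > 0, std, 1.0)          # guard degenerate classes
            z = (np.array([dur, f0]) - mean) / std
            normalized.append((phone, z[0], z[1]))
        return normalized

    for row in normalize_per_phone(segments):
        print(row)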

5.2.3 Unit Inventory Design

The unit inventory in the Bell Labs systems is composed of diphone units augmented with additional longer polyphone units. The design of the unit inventory is done in two stages. The first stage is rule-based, where general acoustic, phonetic, and phonological principles are used to reduce the number of possible diphone units. Phonotactic constraints can be used to eliminate transitions that do not occur in the language, and diphones with minimal coarticulation can be removed. For example, many consonant-consonant transitions contain a short period of silence. In these cases, the connection can be replaced with consonant-silence and silence-consonant units without losing important transitional information. This also reduces the number of units needed to cover these transitions from n² to just 2n in the case of n consonants. The second stage is data-driven and consists of performing acoustic analyses on the recorded speech to decide which phones need to be distinguished and which ones can be grouped together, which units need to have multiple instances to capture contextual and prosodic variations, and which polyphone units are needed to model strong coarticulatory effects. Much of this analysis is done manually and can be very time and labor intensive. However, a careful design process can drastically reduce the number of units in the inventory and can simplify the inventory unit collection process by reducing the size of the speech database that needs to be recorded.

To create the final synthesis unit inventory, one token of each type of unit needs to be selected from the multiple candidates. The particular token and its segment boundaries are automatically selected to minimize segmental distortion and concatenation discontinuity. The search process begins by finding a particular feature vector, called an "ideal point" (IP), such that for each unit type there is at least one token that comes within a distance ε of IP at some point in its feature trajectory. This ideal point can then be used to determine the segment boundaries by selecting the frame closest to IP. This method guarantees that the spectral discrepancy at the concatenation boundary of any two such units is at most 2ε. By making ε small, the concatenation discontinuities can be reduced.

The synthesis unit inventory in the Whistler system is composed of context-dependent phones that are automatically derived from analyzing a large speech corpus. The units are determined by first constructing a large set of context-dependent phones (triphones, quinphones, stress-sensitive phones, and word-dependent phones); training HMM models for each of these context-dependent phones; then clustering them using decision trees to group similar context-dependent phones together; and finally selecting a subset of the phones from each cluster to be stored in the inventory. The decision tree is generated automatically from the speech database. The first step in building the decision tree is to formulate a large set of yes/no linguistic questions such as "Is the left phone a nasal?" or "Is the right phone a vowel?". Next, a training set of context-dependent phones is recursively split by choosing the question that results in the minimum within-unit distortion computed using the HMM phone models. Basically, the goal is to choose the split that groups the most similar units together. Each split corresponds to a non-terminal node in the binary decision tree, and the selected question becomes the splitting criterion for that node. A sketch of a single splitting step is shown below.
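The following Python fragment is a minimal sketch of one greedy split: it evaluates each candidate yes/no question and keeps the one that minimizes the summed within-cluster distortion of the two children. For simplicity, distortion is measured here as the scatter of per-token feature vectors around the cluster mean; in Whistler the distortion is computed with the HMM phone models, and the question set, data layout, and helper names below are illustrative assumptions.

    import numpy as np

    def best_split(tokens, questions):
        """Pick the yes/no question minimizing within-cluster distortion for
        one node of a phonetic decision tree (a sketch only).

        tokens    : list of (context, features) pairs; `features` is a 1-D
                    numpy vector summarizing the token.
        questions : list of (name, predicate) pairs, e.g.
                    ("left phone is a nasal?", lambda ctx: ctx.left in NASALS).
        Returns (question name, yes-tokens, no-tokens), or
        (None, tokens, []) if no question separates the node.
        """
        def distortion(group):
            if len(group) < 2:
                return 0.0
            feats = np.stack([f for _, f in group])
            return float(np.sum((feats - feats.mean(axis=0)) ** 2))

        best = None
        for name, predicate in questions:
            yes = [t for t in tokens if predicate(t[0])]
            no = [t for t in tokens if not predicate(t[0])]
            if not yes or not no:
                continue                 # question does not split this node
            cost = distortion(yes) + distortion(no)
            if best is None or cost < best[0]:
                best = (cost, name, yes, no)
        if best is None:
            return None, tokens, []
        return best[1], best[2], best[3]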
The tree is recursively grown until the desired number of terminal or leaf nodes is reached or the number of phone tokens in the leaf nodes falls below some threshold. A subset of the phones from each leaf node is then selected to be in the unit inventory. The number of leaf nodes therefore determines the number of synthesis units, and the size of the inventory can be controlled by specifying the depth of the decision tree. The structure of the decision tree provides a nice mechanism to smoothly trade off unit inventory size and unit specificity. A small inventory requires a shallow tree, which means the clusters are less context specific since they are created based on a small number of questions. A large inventory means a deep tree with more questions and, as a result, more context specific clusters. The decision tree also allows generalization to new contexts not seen in the training data by backing off to broader phonetic categories for the neighboring contexts.

A subset of the context-dependent phones from each leaf node is selected to be in the unit inventory by choosing the tokens that are most appropriate for concatenation and prosody modification. The selection procedure first computes statistics for duration, energy, and pitch, and removes those tokens with values far away from the average. A small number of the remaining units are then selected based on an objective function that measures how well the token can represent the cluster. The duration-normalized HMM score, which measures how well an individual token matches the distribution of the cluster, is used as the objective function. Tokens with the highest HMM scores are then chosen to represent the cluster.

In the Chatr system, the synthesis unit inventory is composed of non-uniform length phone sequences that are automatically derived from analyzing a large speech corpus. The algorithm for determining the set of units incrementally constructs longer units and proceeds as follows. First, counts of all current units are computed; next, the most frequently occurring unit is conjoined with its most frequently co-occurring neighbor to produce a new (and longer) unit; then, the cycle is repeated with the new set of units (a sketch of this incremental procedure is given at the end of this subsection). The iteration terminates when the number of tokens for any unit drops below a specified threshold. The threshold is chosen based on the size of the corpus and the number of different unit types desired. A larger threshold produces fewer unit types but with more tokens per type. With each iteration of the clustering process, the number of different unit types increases and the distribution of the different units becomes more uniform. It should be noted that this inventory design procedure does not guarantee that the most difficult phonetic transitions are included in the inventory; only the more common phone sequences in the corpus are assured of being discovered.

To improve coverage of segmental and prosodic variations, multiple instances of each unit type are selected and included in the unit inventory. The particular instances are determined by vector quantizing the feature vectors of all the tokens of that type into n clusters and then choosing the instance closest to each cluster centroid. In this way the n most prosodically diverse tokens are selected. The choice of n is a trade-off between compact size and synthetic speech quality. The larger the unit inventory and the more variants there are of each unit type, the more likely a unit can be found during synthesis that closely matches the target segment. The selection of appropriate units minimizes the amount of modification needed to smooth unit transitions and to match prosodic characteristics. More recently, the Chatr system advocates the use of all segments in the speech corpus as the synthesis inventory. In this situation, inventory design becomes simply an indexing of the speech corpus with appropriate phonetic labels and feature vectors, and synthesis becomes a retrieval and resequencing of the appropriate speech segments [3].
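The following Python sketch illustrates the incremental unit-growing procedure just described. It is not the ATR implementation; in particular, the merge direction (right-hand neighbor) and the exact stopping rule are simplifying assumptions.

    from collections import Counter

    def grow_units(phone_seqs, min_tokens):
        """Greedy construction of non-uniform-length units (a sketch).
        phone_seqs : list of utterances, each a list of unit labels
                     (initially single phones).
        The most frequent unit is repeatedly conjoined with its most frequent
        right-hand neighbor; iteration stops once the pair that would be
        merged occurs fewer than `min_tokens` times."""
        seqs = [list(s) for s in phone_seqs]
        while True:
            unit_counts = Counter(u for seq in seqs for u in seq)
            if not unit_counts:
                break
            top_unit, _ = unit_counts.most_common(1)[0]

            # Most frequent unit immediately following the top unit.
            neighbors = Counter(b for seq in seqs
                                for a, b in zip(seq, seq[1:]) if a == top_unit)
            if not neighbors:
                break
            neighbor, pair_count = neighbors.most_common(1)[0]
            if pair_count < min_tokens:
                break                      # the new unit would be too rare

            merged = top_unit + "+" + neighbor
            for seq in seqs:               # rewrite each occurrence of the pair
                i = 0
                while i < len(seq) - 1:
                    if seq[i] == top_unit and seq[i + 1] == neighbor:
                        seq[i:i + 2] = [merged]
                    i += 1
        return seqs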

5.2.4 Speech Coding and Storage

After the content of the synthesis unit inventory has been determined, the relevant segments have to be indexed and stored so that they can be accessed during synthesis. In the Bell Labs systems, the relevant acoustic units are excised from the speech corpus based on the minimal discontinuity endpoints determined during the design and selection process, parameterized using LPC analysis, and then stored in a single indexed file. Each acoustic unit is normalized so that the amplitude of each phone in the unit has the average amplitude for that phone. In the Whistler system, speech segments comprising the synthesis unit inventory are excised from the speech corpus based on the automatic alignment boundaries, parameterized using a pitch-synchronous LPC analysis, and then stored in a single indexed file. The parameterization of the speech segments using LPC analysis allows more efficient storage of the unit inventory and facilitates concatenation boundary smoothing and prosodic modification.

Unlike the other two systems, the acoustic units in the Chatr synthesizer are not extracted from the speech corpus to create a synthesis unit inventory. Instead, pointers to the appropriate locations in the original speech waveforms are stored along with the corresponding acoustic, phonetic, and prosodic feature vectors computed during the segmentation and labeling phase to create an indexed speech corpus. The speech segments are excised from the corpus during synthesis.
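A minimal sketch of what one record in such an indexed corpus might hold is shown below; the field names and layout are purely illustrative, not Chatr's actual storage format.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class UnitEntry:
        """One entry in an indexed speech corpus: rather than storing an
        excised waveform segment, it records where the segment lives in the
        original recording plus the features needed for unit selection."""
        phone: str            # phonetic label of the segment
        utterance: str        # path or ID of the source waveform
        start_sample: int     # segment start within the waveform
        end_sample: int       # segment end within the waveform
        features: np.ndarray  # acoustic/prosodic feature vector for selection

    def cut_segment(waveform, entry):
        """Excise the raw samples for `entry` at synthesis time."""
        return waveform[entry.start_sample:entry.end_sample]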

5.2.5 Unit Selection

In the Bell Labs system, the phonetic sequence specification that is input to the synthesis component is converted into an appropriate sequence of synthesis units using a set of rewrite rules. Because the diphone units are augmented with additional units during the inventory design process, the mapping is not deterministic as it would be if the unit inventory were purely diphonic; multiple unit sequences can now cover the same input phone sequence. Deviations from a strictly diphonic structure include the following: 1) not all phone pairs are stored in the inventory; 2) all fricatives have a special steady-state unit with a single phone label; 3) there can be multiple versions of a unit that differ only by phonetic context; and 4) triphone or even longer units can be stored in the inventory. The goal of the rewrite rules is therefore to select the unit sequence that optimally (according to some criteria) covers the input phonetic sequence. The mapping process steps through the input phonetic sequence from left to right, attempting to match the longest possible phone substring to an existing inventory unit. If only one unit matches the current phone substring, then that unit is selected. If multiple units match, then the longest unit is selected. The search then continues starting with the ending phone of the previous matching unit until the entire input sequence is traversed. There are several special cases. For units starting with a stop, a silence unit is inserted right before the unit to model the stop closure. For fricatives, a steady-state fricative unit is inserted between the matching diphone units. For the phone [h], an appropriate context-dependent version is selected from the inventory to best match the following phonetic context. A sketch of the basic longest-match mapping (without the special cases) is given below.
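This Python fragment is a minimal sketch of the greedy longest-match strategy, under the assumption that adjacent units share their boundary phone; the toy inventory and the omission of the stop, fricative, and [h] special cases are simplifications for illustration.

    def map_to_units(phones, inventory):
        """Greedy left-to-right mapping of a phone sequence onto inventory
        units (a sketch). `inventory` is a set of tuples of phone labels;
        adjacent units share their boundary phone, as in a diphone-style
        inventory."""
        units, i = [], 0
        while i < len(phones) - 1:
            match = None
            # Try the longest candidate first.
            for j in range(len(phones), i + 1, -1):
                candidate = tuple(phones[i:j])
                if candidate in inventory:
                    match = candidate
                    break
            if match is None:
                raise ValueError(f"no unit covers context at {phones[i]!r}")
            units.append(match)
            i += len(match) - 1   # next unit starts at this unit's last phone
        return units

    # Example with a toy diphone-plus-triphone inventory.
    inv = {("sil", "k"), ("k", "ae"), ("k", "ae", "t"), ("ae", "t"), ("t", "sil")}
    print(map_to_units(["sil", "k", "ae", "t", "sil"], inv))
    # -> [('sil', 'k'), ('k', 'ae', 't'), ('t', 'sil')]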

In the Chatr system, a Viterbi search algorithm is used to select the sequence of synthesis units which are closest to the sequence of target segments in terms of relevant features and which will concatenate well together to minimize continuity distortion [14]. The cost function used in the search is therefore composed of two components: a target cost and a concatenation cost. The target cost, C^t(t_n, u_n), represents the difference between a target segment t_n and an inventory unit u_n and is calculated as the weighted sum of p target subcosts, C_i^t(t_n, u_n), between the elements of the target and candidate feature vectors:

    C^t(t_n, u_n) = \sum_{i=1}^{p} w_i^t \, C_i^t(t_n, u_n)

where w_i^t is the weight for subcost i, and p is the number of elements in the feature vectors, which varies between 20 and 30. Each target segment and inventory unit is characterized by a feature vector consisting of acoustic, phonetic, and prosodic features such as cepstral features, duration, pitch, energy, phonetic identity, phonetic context, and various distinctive features. The concatenation cost, C^c(u_{n-1}, u_n), represents the smoothness between two consecutive inventory units and is determined by the weighted sum of q concatenation subcosts, C_i^c(u_{n-1}, u_n):

    C^c(u_{n-1}, u_n) = \sum_{i=1}^{q} w_i^c \, C_i^c(u_{n-1}, u_n)

where w_i^c is the weight for subcost i. Currently, three subcosts (q = 3) are used: the cepstral distance at the point of concatenation and the absolute differences in log power and pitch. If u_{n-1} and u_n are consecutive units in the speech database, then their concatenation is natural and has a cost of zero. This encourages the selection of multiple consecutive phones from the speech database. The best sequence of inventory units, û_1^N, is determined by minimizing the total cost for a sequence of N units using a dynamic programming search:

    \hat{u}_1^N = \arg\min_{u_1, \ldots, u_N} C(t_1^N, u_1^N)
                = \arg\min_{u_1, \ldots, u_N} \left\{ \sum_{n=1}^{N} C^t(t_n, u_n) + \sum_{n=2}^{N} C^c(u_{n-1}, u_n) \right\}
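The following Python fragment is a minimal sketch of this dynamic programming search; the candidate lists, cost callables, and data structures are illustrative assumptions rather than the Chatr implementation, and the pruning steps described next are omitted.

    def select_units(targets, candidates, target_cost, concat_cost):
        """Dynamic-programming unit selection (a sketch of the search above).
        targets       : list of target segment specifications.
        candidates[n] : list of inventory units that may realize target n.
        target_cost(t, u) and concat_cost(u_prev, u) are the weighted-sum
        cost functions. Returns the minimum-cost unit sequence."""
        N = len(targets)
        # best[n][k] = (cost of best path ending in candidates[n][k], backpointer)
        best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
        for n in range(1, N):
            column = []
            for u in candidates[n]:
                t_cost = target_cost(targets[n], u)
                prev_cost, prev_k = min(
                    (best[n - 1][k][0] + concat_cost(candidates[n - 1][k], u), k)
                    for k in range(len(candidates[n - 1])))
                column.append((prev_cost + t_cost, prev_k))
            best.append(column)

        # Trace back from the cheapest final state.
        k = min(range(len(candidates[N - 1])), key=lambda j: best[N - 1][j][0])
        path = []
        for n in range(N - 1, -1, -1):
            path.append(candidates[n][k])
            k = best[n][k][1]
        return list(reversed(path))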

In order to obtain near real-time performance with a large speech database, some pruning of the search space is required. This is done in multiple steps. First, units with phonetic contexts similar to the target segments are identified. Next, the remaining units are pruned using the target cost and finally with the concatenation cost. Real-time performance can be achieved on a database with 100,000 units using a beam width of 10-20 units.

The weights for the different subcosts (w_i^t and w_i^c) determine the relative importance of the different features. Two approaches have been used to automatically train these weights using the speech database. One performs a limited search of the weight space to find a good set of weights. The other involves an exhaustive comparison of the units in the database and linear regression to estimate the weights. Both training methods use targets drawn from natural utterances held out from the synthesis database. The goal of the training process is to determine the weights which minimize the difference between the natural waveform and the synthesized waveform generated by the synthesizer when given the target specifications of the natural waveform. In other words, the weights are trained by improving the mimicking performance of the synthesizer. The distance measure currently used to determine the difference between the natural and synthesized segments is the mean cepstral distance. Although this distance measure is correlated with human perception, it may not be optimal for maximizing the quality of the synthesized speech. Comparisons of human perceptual measures and the cepstral distance measure show that people are more sensitive to continuity distortions, while the cepstral distance measure gives more importance to unit distortions. Clearly, a better distance measure that is more indicative of perceived speech quality is needed.

The Whistler system also uses a dynamic programming search to find the optimal sequence of synthesis units (context-dependent phones) to concatenate [11]. The search tries to minimize an objective function that takes into account phonetic mismatch (via HMM scores), unit concatenation distortion, and prosody mismatch distortion. Similar to the Chatr objective function, the concatenation cost between two units is zero if they occurred in sequence in the original speech database. If not, the concatenation cost is based on spectral distortion measures at the boundary and the phonetic identity of the unit, to reflect higher tolerance of mismatches for some sounds (fricatives) over others (vowels). For each target segment, a set of candidate units on which to perform the search is first selected by using the context-dependent phonetic decision tree created during the design of the unit inventory. The target segment is first presented to the root of the tree; the tree is then traversed by answering questions at the non-terminal nodes until a leaf node is reached; associated with the leaf node is a set of context-dependent phone units that should be close matches for the target segment.

5.2.6 Unit Concatenation

After the appropriate sequence of units is selected from the unit inventory and retrieved, the next step is to concatenate the segments together. In the Bell Labs synthesizer, concatenation is done by interpolating the LPC parameters at the boundary frames of the speech segments (a sketch of this interpolation is given at the end of this subsection). There are two different operations depending on the identity of the boundary phone (the phone that terminates the left unit and starts the next one), and on whether the target duration of the boundary phone is greater or smaller than the phone's duration in the two segments. For vowels and semi-vowels where the target duration exceeds the duration in the segments, new frames of LPC parameters are generated by linearly interpolating between the final and initial frames of the two segments. In the remaining cases where the boundary phone is a consonant or a vowel/semi-vowel that needs to be shortened, the spacing between the frames of the phone is uniformly adjusted to achieve the target duration; no frames are discarded. Although the Whistler papers do not elaborate on the specific LPC parameter interpolation scheme used to concatenate the sequence of selected speech segments, we assume that they do something similar to what is done in the Bell Labs system.

In the Chatr synthesizer, no signal processing is performed at the concatenation boundaries. The selected sequence of waveform segments is simply joined together. Chatr relies on the unit selection procedure to select the appropriate unit sequence with minimal concatenation discontinuities.
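Returning to the boundary interpolation scheme, the fragment below is a minimal sketch of the idea. It treats each per-frame LPC parameter set as an opaque vector; in practice interpolation is usually performed on a representation that preserves filter stability (for example, line spectral frequencies or reflection coefficients). The function name and array layout are illustrative assumptions, not the Bell Labs implementation.

    import numpy as np

    def join_lpc_units(left, right, extra_frames=0):
        """Concatenate two LPC-parameterized units at a shared boundary
        phone (a sketch). `left` and `right` are 2-D arrays of per-frame
        parameter vectors. If the target duration calls for lengthening,
        new frames are generated by linear interpolation between the last
        frame of `left` and the first frame of `right`."""
        if extra_frames <= 0:
            return np.vstack([left, right])
        a, b = left[-1], right[0]
        # Interior interpolation points between the two boundary frames.
        alphas = np.linspace(0.0, 1.0, extra_frames + 2)[1:-1]
        bridge = np.stack([(1 - t) * a + t * b for t in alphas])
        return np.vstack([left, bridge, right])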

5.2.7 Prosodic Signal Processing

Since both the Bell Labs and Whistler systems encode speech segments in the synthesis unit inventory using LPC analysis parameters and then use LPC synthesis to reconstruct the speech waveforms, they both perform prosody modification of the speech in similar ways. As discussed in Section 4.1, the LPC source-filter analysis/synthesis model allows independent control of many parameters that can be used for adjusting prosodic characteristics. The fundamental frequency, voiced/unvoiced decision, and glottal excitation parameters can be specified to the source model; the amplitude of the signal is a separate parameter; and duration and spectral modifications can be made by appropriately changing the LPC filter coefficients. The desire to improve the quality of synthesized speech has prompted the development and use of more realistic models of the voice source. In Whistler, different LPC methods with more complex source models have been used, including residual-excited LPC and codebook-excited LPC. The Bell Labs system uses a more complex glottal flow model where voice quality parameters such as spectral tilt, open quotient, and aspiration noise can be specified [21]. The model can be dynamically controlled during synthesis using rules to change the voice quality of the speech based on the prosodic context.

The Chatr synthesizer, on the other hand, advocates little to no signal processing for prosodic modification. They argue that since prosodic signal processing degrades the naturalness of the speech, it is important to minimize its use. They rely on the unit selection procedure to select units whose phonetic and prosodic feature vectors best match those of the target segments, minimizing the amount of modification needed. Since a raw waveform representation is used for the units, if any prosodic signal processing is done, it will likely use the PSOLA (pitch-synchronous overlap and add) procedure [20]. PSOLA is a non-source-filter method that enables speech segments to be smoothly concatenated while allowing the pitch and duration of the segments to be altered. There are three processing stages. First, the speech is analyzed into many short-term signals by windowing the waveform pitch-synchronously in regions of voiced speech and at fixed intervals in regions of unvoiced speech. Second, the speech is modified by manipulating the number, spacing, and shape of the short-term signals. Pitch is changed by altering the spacing of the short-term signals. Duration is modified by adding or removing integral numbers of the short-term signals. Finally, the new sequence of short-term signals is recombined using overlap and add methods to create the modified and concatenated speech waveform. PSOLA can also be used in conjunction with the LPC representation. PSOLA processing can either be applied to the waveform resulting from LPC resynthesis or to the excitation signal generated by the source model before it passes through the LPC filter [20].
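The fragment below is a heavily simplified sketch of the pitch-modification step of this kind of overlap-add processing, assuming analysis pitch marks are already available. It only repositions the windowed short-term signals; duplicating or dropping frames to preserve the overall duration, and the separate handling of unvoiced regions, are omitted, so it should be read as an illustration of the idea rather than a usable PSOLA implementation.

    import numpy as np

    def change_pitch_ola(signal, pitch_marks, factor):
        """Simplified pitch modification by overlap-add of pitch-synchronous
        short-term signals (a sketch). `pitch_marks` are sample indices of
        the analysis pitch marks; `factor` > 1 raises pitch by shrinking the
        spacing of the re-placed short-term signals."""
        out = np.zeros(int(len(signal) / factor) + len(signal))
        new_center = float(pitch_marks[1])
        for i in range(1, len(pitch_marks) - 1):
            left = pitch_marks[i] - pitch_marks[i - 1]
            right = pitch_marks[i + 1] - pitch_marks[i]
            # Two-period, Hann-windowed short-term signal around the mark.
            frame = signal[pitch_marks[i] - left:pitch_marks[i] + right]
            frame = frame * np.hanning(len(frame))
            start = int(new_center) - left
            if 0 <= start and start + len(frame) <= len(out):
                out[start:start + len(frame)] += frame   # overlap and add
            new_center += right / factor   # rescaled pitch-mark spacing
        return out[:int(new_center)]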

6 Discussion and Conclusions

As we have seen, data-driven and machine-learning methods have been developed and used in almost every processing component of a speech synthesis system. The use of these automatic methods, combined with the availability of large text and speech corpora, has enabled large amounts of data to be analyzed and the trends and characteristics that occur in the data to be modeled. For example, in the design of unit inventories for concatenative synthesis, the use of statistical clustering methods has allowed more specialized speech segments that are sensitive to phonetic and prosodic contexts to be derived from automatic analysis of a large corpus of speech. These context-dependent units can better match the characteristics of target segments, resulting in more natural synthetic speech by reducing the amount of signal processing modification needed. Deriving these units and performing this type of data analysis would be almost impossible with manual methods.

Currently no concatenative synthesis approaches model coarticulation using rules as traditionally done in formant and articulatory synthesis approaches. Instead, they rely on the acoustic unit inventory design process to produce appropriate speech segments to capture coarticulation phenomena. However, in most systems, prosody is still modeled using rules and generated by modifying the acoustic features of the speech segments. An interesting question is "will capturing prosodic effects follow the same path as coarticulation phenomena and be modeled in the design of the synthesis units?" In some systems, like ATR's Chatr, prosodic features are already used in the unit inventory design process. They are also used in the unit selection process during synthesis and result in improved spectral match between the synthetic and natural speech [4]. The Whistler system only uses prosodic features to help in selecting appropriate units during synthesis. They mention including prosodic features in the design of the unit inventory as possible future work [11].

In [27, 30], the authors argue that trying to capture prosodic variations in the unit inventory can only work in extremely limited domains due to the vast combinatorial space of possible phone sequences and prosodic contexts. Additional complications include the highly variable nature of speech and the fact that only a limited amount of speech can be successfully recorded by a speaker in one session. Although many of the combinatorially possible units are very rare and can probably be safely ignored, the number of rare units is very large and the probability of some rare unit occurring in an arbitrary sentence can be large. Because of these considerations, the authors in [27, 30] conclude that prosodic modification of speech segments is a necessity and the development of better signal processing methods is important. While complete coverage of all possible phonetic and prosodic contexts by the unit inventory is not likely, significant coverage is possible. Even "large" synthesis unit inventories of 100,000 units comprise only about three hours of speech, much smaller than current speech databases, which contain hundreds of hours of speech. Also, the fact that a rare unit appears in a sentence does not mean that the entire utterance is bad. It may be preferable to have a small number of larger errors in an otherwise natural-sounding utterance than a large number of smaller distortions affecting the entire utterance.
Obviously, prosodic modification via signal processing does not have to be completely eliminated; both increased coverage and signal processing can be used together. With larger unit inventories and better contextual coverage, the amount of signal processing needed to modify the synthesis units to match the characteristics of the target segments can be reduced. In either case, improving signal processing methods for prosody modification and concatenation smoothing remains an important research area.

The availability of common text and speech corpora can also be used to perform more objective evaluations of speech synthesis systems, as has been done in speech recognition. The use of common training and test sets eliminates data variability between different systems and allows the comparison to focus on the different algorithms and approaches. Evaluation of speech synthesis output is definitely more subjective than evaluation of speech recognition output. Traditionally, human listening tests are used to judge the intelligibility and naturalness of synthetic speech. For phone-level intelligibility, the diagnostic rhyme test (DRT), the modified rhyme test (MRT), and the cluster identification test have been used. For sentence-level intelligibility, the semantically unpredictable sentences test has been used. And for naturalness, the paired comparison and the mean opinion score tests have been used [6].

One component of a concatenative synthesizer, the speech synthesis subsystem, can be more objectively evaluated using common training and testing corpora. A standard training corpus can be used as the source for the synthesis unit inventory. This controls variability in the quality, quantity, and coverage of phonetic and prosodic phenomena in the speech database. The synthesis task can then be copy synthesis, or mimicking, of natural utterances contained in a separate and independent test set. Target phonetic and prosodic features are derived from the natural speech test sentences and input to the trained synthesizer. Obtaining the target features in this way isolates the speech synthesis subsystem from the natural language processing components. Synthesis performance can then be evaluated by comparing the synthesized waveforms with the corresponding natural waveforms using an appropriate distance measure.

The definition of an appropriate measure of the similarity between a synthetic and a natural utterance is not straightforward. Since humans are the end consumers of synthesized speech, it is important that any metric used to measure the quality of synthetic speech be sensitive to the factors that are important for human speech perception. In the synthesis systems examined, the cost functions and distance measures used in the clustering and token selection processes during inventory design and in the unit selection process during synthesis may not be optimal from a speech perception point of view. Many of the measures are derived from those used in automatic speech recognition, which has very different requirements. For example, during evaluations of the Chatr system, it was found that large differences in perceived synthetic speech quality can sometimes correspond to only small differences in the cepstral distance [4]. Comparisons of human perceptual measures and the cepstral distance measure also show that people are more sensitive to continuity distortions, while the cepstral distance measure gives more importance to segmental distortions. In addition, cepstral distance, like many other measures, is not sensitive to inappropriate durations, while human perception is very sensitive to timing errors. There is definitely a need to develop new objective measures that better approximate human perceptual preferences.
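As a concrete illustration of such an objective comparison, the short Python sketch below computes the mean cepstral distance between a natural utterance and its copy-synthesized version, assuming the two cepstral frame sequences are already time-aligned (a reasonable assumption for copy synthesis, where the target durations are taken from the natural utterance). It is meant as an example of the kind of measure being criticized here, not as a recommended metric.

    import numpy as np

    def mean_cepstral_distance(natural_cep, synth_cep):
        """Mean per-frame cepstral distance between a natural utterance and
        its copy-synthesized version. Both inputs are 2-D arrays of per-frame
        cepstral coefficients, assumed time-aligned (a sketch)."""
        n = min(len(natural_cep), len(synth_cep))
        diff = natural_cep[:n] - synth_cep[:n]
        return float(np.mean(np.sqrt(np.sum(diff ** 2, axis=1))))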
The automatic approaches to unit inventory design that we have examined are not designed to explicitly capture the most difficult phonetic transitions. For example, the goal of the approach used in the Chatr system to obtain non-uniform length units is to find the most common phone sequences in the corpus. Modeling the most frequent sequences (and hence their transitions) may or may not capture the most important or difficult transitions. The automatically derived context-dependent phone units used in the Whistler system, by definition, do not even capture phonetic transitions since they are only one phone long. They rely on having a large enough number of context-sensitive phones so that the appropriate one will be used in the specified environment. Only the unit inventory design procedure used in the Bell Labs system explicitly tries to capture the most important and difficult transitions. It should be possible to change the objectives of the automatic inventory design procedures to take difficult phonetic transitions into account.

One advantage of concatenative synthesis approaches is that the synthesized speech is able to retain the characteristics of the donor speaker. However, a major disadvantage is the inability to easily change the speaker characteristics. Rule-based synthesis approaches such as formant and articulatory synthesis, on the other hand, provide a straightforward way to modify the speaker characteristics by changing the parameters of the model. Although this is simple in principle, it may not be easy to do in practice since the speech of a new speaker has to be analyzed to determine the appropriate modifications to the model parameters. In concatenative approaches, the only way to modify the speaker characteristics is to collect speech from a new speaker and create a new unit inventory. Although this sounds difficult, it can actually be done in a reasonable amount of time with the use of automatic data-driven methods. Another possible approach is to make use of speaker adaptation methods developed for speech recognition to appropriately transform the synthesis models based on a small sample of speech from a new speaker.

The development and use of automatic data-driven and machine-learning algorithms has become pervasive. They are continuously being improved, and an increasing number of speech synthesis approaches and systems are being developed using these methods. For these reasons, it may be the case that "the corpus-based approach is the key to understanding current research directions in speech synthesis and to predicting the future outcome of synthesis technology" [24].

References

[1] J. Allen, S. Hunnicutt, and D. Klatt, From Text to Speech: The MITalk System. Cambridge University Press, 1987.
[2] C. Bickley, K. Stevens, and D. Williams, "A framework for synthesis of segments based on pseudoarticulatory parameters," in Progress in Speech Synthesis (J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, eds.), pp. 211-220, New York: Springer, 1997.
[3] N. Campbell, "CHATR: A high-definition speech re-sequencing system," Acoustical Society of America and Acoustical Society of Japan, Third Joint Meeting, Dec. 1996.
[4] N. Campbell and A. Black, "Prosody and the selection of source units for concatenative synthesis," in Progress in Speech Synthesis (J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, eds.), pp. 279-292, New York: Springer, 1997.
[5] R. Donovan and P. Woodland, "Automatic speech synthesizer parameter estimation using HMMs," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, pp. 640-643, May 1995.
[6] T. Dutoit, An Introduction to Text-To-Speech Synthesis. Kluwer Academic Publishers, 1997.
[7] T. Ezzat and T. Poggio, "Video-realistic talking faces: A morphing approach," in Proceedings of the Audiovisual Speech Production Workshop, Rhodes, Greece, Sept. 1997.
[8] G. Fant, Acoustic Theory of Speech Production. The Hague: Mouton, 1960.
[9] H. Fujisaki and H. Kawai, "Realization of linguistic information in the voice fundamental frequency contour of the spoken Japanese," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, New York, NY, pp. 663-666, 1988.
[10] J. Holmes, "The influence of the glottal waveform on the naturalness of speech from a parallel formant synthesizer," IEEE Transactions on Audio and Electroacoustics, vol. AU-21, pp. 298-305, 1973.
[11] H. Hon, A. Acero, X. Huang, J. Liu, and M. Plumpe, "Automatic generation of synthesis units for trainable text-to-speech systems," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Seattle, WA, vol. 1, pp. 293-296, May 1998.
[12] X. Huang, A. Acero, J. Adcock, H. Hon, J. Goldsmith, J. Liu, and M. Plumpe, "Whistler: A trainable text-to-speech system," in Proceedings of the International Conference on Spoken Language Processing, Philadelphia, PA, vol. 4, pp. 2387-2390, Oct. 1996.


[13] X. Huang, A. Acero, H. Hon, Y. Ju, J. Liu, S. Meredith, and M. Plumpe, "Recent improvements on Microsoft's trainable text-to-speech system - Whistler," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, pp. 959-962, May 1997.
[14] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA, pp. 373-376, May 1996.
[15] D. H. Klatt, "Review of text-to-speech conversion for English," Journal of the Acoustical Society of America, vol. 82, pp. 737-793, 1987.
[16] D. H. Klatt and L. C. Klatt, "Analysis, synthesis, and perception of voice quality variations among female and male talkers," Journal of the Acoustical Society of America, vol. 87, no. 2, pp. 820-857, 1990.
[17] B. Le Goff, T. Guiard-Marigny, and C. Benoit, "Analysis-synthesis and intelligibility of a talking face," in Progress in Speech Synthesis (J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, eds.), pp. 235-246, New York: Springer, 1997.
[18] J. Lucassen and R. Mercer, "An information theoretic approach to the automatic determination of phonemic base forms," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, San Diego, CA, pp. 42.5.1-42.5.4, 1984.
[19] P. Mertens, F. Beaugendre, and C. d'Alessandro, "Comparing approaches to pitch contour stylization for speech synthesis," in Progress in Speech Synthesis (J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, eds.), pp. 347-363, New York: Springer, 1997.
[20] E. Moulines and F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, vol. 9, no. 5/6, pp. 453-467, 1990.
[21] L. Oliveira, "Estimation of source parameters by frequency analysis," in Proceedings of the European Conference on Speech Technology, Berlin, Germany, vol. 1, pp. 99-102, ESCA, Sept. 1993.
[22] J. Pierrehumbert, "Synthesizing intonation," Journal of the Acoustical Society of America, vol. 70, no. 4, pp. 985-995, 1981.
[23] M. Riley, "Tree-based modeling for speech synthesis," in Talking Machines: Theories, Models, and Designs (G. Bailly and C. Benoit, eds.), pp. 265-273, North-Holland, 1992.
[24] Y. Sagisaka, "Spoken output technologies - overview," in Survey of the State of the Art in Human Language Technology (R. Cole, ed.), pp. 189-195, National Science Foundation, 1995.
[25] Y. Sagisaka, N. Kaiki, N. Iwahashi, and K. Mimura, "ATR nu-talk speech synthesis system," in Proceedings of the International Conference on Spoken Language Processing, Banff, Canada, pp. 483-486, Oct. 1992.
[26] T. Sejnowski and C. Rosenberg, "NETtalk: A parallel network that learns to read aloud," Technical Report JHU/EECS-86/01, Johns Hopkins University, Baltimore, MD, 1986.
[27] R. Sproat, Multilingual Text-to-Speech Synthesis: The Bell Labs Approach. Kluwer Academic Publishers, 1998.
[28] R. Sproat and J. Olive, "An approach to text-to-speech synthesis," in Speech Coding and Synthesis (W. Kleijn and K. Paliwal, eds.), pp. 611-633, Amsterdam, Holland: Elsevier Science, 1995.
[29] J. P. H. van Santen, "Computation of timing in text-to-speech synthesis," in Speech Coding and Synthesis (W. Kleijn and K. Paliwal, eds.), pp. 663-684, Amsterdam, Holland: Elsevier Science, 1995.
[30] J. P. H. van Santen, "Combinatorial issues in text-to-speech synthesis," in Proceedings of the European Conference on Speech Technology, Rhodes, Greece, pp. 2511-2514, Sept. 1997.
[31] S. Young and F. Fallside, "Speech synthesis from concept: A method for speech output from information systems," Journal of the Acoustical Society of America, vol. 66, no. 3, pp. 685-695, 1979.
