Nonlinear Emotional Prosody Generation and Annotation 1

Jianhua Tao, Jian Yu, Yongguo Kang

National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, Beijing, 100080
{jhtao, jyu, ygkang}@nlpr.ia.ac.cn

Abstract. Emotion is an important element in expressive speech synthesis. This paper briefly analyzes prosodic parameters, stresses, rhythms and paralinguistic information in speech with different emotions, and labels the speech with rich, multi-layer annotation. A CART model is then used for emotional prosody generation. Unlike the traditional linear modification method, which directly modifies F0 contours and syllabic durations according to the acoustic distributions of emotional speech (such as F0 topline, F0 baseline, durations and intensities), the CART model maps the subtle prosody distributions between neutral and emotional speech using various kinds of context information. Experiments show that, with the CART model, the traditional context information is able to generate good emotional prosody output, and that the results can be further improved when richer information, such as stress, break and jitter information, is integrated into the context.

1. Introduction

Recently, more and more effort has been devoted to research on expressive speech synthesis, in which emotion is a very important element [1, 2]. Prosodic features such as pitch variables (F0 level, range, contour and jitter) and speaking rate have already been analyzed [3, 4], and there are also several implementations of emotional speech synthesis. For instance, Mozziconacci [5] added emotion control parameters on the basis of tune methods, resulting in higher performance. Cahn [6], by means of a visual acoustic parameter editor, achieved emotional speech output with manual adjustments. Recently, some efforts have been made using large corpora. A typical system was produced by Campbell [7], who built expressive speech synthesis from a large corpus collected over five years, giving impressive synthesis results. Schroeder [8] and Eide [9] generated expressive TTS engines that can be directed, via an extended SSML, to use a variety of expressive styles, trained on about ten hours of "neutral" sentences; furthermore, rules translating certain expressive elements into ToBI markup have been manually derived. Chuang [10] and Tao [11] used emotional keywords and emotion trigger words to build emotional TTS systems, in which the final emotion state is determined from the output of a text-content module; the results were used in dialogue systems to improve the naturalness and expressiveness of the answering speech.

1 The paper was supported by the National Natural Science Foundation of China under Grant No. 60575032.

As we have seen, most current emotional speech synthesis systems, apart from unit selection methods, are still based on the linear modification method (LMM) for prosodic parameters (some of them also provide voice quality control). The LMM directly modifies F0 contours (F0 top, F0 bottom and F0 mean), syllabic durations and intensities according to the results of acoustic distribution analysis. Previous analysis shows that the expression of emotion does not only influence these general prosodic features, but also affects more subtle ones, such as stresses, breaks and jitter. With this in mind, we annotate the emotional speech in a more detailed way and use a CART model that links linguistic features to the prosody conversion. To reduce the dimensionality of the output prosody parameters, we also adopt the pitch target model [12] for the output parameterization. The model is based on the assumption that "observed F0 contours are not linguistic units per se. Rather, they are the surface realizations of linguistically functional units such as tones or pitch accents." [12] To handle the input context information, we separate it into two parts: one part is traditionally used for normal speech synthesis, while the other is emotion-related prosodic information which normally can only be marked manually.

Experiments show that, with the CART model, the traditional context information is able to generate good prosody output for some emotion states, while the results can be improved when richer information, such as stress, break and jitter information, is integrated into the context. Listening tests also show that there is still some distance between the output emotional speech and the original, due to the lack of voice quality control, which will be addressed in our further research.

The paper is composed of five major parts. Section 2 introduces the corpus with emotion labeling, analyzes the acoustic features characteristic of the emotions, and describes the traditional linear modification model, which uses prosody patterns obtained directly from the acoustic mapping results; further analysis reveals that emotions are closely related to subtle prosodic distributions such as stress, rhythm and paralinguistic information. Section 3 describes the CART model used to convert prosodic features from "neutral" to emotional speech, with the pitch target model as the output parameterization. Section 4 provides more discussion of the proposed method via experiments. Section 5 concludes the work.

2. Corpus, Analysis and Annotation

For this work we use 2,000 sentences of spontaneous dialogue speech from one speaker, collected via a call center system in daily life. Each emotion state ("fear", "sadness", "anger" or "happiness") contains 500 sentences. Both the linguistic and the paralinguistic information are well preserved in the speech. Each sentence in our database contains at least two phrases; there were 1,201 phrases and 7,656 syllables in total. After the collection, all of the collected sentences were also read by a professional speaker in a "neutral" style; the recorded speech is used as the reference for comparing the emotional states with the "neutral" state. Utterances were then segmentally and prosodically annotated with pitch marks and phoneme (initial and final) boundaries. The emotional speech differs from the "neutral" speech in various aspects, including intonation, speaking rate, intensity, etc. The distribution of prosodic parameters over the different emotions in the corpus is shown in Table 1.

Table 1. The distribution of prosodic parameters in different emotions

                    Neutral   Fear   Sadness   Anger   Happiness
  F0mean (Hz)         135      119     108       152      168
  F0bottom (Hz)        86       81      83        95      109
  F0top (Hz)          181      165     141       256      238
  Dsyllable (ms)      169      173     198       162      178
  E (dB)               65       61      61        76       72

Here, the values are the means of the F0 mean (F0mean), F0 topline (F0top), F0 baseline (F0bottom), syllabic duration (Dsyllable) and intensity (E). The table partly confirms the previous finding [1] that "happiness" and "anger" yield a high F0, while "sadness" yields a lower F0 than "neutral", and "fear" is quite close to "sadness". The overlap of the F0 mean and F0 topline across emotions is smaller than that of the F0 baseline; it seems that the F0 mean and topline provide better "resolving power" for perception than the F0 baseline.

In general, the choice of contour is more related to the type of sentence, while the pitch level and the excursion size of the pitch movements are more related to the speaker's emotional state. Among the traditional emotional prosody generation methods, the linear modification method (LMM) is the most intuitive. It scales each input prosodic parameter x (F0 topline, F0 baseline, F0 mean, syllabic duration and intensity) by a transform factor α_{n,i}, calculated from the parallel training set of the corpus, to obtain the corresponding output y_{n,i}, where n denotes the emotional state ("fear", "sadness", "anger" or "happiness") and i indexes the emotion level ("strong", "normal" or "weak"):

y_{n,i} = α_{n,i} · x                                  (1)
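To make Equation (1) concrete, the sketch below applies emotion- and level-dependent scale factors to a set of neutral prosodic parameters. It is only an illustration: the function name and the scale factors are our own placeholders (the factors are roughly the ratios of the Table 1 means), not values or code from the paper.

```python
# Minimal sketch of the linear modification method (LMM) of Eq. (1).
# The scale factors below are illustrative placeholders (roughly the ratios
# of the Table 1 means); in practice alpha_{n,i} is estimated from the
# parallel "neutral"/emotional training corpus.

NEUTRAL = {"f0_top": 181.0, "f0_bottom": 86.0, "f0_mean": 135.0,
           "dur_syllable": 169.0, "energy": 65.0}

# Hypothetical transform scales alpha[(emotion, level)][parameter]
ALPHA = {
    ("happiness", "normal"): {"f0_top": 1.31, "f0_bottom": 1.27, "f0_mean": 1.24,
                              "dur_syllable": 1.05, "energy": 1.11},
    ("sadness", "normal"):   {"f0_top": 0.78, "f0_bottom": 0.97, "f0_mean": 0.80,
                              "dur_syllable": 1.17, "energy": 0.94},
}

def lmm_convert(x, emotion, level="normal"):
    """Apply y_{n,i} = alpha_{n,i} * x to every prosodic parameter."""
    scales = ALPHA[(emotion, level)]
    return {name: scales[name] * value for name, value in x.items()}

if __name__ == "__main__":
    print(lmm_convert(NEUTRAL, "happiness"))
```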

2.1 Emotion and Jitter

It has also been pointed out that F0 jitter is an important parameter for emotional speech [1]. To measure F0 jitter, a quadratic curve is fitted to a running window of 5 successive F0 values and subtracted from that section of the F0 contour; the jitter is then calculated as the mean period-to-period variation of the residual F0 values. Table 2 shows the results for the emotions in our corpus.

Table 2. The average results of F0 jitter of the emotions

  Emotion           Fear   Sad   Angry   Happy
  F0 jitter (Hz)     6.2   5.9    8.5    12.6

From these results, we can see that "happiness" has the highest F0 jitter, while "neutral" has the minimum value. During speech synthesis, F0 jitter is realized by a random variation in the length of the pitch periods, with an amplitude set according to the jitter parameter value. This random variation is controlled by a white noise signal filtered by a one-pole low-pass filter.
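A rough sketch of this jitter synthesis idea is given below, assuming per-period F0 values as input. The filter coefficient and the way the noise is scaled to the target jitter value are our own assumptions, not the paper's exact settings.

```python
import numpy as np

def jittered_pitch_periods(f0_hz, jitter_hz, pole=0.9, seed=0):
    """Perturb pitch-period lengths with low-pass-filtered white noise.

    f0_hz     : array of per-period F0 values (Hz)
    jitter_hz : target F0 deviation amplitude (Hz), e.g. a Table 2 value
    pole      : one-pole low-pass filter coefficient (assumed value)
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(f0_hz))
    # One-pole low-pass filter: s[k] = pole*s[k-1] + (1 - pole)*noise[k]
    smooth = np.empty_like(noise)
    acc = 0.0
    for k, n in enumerate(noise):
        acc = pole * acc + (1.0 - pole) * n
        smooth[k] = acc
    # Scale the filtered noise so its mean absolute deviation is roughly jitter_hz
    smooth *= jitter_hz / (np.mean(np.abs(smooth)) + 1e-9)
    perturbed_f0 = f0_hz + smooth
    return 1.0 / perturbed_f0          # pitch-period lengths in seconds

periods = jittered_pitch_periods(np.full(200, 168.0), jitter_hz=12.6)
```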

2.2 Emotion and Stresses

There is also a strong relationship between emotions and stresses [13]. Stress here refers to the most prominent element perceived in an utterance, rather than the literal "semantic focus" used to express the speaker's attitude. In principle, the changes of stress from "neutral" speech to emotional speech can be summarized into five types: decreasing, weakening/disappearing, boosting, increasing and shifting. Decreasing means the number of stresses is reduced from "neutral" speech to emotional speech. Weakening/disappearing means all stresses are weakened and some of them are even lost. Boosting means the intensity of stresses is amplified. Increasing means the number of stresses grows. Shifting means the stress locations change across emotions. Some emotions may show more than one of these stress changes. For instance, in "sad" speech stresses may disappear, while a "happy" voice may both increase the number of stresses and magnify them, and "anger" may both shift stresses and amplify the stress located on emotional function words.

2.3 Emotion and Breaks

Compared with the expression of stress across emotions, the changes of prosodic rhythm are less clear, but some points should still be noted. In neutral speech, the most obvious phenomenon in prosodic tempo is that the pitch value becomes higher at the beginning of a prosodic phrase and lower at its end. In emotional speech, however, this rule is sometimes broken under the influence of emotional function words, which are defined as the focus of the emotion: the pitch values of such key words become very high or very low, depending on the emotion. Due to the impact of speaking rate, the number of prosodic breaks may decrease for emotions with a fast speaking rate, such as "anger" and "happiness", and increase for emotions with a slower speaking rate, such as "sadness".

2.4 Emotion and Paralinguistic Information

Although the prosodic function of conveying emotion seems to involve both a linguistic and a paralinguistic component [14], paralinguistic information normally has more influence on the expression of emotion. Distinguishing the contour type from its detailed implementation in terms of pitch level and pitch range may well lead to a distinction between the linguistic and paralinguistic value of intonation variations. This expectation is related to the general assumption of the linguistic value of the contour type and the paralinguistic function of its concrete phonetic realization, such as "grunts", "breathing", etc. Although prosody is influenced by paralinguistic information, the relations between them are very complicated and far from being fully understood. Thus, in our current work, we did not use paralinguistic information in the prosody model, but we labeled it in the corpus for further research.

2.5 Multilayer Annotation

We try to label as many phenomena as possible that are related to linguistic features, utterance expression and emotion, and non-linguistic features, although not all of them can be directly integrated into the system at the same time. The labelled information is separated into the following layers.

Transcription layer. The collected speech was transcribed orthographically with normalized text expression.

Pronunciation layer. This layer records the pinyin of the speech; initials and finals are also listed.

Pitch layer. This layer annotates the detailed pitch marks of the voice.

Segmentation layer. This layer marks the syllable and silence/pause boundaries of each utterance. Initial and final boundaries are also labelled.

Break layer. In our work, we use four types of prosodic boundaries:
• Break0: syllable boundary.
• Break1: prosodic word boundary, a group of syllables that are uttered closely.
• Break2: prosodic phrase boundary, a group of prosodic words with a perceptible rhythm break at the end.
• Break3: sentence boundary, the utterance of a whole sentence.

Stress layer. There are three types of stresses: intonation stress, phrasal stress and (prosodic) word stress.

Paralinguistic information layer. The paralingual and non-lingual phenomena included in the labels are: beep, breathing, crying, coughing, deglutition, hawk, interjection, laughing, lengthening, murmur, noise, overlap, smack, sniffle, sneeze, yawn, etc.

To make the corpus usable for further research, we also adopted layers from a number of previous schemes (Core and Allen, 1997; Di Eugenio et al., 1998; Traum, 1996; Walker et al., 1996; MacWhinney, 1996; Jekat et al., 1995; Anderson et al., 1991; Condon and Cech, 1996; van Vark et al., 1996; Walker and Passonneau, 2001). Layers from different schemes are grouped according to the similar phenomena that they label:

Speech acts. All of the schemes that we examined annotate utterances for their illocutionary force. Since this layer contains most of the information regarding the semantic content of an utterance, it is likely to be where we find the most interesting correlations.

Communication status. Communication status indicates whether an utterance was successfully completed. It is used to tag utterances that are abandoned or unintelligible, not whether the intention of a speech act was achieved.

Topic. Several annotation schemes contain a layer that labels the topic discussed in an utterance, usually in task domains where there is a finite number of subjects that can be discussed.

Phases. Some schemes distinguish between dialogue phases such as opening, negotiation and query. Emotion in dialogue also goes through phases, and it is possible that there are boundaries between the phases of emotion that correspond to those tagged using the phase layer.

Surface form. Surface form tagging is used in David Traum's adaptation of the TRAINS annotation scheme (Traum, 1996) and in the Coconut scheme to tag utterances for certain special features, such as cue words or negation. It has been shown that certain syntactic features of an utterance may be indicators of emotion.
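As an illustration of how such a multilayer annotation might be held in memory, the hypothetical data structure below groups the layers described above. The paper does not prescribe any file format or field names, so these are purely illustrative.

```python
# Hypothetical in-memory representation of the multilayer annotation described
# above; the paper does not specify a format, so all field names are illustrative.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Syllable:
    pinyin: str                    # pronunciation layer
    initial: Optional[str]
    final: str
    start: float                   # segmentation layer, seconds
    end: float
    tone: int
    break_after: int               # break layer: 0..3
    stress: Optional[str] = None   # stress layer: "intonation" | "phrasal" | "word"

@dataclass
class Utterance:
    text: str                                  # transcription layer
    emotion: str                               # e.g. "happiness"
    syllables: List[Syllable] = field(default_factory=list)
    pitch_marks: List[float] = field(default_factory=list)    # pitch layer
    paralinguistic: List[str] = field(default_factory=list)   # e.g. ["laughing"]
    speech_act: Optional[str] = None           # dialogue-level layers
    topic: Optional[str] = None
    phase: Optional[str] = None
```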

3. CART Model Based Prosody Generation

To handle the context information, we use Classification and Regression Trees (CART), which have been used successfully in prosody prediction. The model maps the prosody from "neutral" speech to emotional speech using various kinds of context information. The framework of the model is shown in Fig. 1.

Fig. 1. The framework of CART based emotional prosody conversion: the context information and the "neutral" prosody parameters feed a CART model that predicts the difference between the "neutral" and "emotional" prosody parameters.

In the model, the input context information is classified into two parts. The context part I is the information normally used in traditional speech synthesis. It contains:
• Tone identity (current, previous and following tones, with 5 categories).
• Initial identity (current and following syllables' initial types, with 8 categories).
• Final identity (current and previous syllables' final types, with 4 categories).
• Position in sentence (syllable position in the word, word position in the phrase, and phrase position in the sentence).
• Number (syllable number of the prosodic word, word number of the phrase, and phrase number of the sentence).
• Part of speech (current, previous and following words, with 30 categories).

The context part II contains:
• Break types (intonation phrase boundaries, prosodic phrase boundaries and prosodic word boundaries).
• Stress types (intonation stress and phrasal stress).
• F0 jitter degree (indicating how strong the F0 jitter is in the emotional speech).

Since there are many changes between "neutral" speech and emotional speech in part II, this information is normally not predicted by the text analysis module but labeled in the input text with markup languages. The output parameters of the model are the differences between the "neutral" and "emotional" prosodic parameters.

As we know, Mandarin is a typical tonal language, in which a syllable with different tone types can represent different morphemes. Several models have been proposed to describe F0 contours, such as the Fujisaki model [15], the Soft Template Mark-Up Language (Stem-ML) model [16], the pitch target model [12] and the Tilt model. In the pitch target model, variations in surface F0 contours result not only from the underlying pitch units (syllables for Mandarin), but also from articulatory constraints. Pitch targets are defined as the smallest operable units associated with linguistically functional pitch units, and these targets may be static (e.g. a register specification, [high] or [low]) or dynamic (e.g. a movement specification, [rise] or [fall]). With these features, we believe the pitch target model is quite suitable for prosody conversion. The output parameters are then the differences of the pitch target parameters a, b, β and λ between the "neutral" and "emotional" speech. Let the syllable boundary be [0, D]. The pitch target model uses the following equations [17]:

T(t) = at + b                                          (2)

y(t) = β exp(−λt) + at + b                             (3)

0 ≤ t ≤ D,  λ ≥ 0

where T(t) is the underlying pitch target and y(t) is the surface F0 contour. The parameters a and b are the slope and intercept of the underlying pitch target, respectively; together they describe an intended intonational goal of the speaker, which can be very different from the surface F0 contour. The coefficient β measures the distance between the F0 contour and the underlying pitch target at t = 0, and λ describes how fast the underlying pitch target is approached: the greater the value of λ, the faster the speed. The pitch target model of one syllable can thus be represented by a set of parameters (a, b, β, λ). As described in [17], (a, b, β, λ) can be estimated by a nonlinear regression process with expected-value parameters at the initial and middle points of each syllable's F0 contour; the Levenberg-Marquardt algorithm [17] is used for this estimation.

The Wagon toolkit [19], which provides full CART functionality, was used in our work. Source and target pitch contours from the parallel corpus are aligned according to the labelled syllable boundaries, pitch target parameters are extracted from each syllable's pitch contour, and mapping functions of the parameters a, b, β and λ are then estimated using CART regression. In total, four CART models were trained, one for each "neutral"-to-emotion mapping. For conversion, the pitch target parameters estimated from the source pitch contour are transformed by the mapping functions obtained in training, and the converted pitch target parameters then generate new pitch contours associated with the target characteristics.
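The sketch below illustrates the two steps of this conversion under stated assumptions: regenerating a syllable's F0 contour from pitch-target parameters with Equations (2)-(3), and learning the difference of one parameter with a regression tree. scikit-learn's DecisionTreeRegressor is used here only as a stand-in for the Wagon toolkit, and the feature vectors and parameter values are placeholders, not data from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in for the Wagon CART toolkit

def pitch_target_f0(a, b, beta, lam, duration, n=50):
    """Surface F0 over one syllable, Eq. (3): y(t) = beta*exp(-lam*t) + a*t + b."""
    t = np.linspace(0.0, duration, n)
    return beta * np.exp(-lam * t) + a * t + b

# Hypothetical training data: one row of context features per syllable, and the
# observed difference of one pitch-target parameter between neutral and emotional speech.
X_train = np.random.rand(500, 12)          # placeholder context features (part I + II)
delta_b_train = np.random.randn(500) * 10  # placeholder differences of parameter b

cart_b = DecisionTreeRegressor(min_samples_leaf=10).fit(X_train, delta_b_train)

# Conversion: predict the difference, add it to the neutral parameter,
# then regenerate the syllable's F0 contour.
neutral = {"a": 20.0, "b": 130.0, "beta": 15.0, "lam": 30.0}   # illustrative values
context = np.random.rand(1, 12)
converted_b = neutral["b"] + cart_b.predict(context)[0]
f0 = pitch_target_f0(neutral["a"], converted_b, neutral["beta"], neutral["lam"], 0.17)
```

In the actual system one such mapping would be trained per pitch-target parameter and per emotion, giving the four "neutral"-to-emotion CART models described above.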

4. Experiments and Discussion

We used the STRAIGHT [18] model as the acoustic model to generate the emotional speech output with the above CART based prosody model. No specific voice quality control was applied at the acoustic level. A prosody conversion example is given in Fig. 2.

Fig. 2. An example of F0 conversion using the pitch target model, for the "neutral" to "happiness" conversion with both the context part I and the context part II.

Eight listeners were asked to give a subjective evaluation of the test sentences. Two methods were used to evaluate the proposed emotional conversion:
• ABX test: the ABX test commonly used to evaluate voice conversion. All listeners are required to judge whether a converted utterance X sounds closer to the source neutral utterance A or to the target emotional utterance B. This test confirms whether the conversion itself is successful.
• EVA test: only the converted utterances are presented, and the listeners then name the associated emotional state. This test confirms whether the emotional conversion is perceptually successful.

The results of the evaluation are shown in Fig. 3 and Fig. 4, in which the Y axis is the emotional state and the X axis is the mean correct rate over all listeners (the ratio of judging X as B in the ABX test, and of identifying the converted speech as the intended emotion in the EVA test). The ABX test shows that the converted emotional speech carries the intended emotional state compared with the source speech. Because listeners can compare the converted speech with the source "neutral" speech in the ABX test, its results are better than those of the EVA test. There are also differences among the emotional conversions: the "neutral-sadness" and "neutral-fear" conversions are the best and the worst, respectively. With only the context part I, both "fear" and "happiness" are very hard to simulate, while "sadness" is a little easier, since "sadness" is mainly related to general prosodic features such as a narrow pitch range, a low pitch level and a slow speed.

Comparing Fig. 3 and Fig. 4 shows that the emotional prosody output is better when all of the context information is used than with the context part I only, since more detailed prosodic information is then taken into account. None of the conversions obtains a full score in the listening tests. Part of the reason might be the lack of voice quality control; the size of the corpus might be another issue. Further work will be based on our newly collected, larger corpus (with 2,000 sentences for each emotion), and more detailed acoustic analysis and voice quality control will also be considered.
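For clarity, a minimal sketch of how the per-emotion correct rates could be aggregated from listener judgements is given below; the listener responses shown are made-up placeholders, not the paper's data.

```python
# Aggregating ABX/EVA judgements into per-emotion correct rates (placeholder data).
from collections import defaultdict

def correct_rate(responses):
    """responses: list of (emotion, is_correct) pairs over all listeners and sentences."""
    hits, totals = defaultdict(int), defaultdict(int)
    for emotion, ok in responses:
        totals[emotion] += 1
        hits[emotion] += int(ok)
    return {e: hits[e] / totals[e] for e in totals}

# ABX: correct means the listener judged the converted speech X closer to B (target emotion).
abx = [("sadness", True), ("sadness", True), ("fear", False), ("happiness", True)]
# EVA: correct means the listener labeled the converted speech with the intended emotion.
eva = [("sadness", True), ("sadness", False), ("fear", False), ("happiness", True)]

print(correct_rate(abx), correct_rate(eva))
```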

Fig. 3. Results of the emotional conversion evaluation with the context part I only (ABX and EVA correct rates for each emotion).

Fig. 4. Results of the emotional conversion evaluation with both the context part I and the context part II (ABX and EVA correct rates for each emotion).

5. Conclusion

When generating expressive speech synthesis, it is tempting to rely directly on the acoustic patterns of emotional speech through a linear modification approach. However, without a more detailed description of how these acoustic patterns are distributed, it is hard to synthesize speech that is more or less expressive. To address this problem, the paper proposes a CART-based method. Unlike the linear modification method, the CART model efficiently maps the subtle prosody distributions between neutral and emotional speech and allows us to integrate linguistic features into the mapping. A pitch target model designed to describe Mandarin F0 contours was also introduced. The experimental results show that the CART method produces good emotional speech output. They also show that, with the CART model, the traditional context information used for normal speech synthesis is able to generate good prosody output for some emotion states, such as "sadness", while the results can be much improved when richer information, such as stress, break and jitter information, is integrated into the context. The methods discussed in the paper provide ways to generate emotional speech in speech synthesis; however, there is still much work to be done in the future.

References

1. Murray, I. and Arnott, J. L., "Toward the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion", Journal of the Acoustical Society of America, 1993, pp. 1097-1108.
2. Stibbard, R. M., "Vocal Expression of Emotions in Non-laboratory Speech: An Investigation of the Reading/Leeds Emotion in Speech Project Annotation Data", PhD Thesis, University of Reading, UK, 2001.
3. McGilloway, S., Cowie, R., Douglas-Cowie, E., Gielen, S., Westerdijk, M. and Stroeve, S., "Approaching Automatic Recognition of Emotion from Voice: A Rough Benchmark", 2000.
4. Amir, N., "Classifying emotions in speech: a comparison of methods", Holon Academic Institute of Technology, EUROSPEECH 2001, Scandinavia.
5. Mozziconacci, S. J. L. and Hermes, D. J., "Expression of emotion and attitude through temporal speech variations", ICSLP 2000, Beijing, 2000.
6. Cahn, J. E., "The generation of affect in synthesized speech", Journal of the American Voice I/O Society, vol. 8, July 1990.
7. Campbell, N., "Synthesis Units for Conversational Speech - Using Phrasal Segments", http://feast.atr.jp/nick/refs.html
8. Schröder, M. and Breuer, S., "XML Representation Languages as a Way of Interconnecting TTS Modules", Proc. ICSLP, Jeju, Korea, 2004.
9. Eide, E., Aaron, A., Bakis, R., Hamza, W., Picheny, M. and Pitrelli, J., "A corpus-based approach to expressive speech synthesis", IEEE Speech Synthesis Workshop, Santa Monica, 2002.
10. Chuang, Z.-J. and Wu, C.-H., "Emotion Recognition from Textual Input using an Emotional Semantic Network", ICSLP 2002, Denver, 2002.
11. Tao, J., "Emotion control of Chinese speech synthesis in natural environment", EUROSPEECH 2003, pp. 2349-2352.
12. Xu, Y. and Wang, Q. E., "Pitch targets and their realization: Evidence from Mandarin Chinese", Speech Communication, vol. 33, pp. 319-337, 2001.
13. Li, A. and Wang, H., "Friendly Speech Analysis and Perception in Standard Chinese", ICSLP 2004, Korea, 2004.
14. Laver, J., "The phonetic description of paralinguistic phenomena", XIIIth International Congress of Phonetic Sciences, Stockholm, Sweden, Supplement, 1-4, 1995.
15. Fujisaki, H. and Hirose, K., "Analysis of voice fundamental frequency contours for declarative sentences of Japanese", J. Acoust. Soc. Jpn. (E), 5(4):233-242, 1984.
16. Kochanski, G. P. and Shih, C., "Stem-ML: Language independent prosody description", 6th International Conference on Spoken Language Processing, Beijing, China.
17. Sun, X., The Determination, Analysis, and Synthesis of Fundamental Frequency, Ph.D. thesis, Northwestern University, 2002.
18. Kawahara, H. and Akahane-Yamada, R., "Perceptual Effects of Spectral Envelope and F0 Manipulations Using STRAIGHT Method", J. Acoust. Soc. Am., Vol. 103, No. 5, Pt. 2, 1aSC27, p. 2776, 1998.
19. The Wagon CART tool, Edinburgh Speech Tools, http://festvox.org/docs/speech_tools-1.2.0/x3475.htm
20. Campbell, N., "Getting to the Heart of the Matter: Speech is More than just the Expression of Text or Language", LREC, 2001.