In F. Sandoval et al. (Eds.): Computational and Ambient Intelligence, 9th International Work-Conference on Artificial Neural Networks, IWANN 2007, San Sebastián, Spain, June 2007, Proceedings. Vol. 4507 of Lecture Notes in Computer Science. Springer. pp. 646--653. DOI: 10.1007/978-3-540-73007-1_78 The original publication is available at www.springerlink.com

Validation of an Expressive Speech Corpus by Mapping Automatic Classification to Subjective Evaluation⋆

Ignasi Iriondo, Santiago Planet, Francesc Alías, Joan-Claudi Socoró, Elisa Martínez

Department of Communications and Signal Theory
Enginyeria i Arquitectura La Salle, Ramon Llull University
Pg. Bonanova 8, 08022 Barcelona, Spain
{iriondo, splanet, falias, jclaudi, elisa}@salle.url.edu

Abstract. This paper presents the validation of the expressive content of an acted corpus produced to be used in speech synthesis. Acted speech can be rather lacking in authenticity, and therefore the validation of its expressiveness is required. The goal is to obtain an automatic classifier able to prune the bad utterances, i.e. those with wrong expressiveness. Firstly, a subjective test has been conducted with almost ten percent of the corpus utterances. Secondly, objective techniques have been applied by means of automatic identification of emotions using different algorithms over statistical features computed from the speech prosody. The relationship between both evaluations is established by an attribute selection process guided by a metric that measures the match between the utterances misclassified by the listeners and those misclassified by the automatic process. The experiments show that this approach can be useful to provide a subset of utterances with poor or wrong expressive content.

1 Introduction

There is a growing tendency towards the use of speech in human-machine interaction. The incorporation of the recognition of emotional states or the synthesis of emotional speech can improve communication by making it more natural [2]. Therefore, one of the most important challenges in the study of expressive speech is the development of oral corpora with authentic emotional content. The naturalness of the utterances will depend on the strategy followed to obtain the emotional speech. The main debate is centred on the compromise between authenticity and the degree of control over the recording. Campbell [2] and later Schröder [12] propose four emotional speech sources: i) Natural occurrences obtained from spontaneous human interaction, which present the most natural emotional speech, although this approach has some drawbacks due to the lack of control over its content, i.e. the quality of the sound and the difficulty of labelling

⋆ This work has been partially supported by the European Commission, project SALERO FP6 IST-4-027122-IP.

[6]; ii) The provocation of authentic emotions (elicitation) in people in the laboratory, which is a way of compensating for some of the problems described previously, although full-blown emotions would remain out of reach [2, 12]; iii) Stimulated emotional speech, obtained by reading texts whose verbal content is adapted to the target emotion. The difficulty of comparing utterances with different texts should be counteracted by an increase of the corpus size, so that statistical methods can generalize models; and iv) Acted emotional speech, obtained by reading the same sentences with different emotions, which provides direct comparisons of phonetics, prosody and voice quality. However, the greatest objection is the lack of authenticity in the expressed emotion [2].

Another important aspect to keep in mind is the purpose of the speech and emotion research. It is necessary to distinguish between processes of perception (centred on the speaker) and expression (centred on the listener) [12]. The objective of the former is to establish the relationship between the speaker's emotional state and quantifiable parameters of speech; such studies usually deal with the recognition of emotions from the speech signal. According to [5], one of the challenges is the identification of oral indicators attributable to the emotional behaviour which are not simply inherent characteristics of conversational speech. The latter shape the parameters of speech with the goal of transmitting a certain emotional state. The databases used for emotional speech synthesis are usually based on acted speech [6], where a professional speaker reads a set of texts (with or without emotional content) simulating the emotions to be produced.

This work combines methods of both types of studies. On the one hand, the production of the corpus follows the guidelines of the studies centred on the listener, since it is oriented to speech synthesis. On the other hand, we apply emotion recognition techniques in order to validate its expressive content. It is not the objective of the present work to carry out an exhaustive summary of the available databases for the study of emotional speech, since complete surveys have recently appeared in the literature [5, 3, 14].

This paper describes the main aspects of the production of an expressive speech corpus in Spanish to be used for synthesis purposes. The main contribution is the mapping between the objective and subjective evaluations of the emotional content in order to automatically prune the bad utterances of the database (i.e. those whose expressed emotion/style differs from their tag). Section 2 introduces different remarks about the production of expressive speech databases. Section 3 presents the subjective test and the guidelines of the automatic classification used. Section 4 details the validation process based on supervised classification guided by the results of the subjective listening test. Finally, the conclusions and future work are described in Section 5.

2 Our expressive speech database in Spanish

2.1 Corpus production

We have developed a new expressive oral corpus for Spanish oriented to expressive speech synthesis with a twofold purpose: firstly, it will be used in the acoustic

modelling (prosody and voice quality) of the emotional speech and, secondly, it will be the speech unit database for the synthesizer. For the recording of the present corpus, a female professional speaker has been chosen due to her ability to apply the suitable expressive style to each text subgroup. For the design of texts semantically related to different expressive styles, we have made use of an existing textual database of advertisements extracted from newspapers and magazines. Based on a study of the voice in audio-visual publicity [9], five categories of the textual corpus have been chosen and the most suitable emotion/style has been assigned to each of them: new technologies (neutral-mature), education (joy-elation), cosmetics (sensual-sweet), automobiles (aggressive-hard) and travel (sad-melancholic).

The recorded database has 4638 phrases and is 5 h 12 min long. From each category, a set of phrases has been selected by means of a greedy algorithm [7] that allowed the selection of phonetically balanced sentences from each subcorpus. In addition to seeking phonetic balance, phrases containing exceptions (foreign words, abbreviations) have been discarded because they hinder the automatic phonetic transcription and labelling processes. Moreover, the selection of phrases similar to those already selected has been penalized by the greedy algorithm.

To segment the corpus into phrases, a semiautomatic process has been conducted, followed by manual review and correction. The semiautomatic process is based on a forced alignment with Hidden Markov Models from the phonetic transcription. This forced alignment has also been used to segment the phrases into phonemes.
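The greedy selection procedure itself is not detailed here beyond the reference to [7]. Purely as an illustration, the following Python sketch shows one plausible variant of such a loop; the helpers phonemes_of (phonetic transcription of a sentence) and similarity (a 0-1 score used to penalize near-duplicate phrases), as well as the penalty weighting, are hypothetical and not taken from the paper.

```python
from collections import Counter

def greedy_select(sentences, phonemes_of, similarity, target_size, penalty=0.5):
    """Iteratively pick the sentence that adds the most new phonetic coverage,
    while penalizing sentences similar to those already selected."""
    selected, coverage = [], Counter()
    while len(selected) < target_size and sentences:
        def gain(s):
            new_phones = sum(1 for p in phonemes_of(s) if coverage[p] == 0)
            sim = max((similarity(s, t) for t in selected), default=0.0)
            return new_phones - penalty * sim * len(phonemes_of(s))
        best = max(sentences, key=gain)
        selected.append(best)
        coverage.update(phonemes_of(best))
        sentences.remove(best)
    return selected
```

In practice, the coverage criterion would be defined over the phoneme (or diphone) distribution targeted for each subcorpus.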

2.2 Acoustic analysis

Prosodic features of speech (fundamental frequency, energy, duration of phones and frequency of pauses) are related to the vocal expression of emotion [4]. In this work, an automatic acoustic analysis of the sentences is performed using the information of the previous phonetic segmentation. The analysis of the fundamental frequency (F0) parameters is based on the output of the pitch marker described in [1]. This system has the particularity of assigning marks over the whole signal: unvoiced segments and silences are marked using values interpolated from the neighbouring voiced segments. For energy, speech is processed with 20 ms rectangular windows and 50% overlap, computing the mean energy in decibels (dB) every 10 ms. Although some studies omit rhythm parameters due to the difficulty of obtaining them automatically, we have incorporated this information (thanks to the labelling of the corpus) using the z-score as a means to analyse the temporal structure of speech [13]. Moreover, two pause-related parameters are added for each utterance: the number of pauses per time unit and the percentage of silence with respect to the total time.
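As an illustration of the energy and pausing analysis described above (20 ms rectangular windows with 50% overlap, i.e. one value every 10 ms, plus the two pause-related parameters), a minimal NumPy sketch follows; the silence label "sil" and the exact framing details are assumptions, not the paper's implementation.

```python
import numpy as np

def frame_energy_db(signal, fs, win_ms=20, hop_ms=10):
    """Mean energy in dB over 20 ms rectangular windows, one value every 10 ms."""
    signal = np.asarray(signal, dtype=float)
    win, hop = int(fs * win_ms / 1000), int(fs * hop_ms / 1000)
    starts = range(0, len(signal) - win + 1, hop)
    energy = np.array([np.mean(signal[i:i + win] ** 2) for i in starts])
    return 10.0 * np.log10(energy + 1e-12)  # small offset avoids log(0)

def pause_parameters(segments, total_dur):
    """Pauses per second and percentage of silence over the whole utterance.

    segments: list of (label, start_s, end_s) from the phonetic segmentation,
    where "sil" is assumed to be the silence/pause label."""
    pauses = [(s, e) for lab, s, e in segments if lab == "sil"]
    return len(pauses) / total_dur, 100.0 * sum(e - s for s, e in pauses) / total_dur
```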

3 Expressiveness validation

In this section, the approach used to validate the expressiveness of the corpus utterances is presented. Firstly, the subjective test is summarized. Secondly, we explain the mapping of the automatic identification of emotions to the listening test results. Finally, a method to improve the corpus by pruning the bad utterances (from an expressiveness point of view) is presented. These bad utterances could cause problems in the following steps related to expressive speech synthesis (i.e. acoustic modelling of expressive speech and synthetic signal generation).

3.1 Subjective test

A subjective evaluation is a tool that allows the expressiveness of acted speech to be validated from a listener's point of view. An exhaustive evaluation of the whole corpus would be excessively tedious (the corpus has 4638 utterances to be validated), so 96 utterances have been chosen for each style. A forced-answer test has been designed with the question: What emotional state do you recognize from the voice of the speaker in this phrase? The possible answers are the 5 styles of the corpus plus one additional option, Don't know / Another (Dk/A), in order to avoid biasing the results in confusing cases. Adding this option carries the risk that some evaluators use it excessively to finish the test sooner [10]; however, this effect has not been observed in this test. The evaluators were 30 volunteers from Enginyeria i Arquitectura La Salle with a quite heterogeneous profile.

The subjective test achieved an average classification accuracy of 87%. The confusion matrix (Table 1) shows that the sad style (SAD) was the best rated (98.5% on average), followed by sensual (SEN) (87.2%) and neutral (NEU) (86.1%), and finally happy (HAP) (81.9%) and aggressive (AGR) (81.6%). The main confusion was between AGR and HAP. Moreover, SEN was slightly misclassified as SAD or NEU. The Dk/A option was hardly used, although it was more frequent for NEU and SEN than for the rest of the styles.

Table 1. Average confusion matrix for the subjective test.

Answers →   AGR     HAP     NEU     SEN     SAD     Dk/A
AGR         81.6%   15.5%   1.6%    0.1%    0.1%    1.2%
HAP         15.1%   81.9%   1.6%    0.2%    0.1%    1.2%
NEU         5.7%    1.5%    86.1%   3.4%    0.8%    2.5%
SEN         0.0%    0.1%    4.2%    87.2%   6.0%    2.6%
SAD         0.0%    0.0%    0.4%    1.0%    98.5%   0.1%

From the subjective test results (see Figure 1), two simple rules have been established in order to decide whether an utterance was not optimally performed by the professional speaker. These rules eliminate the extreme and ambiguous cases shown in both histograms through the manual tuning of two thresholds: utterances with an identification percentage lower than 50% or a Dk/A percentage larger than 12% can be regarded as outliers according to the histograms.

Fig. 1. Histograms for the correct identification and 'Don't know / Another' percentages in the subjective test (x-axes: % correct class identification and % Dk/A; y-axes: frequency of utterances).

In the test subset, there are 33 utterances out of 480 (96 per style) that satisfy at least one rule. This means a reduction of 6.88% due to the lack of adequate expressiveness.
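The two manually tuned rules can be encoded directly. The sketch below is only illustrative; the utterance identifiers and percentages are hypothetical examples.

```python
def is_outlier(correct_pct, dka_pct, id_thr=50.0, dka_thr=12.0):
    """An utterance is rejected if it falls in either histogram tail."""
    return correct_pct < id_thr or dka_pct > dka_thr

# results: utterance id -> (% correct identification, % Dk/A) from the listening test
results = {"utt_0001": (93.3, 3.3), "utt_0002": (43.3, 16.7)}  # hypothetical values
rejected = [utt for utt, (c, d) in results.items() if is_outlier(c, d)]
print(rejected)  # -> ['utt_0002']
```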

3.2 Statistical analysis and datasets

In [8], we performed an experiment on automatic emotion identification covering different datasets of statistical prosodic features and several algorithms. Initially, we started from a dataset with 464 attributes per utterance, which was divided into different subsets according to several strategies to reduce its dimensionality. The experiments showed almost the same results for the full set of attributes and for a dataset reduced to 68 parameters; therefore, the latter has been chosen for this work. In this dataset, the prosody of an utterance is represented by the vectors of logF0, energy in dB and normalized durations (z-score). For each sequence, the first derivative is calculated. For all the resulting sequences, the following statistics are obtained: mean, variance, maximum, minimum, range, skewness, kurtosis, quartiles and interquartile range. Thus, 68 parameters per utterance are computed, including the two pause-related parameters previously described in Section 2.2.
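Assuming the quartiles above are Q1, the median and Q3, the feature vector can be assembled as 11 statistics over each of the three prosodic sequences and their first derivatives, plus the two pause parameters (3 × 2 × 11 + 2 = 68). The following SciPy/NumPy sketch illustrates this composition; it is a reconstruction from the description, not the original feature extraction code.

```python
import numpy as np
from scipy import stats

def sequence_stats(x):
    """Mean, variance, max, min, range, skewness, kurtosis, Q1, median, Q3 and IQR."""
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return [x.mean(), x.var(), x.max(), x.min(), np.ptp(x),
            stats.skew(x), stats.kurtosis(x), q1, q2, q3, q3 - q1]

def utterance_features(logf0, energy_db, zscore_dur, pauses_per_s, silence_pct):
    """68 attributes: 11 statistics x (3 sequences + 3 first derivatives) + 2 pause features."""
    feats = []
    for seq in (logf0, energy_db, zscore_dur):
        feats += sequence_stats(seq)
        feats += sequence_stats(np.diff(seq))  # first derivative of the sequence
    feats += [pauses_per_s, silence_pct]
    return np.array(feats)  # shape (68,)
```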

3.3 Supervised classification

Numerous automatic learning schemes can be used for a task such as classifying the style/emotion from the acoustic analysis of speech. The objective evaluation of expressiveness in our speech corpus is based on [11], where a large-scale data mining experiment on the automatic recognition of basic emotions in short utterances was conducted. In [8], twelve machine learning algorithms were tested with different datasets. All the experiments were carried out with the Weka software [15] using ten-fold cross-validation. A very high identification rate was achieved. For instance, on average over all the datasets, SMO (the Weka support vector machine) obtained the best results (∼97%), followed by Naïve-Bayes (NB) with 94.6% and J48 (the Weka decision tree based on C4.5)

with 93.5%. With these results, we could conclude that, in general, the styles of the developed acted speech corpus can be clearly differentiated. Moreover, the results of the subjective test showed a good credibility of the expressive speech content. However, we considered it necessary to go one step further by developing a method to validate each utterance of the corpus following subjective criteria, and not only the automatic classification from the attribute space.
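The experiments in [8] were run with Weka (Java). As a hedged illustration of an equivalent setup, the sketch below performs ten-fold cross-validation with a linear support vector machine in scikit-learn, standing in for SMO; the feature matrix X (one 68-dimensional row per utterance) and the style labels y are assumed to come from the analysis in Section 3.2.

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def style_identification_rate(X, y):
    """Ten-fold cross-validated accuracy over the 68 prosodic attributes.

    X: (n_utterances, 68) feature matrix; y: style label per utterance."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    return cross_val_score(clf, X, y, cv=10).mean()
```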

4 Subjective-based attribute selection

At this point, the goal is to find the optimum classifier (algorithm and set of attributes) able to follow the subjective criteria observed previously. Our approach consists of an attribute selection method guided by the results obtained in the subjective listening test. Once different selection strategies and algorithms have been tested, the best combination will be applied to the whole corpus in order to generate a list of candidate utterances to be pruned.

Two attribute selection methods have been developed in order to determine the subset of attributes that best maps the subjective test results. As previously mentioned, the original set has 68 attributes per utterance, so an exhaustive search of subsets is impractical; greedy search procedures are used instead, which are guaranteed to find a locally optimal set of attributes [15]. On the one hand, we have chosen Forward Selection (FW), which starts without any attribute and adds them one at a time; on the other hand, Backward Elimination (BW), which starts from the full set and deletes attributes one at a time. In each iteration, the classifier is tested with the 480 utterances described in Section 3.1, having been trained with the 4158 remaining utterances. The test utterances are classified into one of the 5 predefined styles, and the misclassified cases take part in the evaluation of the attribute subset involved. The novelty of this process is the use of a subjective-based measure to evaluate the expected performance of the subset in each iteration: the F1 score computed from the precision and the recall of the misclassified utterances compared with the 33 utterances rejected in the subjective test.
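A compact sketch of the forward-selection loop described above: the score of a candidate attribute subset is the F1 between the utterances misclassified by the automatic classifier on the 480-utterance test set and the 33 utterances rejected by the listening test. The function misclassified_with, which trains and tests the classifier on a given attribute subset, and the stopping rule are assumptions for illustration.

```python
def subjective_f1(auto_wrong, subj_rejected):
    """F1 between the automatically misclassified set and the subjectively rejected set."""
    coincident = len(auto_wrong & subj_rejected)
    if coincident == 0:
        return 0.0
    precision = coincident / len(auto_wrong)
    recall = coincident / len(subj_rejected)
    return 2 * precision * recall / (precision + recall)

def forward_selection(all_attrs, misclassified_with, subj_rejected):
    """Greedy forward selection guided by the subjective-based F1 score.

    misclassified_with(attrs) -> set of test-utterance ids misclassified by the
    classifier when trained and tested using only the attributes in `attrs`."""
    selected, best_f1 = [], 0.0
    remaining = list(all_attrs)
    while remaining:
        f1, attr = max((subjective_f1(misclassified_with(selected + [a]), subj_rejected), a)
                       for a in remaining)
        if f1 <= best_f1:  # one possible stopping rule: no attribute improves the match
            break
        selected.append(attr)
        remaining.remove(attr)
        best_f1 = f1
    return selected, best_f1
```

Backward elimination follows the same pattern, starting from the full set and removing the attribute whose deletion maximizes the same F1 score.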

4.1 Results

The process has been repeated six times: three classification algorithms (SMO, NB and J48) combined with two attribute selection techniques (FW and BW). Figure 2 shows the evolution of the maximum F1 measure according to the best subset of attributes in each iteration. The SMO algorithm obtains F1 = 0.50 with both the FW and BW attribute selection techniques. The results for NB are also similar for both strategies (F1 ≈ 0.42). The main difference appears for J48, which obtains a better F1 with FW than with BW. Moreover, SMO-FW seems to be the most stable configuration, as it achieves the same result over a broad range of attribute subsets (see Table 2). The right column of the table shows the number of utterances rejected by both the automatic and the subjective evaluations (Coincident) with respect to the total number of automatically misclassified utterances.

Fig. 2. Maximum F1 values per number of attributes according to the SMO, Naïve-Bayes and J48 algorithms with forward selection (a) and backward elimination (b) strategies (x-axis: number of attributes; y-axis: F1).

Table 2. Results for the best configuration of each algorithm (SMO, NB, J48) / strategy (FW, BW).

Algorithm / Strategy   Number of attributes   Max F1   Coincident / misclassified
SMO / FW               18-35                  0.50     14/24
SMO / BW               15-16                  0.50     15/27
NB / FW                43-44                  0.42     16/38
NB / BW                47-49                  0.43     13/23
J48 / FW               18                     0.45     18/51
J48 / BW               17-20                  0.37     10/22

J48-FW is the configuration with the highest number of coincidences (18), but it also shows an excessive number of misclassifications overall (51).

5 Conclusions and future work

This paper has described an experiment aimed at the expressive validation of an acted speech corpus. This corpus is too large to be checked exhaustively by several evaluators in a listening test. We have performed a subjective test with 30 subjects and approximately ten percent of the utterances. The results of this test show that some utterances are perceived with a different or poor expressive content with respect to their tagging. We have proposed an automatic classification that tries to learn from the results of the subjective test in order to generalize the solution to the rest of the corpus.

In the next step, the described automatic process should be conducted over the full corpus. For instance, ten-fold cross-validation would be a good technique to cover the whole corpus. The misclassified utterances would be candidates for pruning. A first approach would consist of running different classifiers and

selecting the final candidates by a stacking technique [15]. Also, we will be able to evaluate the suitability of the proposed method by performing acoustic modelling of the emotional speech remaining after pruning and comparing it with the results obtained with the whole corpus. A lower error in the prosodic estimation would confirm the hypothesis that it is advisable to remove the bad utterances from the corpus beforehand.

References

1. Alías, F., Monzo, C., Socoró, J.C.: A pitch marks filtering algorithm based on restricted dynamic programming. In: InterSpeech 2006 - International Conference on Spoken Language Processing (ICSLP) (2006) 1698–1701
2. Campbell, N.: Databases of emotional speech. In: Proceedings of the ISCA Workshop on Speech and Emotion (2000) 34–38
3. Cowie, R., Douglas-Cowie, E., Cox, C.: Beyond emotion archetypes: databases for emotion modelling using neural networks. Neural Networks 18 (2005) 371–388
4. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J.G.: Emotion recognition in human computer interaction. IEEE Signal Processing Magazine 18 (2001) 33–80
5. Devillers, L., Vidrascu, L., Lamel, L.: Challenges in real-life emotion annotation and machine learning based detection. Neural Networks 18 (2005) 407–422
6. Douglas-Cowie, E., Campbell, N., Cowie, R., Roach, P.: Emotional speech: towards a new generation of databases. Speech Communication 40 (2003) 33–60
7. François, H., Boëffard, O.: The greedy algorithm and its application to the construction of a continuous speech database. In: Proceedings of LREC. Volume 5. (2002) 1420–1426
8. Iriondo, I., Planet, S., Socoró, J., Alías, F.: Objective and subjective evaluation of an expressive speech corpus. In: ITRW on Non Linear Speech Processing (NOLISP), Paris, France (2007)
9. Montoya, N.: El papel de la voz en la publicidad audiovisual dirigida a los niños. Zer. Revista de estudios de comunicación (1998) 161–177 (in Spanish)
10. Navas, E., Hernáez, I., Luengo, I.: An objective and subjective study of the role of semantics and prosodic features in building corpora for emotional TTS. IEEE Transactions on Audio, Speech and Language Processing 14 (2006)
11. Oudeyer, P.Y.: The production and recognition of emotions in speech: features and algorithms. International Journal of Human Computer Interaction 59 (2003) 157–183. Special issue on Affective Computing
12. Schröder, M.: Speech and Emotion Research: An Overview of Research Frameworks and a Dimensional Approach to Emotional Speech Synthesis. PhD thesis, PHONUS 7, Research Report of the Institute of Phonetics, Saarland University (2004)
13. Schweitzer, A., Möbius, B.: On the structure of internal prosodic models. In: Proceedings of the 15th ICPhS, Barcelona (2003) 1301–1304
14. Ververidis, D., Kotropoulos, C.: Emotional speech recognition: resources, features, and methods. Speech Communication 48 (2006) 1162–1181
15. Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. 2nd edn. Morgan Kaufmann, San Francisco (2005)