SPEECH EMOTION RECOGNITION USING AN ENHANCED CO-TRAINING ALGORITHM

Jia Liu, Chun Chen, Jiajun Bu, Mingyu You
Jianhua Tao
College of Computer Science Zhejiang University Hangzhou, P.R. China, 310027 {liujia, chenc, bjj, roseyoumy}@zju.edu.cn
National Lab of Pattern Recognition Chinese Academy of Sciences Beijing, P.R. China, 100080
[email protected]
ABSTRACT
In previous speech emotion recognition systems, supervised learning is frequently employed to train classifiers on large sets of labeled examples. However, labeling abundant data requires considerable time and human effort. This paper presents an enhanced co-training algorithm that exploits a large amount of unlabeled speech utterances to build a semi-supervised learning system. It uses two conditionally independent attribute views (i.e. temporal features and statistic features) of the unlabeled examples to augment a much smaller set of labeled examples. Our experimental results demonstrate that, compared with a method based on supervised training, the proposed system achieves a 9.0% absolute improvement in average accuracy on the female model and 7.4% on the male model. Moreover, the enhanced co-training algorithm achieves performance comparable to the co-training prototype while reducing the classification noise produced by erroneous labeling during semi-supervised learning.

1. INTRODUCTION

In everyday interaction among human beings, speech is a primary and efficient way to express emotions. The reactions to other persons' expressions greatly enrich our lives and daily communications. With the development of human-machine intelligent interaction, enabling computers to recognize human emotions from speech has attracted increasing attention in the artificial intelligence field. Over the last few years, the high correlation between speech signals and emotional states has been revealed and applied in many studies on speech emotion recognition [1, 2, 3, 4]. These systems were based on supervised learning and followed very similar workflows, differing mainly in their emotion corpora, extracted features and classifiers. Intuitively, an emotion corpus is the foundation stone of each system, and plentiful labeled data contribute to better performance. Nevertheless, labeling abundant examples is expensive and time consuming since it requires substantial human annotation. In contrast, unlabeled examples are often readily available and adequate, so a methodology called 'semi-supervised learning' has been developed for its potential to reduce the need for expensive labeled data. In the area of classification, many semi-supervised learning algorithms have been proposed, including EM with generative mixture models [5], co-training [6, 7] and a variety of graph-based methods [8, 9]. Much of this work has focused on text classification [5, 8] and web page categorization [6, 9]. However, there are few studies applying semi-supervised learning to speech emotion recognition; Maeireizo et al. [7] utilized unlabeled examples with a co-training method to perform a binary 'Emotional or Non-emotional' classification of dialogues.

The co-training prototype proposed by Blum and Mitchell [6] is a prominent achievement in semi-supervised learning. It initially trains two classifiers on distinct attribute views of a small set of labeled data. Each view is required to be conditionally independent of the other and sufficient for learning a good classifier. Then, iteratively, each classifier's most confident predictions on unlabeled examples are selected to augment the training set. This co-training algorithm and its variations [10] have been applied in many application areas because of their theoretical justification and empirical success.

In this paper, we concentrate on six basic emotions and propose a speech emotion recognition system using an enhanced co-training algorithm. The next section describes the temporal features, statistic features and their corresponding classifiers used in the system. Section 3 presents the details of our enhanced co-training algorithm. A brief description of the emotion corpora used in our experiments and a performance comparison of three methods are given in Section 4. Section 5 concludes this paper.
2. FEATURES AND CLASSIFIERS

For the task of speech emotion classification, temporal features and statistic features have been adopted in most previous systems [1, 2, 3, 4]. Approaches based on temporal features analyzed each utterance in frames to obtain instantaneous information and modelled the frame-based feature vector sequences with a Hidden Markov Model (HMM) classifier.
On the other hand, systems using statistic features calculated the statistics of some frame-based parameters to form a fixed-length feature vector as the representation of an utterance, and then performed recognition with discriminative classifiers such as the Support Vector Machine (SVM), the Gaussian Mixture Model (GMM), etc. In [11], Cowie et al. indicated that these two types of feature representation are associated with different aspects of emotion, and more detailed analyses in [2, 4, 12] demonstrated that either representation conveys enough information to achieve good performance in speech emotion recognition. Temporal features and statistic features therefore satisfy the requirement of two conditionally independent views in the co-training prototype, so we extract both types of features for our enhanced co-training algorithm.

In this paper, the statistic features extracted are the means, standard deviations, maximums and minimums of F0, delta F0, log energy, and the first and second linear prediction cepstral coefficients (LPCC), so each utterance is represented by a 20-dimensional feature vector. Correspondingly, the discriminative classifier employed here is the SVM. To solve the multi-class problem, the SVM is extended with a 'one-against-one' strategy to obtain a multi-SVM classifier, which has empirically performed well in many applications [13]. The extracted temporal features are 12 mel-frequency cepstral coefficients (MFCC), which have often been used in speech recognition and emotion recognition systems. The HMM classifier is then applied to these instantaneous features.
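As a rough illustration of how the 20-dimensional statistic feature vector can be assembled, the following sketch assumes that the frame-level contours (F0, log energy and the first two LPCCs) have already been extracted by a front end; the function name and argument layout are illustrative and not part of the original system.

```python
import numpy as np

def statistic_feature_vector(f0, log_energy, lpcc1, lpcc2):
    """Assemble the utterance-level vector described above: mean, standard
    deviation, maximum and minimum of F0, delta F0, log energy and the first
    two LPCCs (5 contours x 4 statistics = 20 dimensions)."""
    f0 = np.asarray(f0, dtype=float)
    delta_f0 = np.diff(f0, prepend=f0[0])   # frame-to-frame F0 change
    contours = [f0, delta_f0,
                np.asarray(log_energy, dtype=float),
                np.asarray(lpcc1, dtype=float),
                np.asarray(lpcc2, dtype=float)]
    stats = []
    for c in contours:
        stats.extend([c.mean(), c.std(), c.max(), c.min()])
    return np.array(stats)                   # shape (20,), input to the multi-SVM
```

The 12 MFCC temporal features, in contrast, would be kept as a frame-by-frame sequence and passed to the HMM classifier unchanged.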
Table 1. The enhanced co-training algorithm.

Inputs:
  A set of initial labeled examples: L
  A set of unlabeled examples: U
  The maximum number of training iterations: K
  A temporary set of examples: T = ∅
Process:
  Build a classifier: h1 ← HMM(L, x1)
  Build a classifier: h2 ← multi-SVM(L, x2)
  While iterations ≤ K
    Recognize examples: (La1, Prob1) ← h1(U, x1)
    Recognize examples: (La2, Prob2) ← h2(U, x2)
    For each example in U
      If the label predictions in La1 and La2 are identical
        Add the example and its assigned label to T
      End
    End
    For each class Ci, (i = 1, 2, ..., 6)
      Pick Ni examples from T according to Prob1 and denote them by Pi,
      i.e. the examples in Pi are those most confidently labeled by h1 as class Ci
    End
    For each class Ci, (i = 1, 2, ..., 6)
      Pick Ni examples from T according to Prob2 and denote them by Qi,
      i.e. the examples in Qi are those most confidently labeled by h2 as class Ci
    End
    If Pi ∪ Qi, (i = 1, 2, ..., 6) is empty
      Break
    End
    L = L ∪ Pi ∪ Qi, (i = 1, 2, ..., 6)
    U = U − Pi − Qi, (i = 1, 2, ..., 6)
    T = ∅
    Rebuild the classifier: h1 ← HMM(L, x1)
    Rebuild the classifier: h2 ← multi-SVM(L, x2)
  End
Outputs:
  Two trained classifiers: h1 and h2
3. AN ENHANCED CO-TRAINING ALGORITHM

The proposed enhanced co-training algorithm is described in Table 1. Given a set L of labeled training utterances and a set U of unlabeled training utterances, we extract the temporal feature set x1 and the statistic feature set x2 of all these examples. The x1 portion of L is used to train an HMM classifier h1 and the x2 portion to train a multi-SVM classifier h2. Afterwards, the algorithm iterates the following procedure, with the maximum number of iterations set to K = 18. First, the unlabeled examples in U are recognized by h1 and h2 respectively; the label predictions and corresponding class probabilities are recorded in (La1, Prob1) and (La2, Prob2), and the class probabilities Prob1 and Prob2 serve as confidence estimates for the later selection. Secondly, the examples whose label predictions from the two classifiers are identical are picked and added into the temporary set T. This refines the co-training prototype [6], which directly selects the most confidently labeled examples from U rather than from T; the qualification on identical label predictions ensures that the two classifiers agree with each other, even though they may differ considerably in their class probabilities. The selection of examples from T then begins. For each class Ci (i = 1, 2, ..., 6), the Ni examples most confidently labeled as Ci by h1 are chosen and added into the training set L, and a similar selection is performed on T according to Prob2. The values of Ni (i = 1, 2, ..., 6) are determined by the underlying data distribution; we fix them to 5 in the experiments because our corpora are balanced, with an equal number of examples in each class.

When recognizing test data with the output classifiers, both the temporal features and the statistic features of the test examples are extracted first. The two feature sets are then recognized by the trained classifiers h1 and h2 respectively, and the predictions of the two classifiers are combined by multiplying their posterior probabilities to make the final decision.
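The following sketch is one possible realization of the procedure in Table 1 and of the test-time fusion described above. It is not the authors' code: the classifier wrappers h1 and h2 (with fit and predict methods returning labels and a class-posterior matrix), the list/array layout of the two views, and the constant of 5 selected examples per class are assumptions made for illustration.

```python
import numpy as np

def enhanced_cotrain(h1, h2, L1, L2, y, U1, U2,
                     n_classes=6, n_per_class=5, max_iter=18):
    """Enhanced co-training loop: h1 sees the temporal view x1 (frame
    sequences), h2 sees the statistic view x2 (20-dim vectors).  Only
    unlabeled examples on which both classifiers agree form the set T,
    and the most confident ones per class from each view augment L."""
    for _ in range(max_iter):
        h1.fit(L1, y)                           # HMM on view x1
        h2.fit(L2, y)                           # multi-SVM on view x2
        if len(U1) == 0:
            break
        lab1, prob1 = h1.predict(U1)            # labels + per-class posteriors
        lab2, prob2 = h2.predict(U2)
        T = np.where(lab1 == lab2)[0]           # agreement filter (set T)
        picked = set()
        for c in range(n_classes):
            cand = T[lab1[T] == c]
            for prob in (prob1, prob2):         # P_i from h1, Q_i from h2
                conf = prob[cand, c]
                best = cand[np.argsort(-conf)][:n_per_class]
                picked.update(best.tolist())
        if not picked:
            break                               # P_i and Q_i empty for all classes
        picked = sorted(picked)
        L1 = list(L1) + [U1[i] for i in picked]
        L2 = np.vstack([L2, U2[picked]])
        y = np.concatenate([y, lab1[picked]])   # agreed labels become training labels
        keep = [i for i in range(len(U1)) if i not in set(picked)]
        U1 = [U1[i] for i in keep]
        U2 = U2[keep]
    return h1, h2

def fuse_decisions(prob1, prob2):
    """Test-time decision: multiply the two posteriors and take the argmax."""
    return np.argmax(np.asarray(prob1) * np.asarray(prob2), axis=-1)
```

At test time, both feature views of an utterance are extracted, each trained classifier produces its posterior vector, and fuse_decisions implements the multiplicative combination described above.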
4. EXPERIMENTAL RESULTS
4.1. Emotion corpora

The experiments use two emotional speech corpora consisting of short utterances in Mandarin Chinese, one from a female speaker and the other from a male speaker. Since some acoustic parameters of speech are highly sex-dependent, i.e. the acoustic features of female and male speakers differ significantly, we treat the two corpora separately to investigate speaker-dependent emotion classification. Each corpus contains 300 different sentences expressed in each of six emotions (Anger, Fear, Happiness, Neutral, Sadness and Surprise), for a total of 1800 utterances. First, one third of the 1800 utterances are randomly selected for testing and the remaining two thirds are used for training. The training examples are then partitioned into labeled and unlabeled sets: 5% serve as initial labeled data for system initialization, while the remaining 95% are used as unlabeled examples by removing their label information. The entire experiment is repeated five times with random labeled, unlabeled and test data splits, and the recognition system on each corpus is evaluated in terms of average accuracy.
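A minimal sketch of this splitting protocol, assuming the 1800 utterances of one corpus are simply indexed 0..1799; the function and argument names are only illustrative.

```python
import numpy as np

def split_corpus(n_utterances=1800, labeled_frac=0.05, seed=0):
    """One random split: 1/3 of the utterances for testing; of the remaining
    2/3 training utterances, 5% keep their labels and 95% are treated as
    unlabeled.  Repeating with seeds 0..4 gives the five experimental runs."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_utterances)
    n_test = n_utterances // 3
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    n_labeled = int(round(labeled_frac * len(train_idx)))
    return train_idx[:n_labeled], train_idx[n_labeled:], test_idx
```

With 1800 utterances this yields 600 test examples, 60 initially labeled training examples and 1140 unlabeled training examples per split.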
Fig. 1. Average accuracy on test data versus the number of enhanced co-training iterations, for the female and male models.

Table 2. Comparison of average recognition accuracy among supervised training, the enhanced co-training algorithm with K = 18 and the co-training prototype with K = 18.

                            Female model   Male model
Supervised training         66.83%         73.56%
Our enhanced co-training    75.87%         80.93%
Co-training prototype       72.53%         79.20%
4.2. Performance Analysis

In the experiments, we predict the labels of the test data with the classifiers obtained at different numbers of enhanced co-training iterations, and the results are averaged over the five random data splits. Fig. 1 plots the average recognition accuracy on the test data as the number of iterations increases. The accuracy at iteration 0 represents the performance of supervised learning alone, i.e. recognizing the test data with the model built on the initial labeled examples. As the number of iterations increases, the algorithm achieves better performance, although some fluctuation is present at the beginning. The fluctuation is caused by classification noise, i.e. incorrectly labeled unlabeled examples being used to augment the training set L. In our studies, this noise has a significant effect when the number of iterations is less than 5. On the whole, the enhanced co-training algorithm with K = 18 achieves a 9.04% absolute improvement on the female model and 7.37% on the male model over the supervised training method. For comparison, the co-training prototype was also implemented on our corpora. From Table 2, we can see that the enhanced algorithm achieves comparable, slightly better performance than the co-training prototype, since its qualification on identical label predictions decreases the classification noise. In fact, the noise not only affects system performance in the early iterations, but also potentially degrades the performance of the co-training prototype as more and more erroneously labeled examples are included in the training set L. In Table 3, we list, for each iteration, the number of examples selected and added into the set L, together with the number of erroneously labeled examples according to human annotators.
The table shows that the total labeling error rate of our training process is lower than that of the co-training prototype by 5.6% on the female model and 9.1% on the male model.

5. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a speech emotion recognition system based on semi-supervised learning. An enhanced co-training algorithm was presented that utilizes the temporal features and statistic features of plentiful unlabeled examples to reduce the need for expensive labeled data. The experimental results indicated that our method achieved better performance than the method based on supervised learning on both the female and male models. Furthermore, compared with the co-training prototype, the enhanced algorithm with its qualification on identical label predictions produced less classification noise and a small improvement in average accuracy. Although the noise is suppressed to a certain extent, further improvements to the proposed method, as well as bimodal emotion recognition based on semi-supervised learning, are left for future work.
Acknowledgement

This research was supported by the China Postdoctoral Science Foundation (20060401040).
Table 3. The comparison of the classification noise between the co-training prototype and the enhanced co-training algorithm (abbr. EN-CO). The '%' column is the labeling error rate in percent (i.e. % = Err./Add.).

Female model
            Co-training          EN-CO
Iteration   Add.  Err.   %       Add.  Err.   %
1           60    18     30.0    59    13     22.0
2           60    12     20.0    59    14     23.7
3           59    16     27.1    58    12     20.7
4           58    16     27.6    58    13     22.4
5           59    13     22.0    58    12     20.7
6           59    20     33.9    58    11     19.0
7           57    15     26.3    57    11     19.3
8           58    17     29.3    58    13     22.4
9           59    17     28.8    56    12     21.4
10          59    15     25.4    57    12     21.1
11          58    15     25.9    56    12     21.4
12          58    15     25.9    56    13     23.2
13          59    16     27.1    54    14     25.9
14          58    19     32.8    55    15     27.3
15          56    18     32.1    52    15     28.8
16          53    17     32.1    49    11     22.4
17          53    17     32.1    45    15     33.3
18          52    26     50.0    35    13     37.1
Total       1035  302    29.2    980   231    23.6

Male model
            Co-training          EN-CO
Iteration   Add.  Err.   %       Add.  Err.   %
1           60    12     20.0    60    5      8.3
2           60    11     18.3    59    8      13.6
3           59    9      15.3    58    3      5.2
4           59    9      15.3    58    4      6.9
5           59    5      8.5     58    3      5.2
6           59    10     16.9    58    4      6.9
7           58    10     17.2    57    3      5.3
8           59    11     18.6    57    4      7.0
9           58    10     17.2    56    2      3.6
10          58    11     19.0    55    7      12.7
11          58    11     19.0    56    5      8.9
12          57    10     17.5    55    6      10.9
13          55    9      16.4    51    5      9.8
14          56    12     21.4    49    5      10.2
15          55    11     20.0    43    6      14.0
16          53    9      17.0    41    7      17.1
17          52    12     23.1    31    4      12.9
18          48    15     31.3    25    4      16.0
Total       1023  187    18.3    927   85     9.2
6. REFERENCES

[1] D. Ververidis, C. Kotropoulos, and I. Pitas, "Automatic emotional speech classification," Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 593–596, 2004.

[2] O.W. Kwon, K. Chan, J. Hao, and T.W. Lee, "Emotion recognition by speech signals," Proc. of Eurospeech, pp. 125–128, 2003.

[3] T.L. Nwe, S.W. Foo, and L.C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, no. 4, pp. 603–623, 2003.

[4] B. Schuller, G. Rigoll, and M. Lang, "Hidden Markov model-based speech emotion recognition," Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 1–4, 2003.

[5] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabeled documents using EM," Machine Learning, vol. 39, no. 2, pp. 103–134, 2000.

[6] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," Proc. of Annual Conference on Computational Learning Theory, pp. 92–100, 1998.

[7] B. Maeireizo, D. Litman, and R. Hwa, "Co-training for predicting emotions with spoken dialogue data," Companion Proc. of Annual Meeting of the Association for Computational Linguistics, pp. 202–205, 2004.

[8] X. Zhu, J. Lafferty, and Z. Ghahramani, "Semi-supervised learning using Gaussian fields and harmonic functions," Proc. of International Conference on Machine Learning, pp. 912–919, 2003.

[9] D. Zhou, B. Scholkopf, and T. Hofmann, "Semi-supervised learning on directed graphs," Advances in Neural Information Processing Systems, pp. 1633–1640, 2005.

[10] S. Goldman and Y. Zhou, "Enhancing supervised learning with unlabeled data," Proc. of International Conference on Machine Learning, pp. 327–334, 2000.

[11] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, pp. 32–80, 2001.

[12] Dan-Ning Jiang and Lian-Hong Cai, "Speech emotion classification with the combination of statistic features and temporal features," Proc. of IEEE International Conference on Multimedia and Expo, vol. 3, pp. 1967–1970, 2004.

[13] C.W. Hsu and C.J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, pp. 415–425, 2002.