2016 European Modelling Symposium
Recognizing Emotional State Changes using Speech Processing

Reza Ashrafidoost
Department of Computer Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
[email protected]

Saeed Setayeshi
Department of Computer Science, Amirkabir University of Technology, Tehran, Iran

Arash Sharifi
Department of Computer Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
[email protected]
Abstract— Research on understanding human emotions in speech seeks to determine the mood of an utterance by analyzing cognitive attributes extracted from the acoustic speech signal. Speech contains rich patterns that are altered by the mood of the speaker. This paper studies speech from a database and long-term recordings to analyze how the mood of an individual speaker changes during long-term speech. We introduce a learning method based on a statistical model to classify the emotional states and moods of an utterance and to track their changes. To this end, the perceptual background of the individual speaker is analyzed and classified during the speech to extract the patterns embedded in the speech signal. The proposed method classifies the emotions of the utterance into seven standard classes: happiness, anger, boredom, fear, disgust, neutral and sadness. For the training phase we use the standard speech corpus EmoDB. After the pre-processing tasks are done, the speech patterns and meaningful attributes are extracted by the MFCC method and selected by the SFS method; we then apply a statistical classification approach, LGMM, to categorize the obtained features and finally illustrate the trend of changes in the emotional states.

Keywords— speech emotion recognition; human-computer interaction (HCI); Gaussian mixture model (GMM); speech emotional state; mel frequency cepstral coefficient (MFCC)

I. INTRODUCTION

The manner of speaking plays a significant role in human communication and is a natural way to express emotion and feeling in conversation. The tone of voice, in particular, conveys the emotional state of the speaker: when a speaker utters a word with an emotion that changes the tone of speech, the meaning of the word is completed. Emotion recognition from speech remains one of the challenging problems in designing systems based on intelligent human-computer interfaces. Such systems can recognize the feelings carried by uttered speech when they are equipped with intelligent emotion recognition techniques and algorithms [1]. Systems of this kind can be used to describe the attributes of uttered speech, including the mental and cognitive background and the emotions of the speaker. This makes it possible for designers of intelligent or adaptive systems to build machines that react automatically and appropriately to natural human needs in different situations.

Evidently, one of the key areas that has attracted considerable attention to these systems is automatic emotion recognition (AER) from human speech. Scientists who have been working on voice and speech technology for the past four decades now have a good understanding of voice analysis, human speech and speech processing-based systems, and have developed various useful applications in this field. With the capabilities provided by speech signal analysis, researchers in artificial intelligence (AI), robotics and human-computer interaction (HCI) can design machines that support tools and systems related to natural human behavior. Some of these are responsive and adaptive systems that detect internal emotions and make it possible to anticipate an individual's behavior. Examples include speech production, judgment and evaluation systems, security and surveillance systems, speaker recognition systems, human-robot interaction (HRI) machines, and, more generally, environments equipped with smart and cognitive workplace systems. To achieve this, streaming data collection from users must be automated so that these systems reach optimal performance, provide real-time services, and remain compatible with users' needs. In this paper, we propose an approach that acquires the characteristics of a speaker and his or her emotional changes by studying the properties of the speech signal. This information is used by an intelligent machine to recognize the conceptual attributes of speech bound into the speaker's voice. In this study, we use a corpus of utterances from EmoDB as the input and MATLAB as the simulation environment to implement this speech signal recognition procedure. To do so, some pre-processing tasks are first performed on the raw speech signal; the desired features are then extracted by the Mel Frequency Cepstral Coefficient (MFCC) method, and the attributes of each uttered word of the speech are classified separately using a Learning Gaussian Mixture Model (LGMM), an innovative classification approach. The resulting emotional states, which represent the emotional state of the speaker for each spoken word, are labeled and arranged side by side according to the speech stream. Finally, we can delineate the trend of the speaker's emotional states during a speech or conversation. Using the proposed approach, the emotional states of the utterance are classified into seven standard emotional classes: happiness, anger, boredom, fear,
disgust, neutral and sadness. Regardless of the content of the talk, the system can detect and track the trend of the speaker's real internal emotional states. Emotional speech recognition with this classification method yields good results and high accuracy, both for recognizing emotions and for tracking their trend of changes. The main goal of this article is to propose an approach, based on an innovative learning method derived from the Gaussian Mixture Model (GMM), for extracting internal emotions and feelings by processing the speech signal and then representing the trend of changes in emotional states. This kind of speech pattern processing can be used in intelligent systems that interact closely with human users to predict their emotional states. Systems equipped with this capability can be applied in fields such as medical, educational and surveillance systems in smart workplaces.

Up to the present time, researchers have focused specifically on localizing emotion transitions embedded in speech, and most of them have concentrated on acoustic features of the speech signal. For instance, in 2010, S. Wu et al. studied speech emotion recognition using modulation spectral features (MSF), extracting speech features from a long-term spectro-temporal representation with an auditory and modulation filter bank [2]. In 2012, X. Anguera et al. proposed a method to detect speaker changes using two consecutive fixed-length windows, modeling each with a Gaussian Mixture Model and applying distance-based measures such as the Generalized Likelihood Ratio (GLR), Kullback-Leibler (KL) divergence, and Cross Log Likelihood Ratio (CLLR) [3]. In other studies on extracting emotion from speech, methods such as SVM [4], variational Bayes free energy and factor analysis have been used [5, 6]. However, these methods appear to require large databases for the training and testing phases to be effective.
II. EMOTIONAL SPEECH DATABASE

The emotional speech database (EmoDB) provided by the Technical University of Berlin is a standard speech corpus that is widely used in speech science and speech processing research. It contains audio recordings of ten actors and actresses (five male and five female) who pronounced sentences in German in seven standard emotion classes: happiness, anger, disgust, fear, neutral, sadness and boredom. Each actor was asked to read one of ten predetermined, vowel-rich sentences with a dedicated emotion. Approximately 800 sentences were recorded to prepare the database, and about 500 samples were then selected carefully on the basis of emotion recognition by human listeners. This procedure keeps the sentences that most closely represent the real, natural emotions of speakers in particular emotional states: utterances were retained only if they achieved a recognition rate higher than 80% and were judged natural by more than 60% of the listeners, which increases the performance and accuracy of the database [7]. In this experiment, we used 454 examples of uttered emotional speech as input, covering the seven standard emotions contained in EmoDB.
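For readers preparing the corpus, the emotion of each EmoDB recording is encoded by a single letter at position six of file names such as 03a01Fa.wav. The following minimal Python sketch assumes the standard EmoDB file-naming convention (not something specified in this paper) and maps those codes to the seven classes used here.

```python
# Minimal sketch: map EmoDB file names to the seven emotion classes.
# Assumes the standard EmoDB naming convention, where the 6th character
# of a file name such as "03a01Fa.wav" encodes the acted emotion.
import os

EMODB_CODES = {
    "W": "anger",      # Wut
    "L": "boredom",    # Langeweile
    "E": "disgust",    # Ekel
    "A": "fear",       # Angst
    "F": "happiness",  # Freude
    "T": "sadness",    # Trauer
    "N": "neutral",
}

def emotion_of(path: str) -> str:
    """Return the emotion label encoded in an EmoDB file name."""
    name = os.path.basename(path)
    return EMODB_CODES[name[5]]  # e.g. "03a01Fa.wav" -> "F" -> "happiness"

if __name__ == "__main__":
    print(emotion_of("03a01Fa.wav"))  # happiness
```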
III. IMPLEMENTATION METHOD

The different moods reflected in a speaker's voice are represented by particular patterns of acoustic features in the speech signal. In other words, the meaningful information related to the emotional state of the utterance is encoded in the speech signal of the speaker's voice. This information is decoded, and the embedded emotions are perceived, when it is received by listeners.
Figure 1: Overview of emotional speech recognition routine
Therefore, the first step toward designing automatic emotion recognition (AER) systems is to determine how the emotional states expressed by the speaker are encoded in the spoken utterance. This is done by extracting the most discriminating features from speech samples in the training phase. A classification method then decodes the data in order to recognize the class of the particular emotional state [8]. The approaches commonly used for speech processing are derived from methods generally known as pattern recognition. In particular, each moment of the speech signal stream carries encoded data, so analytic work on speech emotion recognition (SER) closely resembles the pattern recognition cycle. Accordingly, the words uttered in the input speech signal are analyzed separately and the emotion recognition routine is performed for each of them. The changes in the trend of emotional states then determine the prevailing emotional feelings of the utterance during the lecture or conversation. This result is refined by a probabilistic filtering method to boost classification accuracy. The overall view of the proposed approach is illustrated in figure 1.

A. Feature Extraction and Selection

In the first stage of the proposed method, pre-processing tasks are performed on the input speech signal using windowing techniques [10]. Windowing is applied to each frame to obtain the spectrum of the speech signal, for which we use 256 frequency points to calculate the DFT [11]. We apply Hamming windowing to frame the speech signal into long-term segments (figure 2): a 256 ms Hamming window with a 64 ms frame shift is multiplied by the speech signal s(n), and a 16 kHz sampling rate is considered reasonably suitable for this work. Thereafter, frequency warping is employed to convert the spectrum of the speech to the Mel scale, where a triangular filter bank at uniform spacing is obtained [12]. These filters are multiplied by the magnitude spectra, and the MFCCs are eventually obtained. In this paper, we use 20 filter banks and 12 MFCCs for feature extraction.
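For illustration, a minimal Python sketch of such an MFCC front end is given below. It uses the librosa library, which is an assumption (the authors worked in MATLAB); the 256 ms window and 64 ms shift at 16 kHz correspond to 4096 and 1024 samples, and the FFT length is chosen to cover the full window rather than the 256 frequency points quoted above.

```python
# Hedged sketch of the MFCC front end described above (librosa assumed;
# the paper's experiments were run in MATLAB).
import librosa

def extract_mfcc(path, sr=16000):
    """Return a (12, n_frames) matrix of MFCCs for one utterance."""
    y, sr = librosa.load(path, sr=sr)   # resample to 16 kHz
    # 256 ms Hamming window (4096 samples) with a 64 ms shift (1024 samples),
    # a 20-band Mel filter bank and 12 cepstral coefficients.
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=12,
        n_fft=4096, hop_length=1024,
        n_mels=20, window="hamming",
    )
    return mfcc
```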
Figure 3: Classification and recognition diagram of the proposed method for a single word of speech
Moreover, the large number of extracted speech features naturally makes it harder to classify them accurately from a limited number of training samples. Consequently, an efficient feature selection method is indispensable. It converts the obtained coefficients into the required coefficients, reduces the size of the feature vector, and thereby prevents the curse of dimensionality in the classification process. To this end, we use Sequential Forward Selection (SFS), an iterative algorithm in which the selected feature subset is augmented step by step before being used as the classifier input [13] (a sketch of this greedy selection is given below).

B. Emotion Classification (LGMM Method)

Using the proposed method, the emotion associated with a single uttered word is determined. The main purpose of speech processing with this approach is to recognize the emotional states of the speaker and to bring out the sequence of their changes during a longer piece of speech. The first level of the emotion recognition cycle is represented in figure 3. At this stage, the pre-processing tasks, including windowing, are performed, and silent frames are removed from the input speech signal. The required features of the speech signal are then extracted and selected, using the MFCC and SFS methods, for each single word. We use a type of Gaussian Mixture Model that we have modified to perform learning, a learning-based GMM which we call LGMM [14].
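Returning to the SFS step of subsection A, the following minimal Python sketch shows the idea as a greedy wrapper. The selection criterion (a user-supplied score such as validation accuracy of the classifier on the chosen columns) is an assumption for illustration, not a detail taken from the paper.

```python
# Minimal sketch of Sequential Forward Selection (SFS). Assumes a scoring
# function score(feature_indices) -> float, e.g. validation accuracy of a
# classifier trained on the selected feature columns (assumed criterion).
def sfs(n_features, score, target_size):
    """Greedy forward selection of feature indices."""
    selected, best_so_far = [], float("-inf")
    remaining = set(range(n_features))
    while remaining and len(selected) < target_size:
        # Evaluate adding each remaining feature and keep the best candidate.
        cand = max(remaining, key=lambda f: score(selected + [f]))
        cand_score = score(selected + [cand])
        if cand_score <= best_so_far:
            break                      # no candidate improves the score
        selected.append(cand)
        remaining.remove(cand)
        best_so_far = cand_score
    return selected
```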
Figure 2: Applying the Hamming window to the time-domain signal
In statistics, a mixture model is a probabilistic model used to represent subsets of classes that belong to a larger population. A Bayesian model such as the GMM is a special case of these statistical models that is straightforward yet very effective: it can form soft approximations and smoothly curved shapes for virtually any form of distribution of random data. It has been used successfully in many systems, especially in speech recognition and speaker identification. Gaussian mixture modelling was introduced first by N. Day and later by J. Wolfe in the late 1960s, in connection with what became known as the Expectation-Maximization (EM) algorithm [15, 16]. The main reason for using this model in a wide range of intelligent systems is its ability to model data classes and the distribution of a speaker's acoustic observations [17]. Mathematically, the classical GMM likelihood for a D-dimensional feature vector x is a weighted sum of K multivariate Gaussian components f_i(x), each with a D×1 mean vector \mu_i and a D×D covariance matrix \Sigma_i, as shown in equation 1:

p(\mathbf{x} \mid \lambda_k) = \sum_{i=1}^{K} c_i \, f_i(\mathbf{x})          (1)

In equation 1, \lambda_k stands for the parameters of the GMM with K components, and the mixture weights must satisfy the two conditions c_i \ge 0 for i = 1, \dots, K and \sum_{i=1}^{K} c_i = 1. The i-th component can be written as equation 2:

f_i(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \mu_i)^{Tr} \, \Sigma_i^{-1} (\mathbf{x} - \mu_i) \right)          (2)

In equations 1 and 2, \Phi_i = (\mu_i, \Sigma_i) represents the parameters of the i-th Gaussian density and A^{Tr} is the transpose of a matrix A. In general, a GMM is identified by its associated parameters \lambda_k = (c_i, \Phi_i), i = 1, \dots, K. In this paper, we propose an enhanced derivation of the Gaussian Mixture Model to provide classes of emotions using a combination of Gaussian densities. A Gaussian Mixture Model is a weighted sum of several Gaussian components, i.e. a linear combination of M Gaussian densities, as represented in equation 3:

P(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(\mathbf{x})          (3)

Here \mathbf{x} is a D-dimensional stochastic vector, b_i(\mathbf{x}) are the density components for i = 1, \dots, M, and p_i are the mixture weights for i = 1, \dots, M. Each Gaussian component is D-dimensional and has the form of equation 4:

b_i(\mathbf{x}) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \mu_i)^{Tr} \, \Sigma_i^{-1} (\mathbf{x} - \mu_i) \right)          (4)

In equation 4, \mu_i represents the mean vector and \Sigma_i the covariance matrix. The mixture weights follow the general probability rule that probabilities sum to one, i.e. \sum_{i=1}^{M} p_i = 1. The mathematical flexibility of this model is its prominent advantage for speech modelling: the complete Gaussian mixture density is fully specified by the mean vectors, covariance matrices and mixture weights of all density components. As a result, we can use a set of GMMs to calculate the probability of a particular emotion prevailing in the utterance. The method uses maximum likelihood estimation, which produces a class-conditional probability density function for use in a Bayesian classifier (a sketch of this decision rule is given after equation 7). Suppose we have a set of independent samples X = {x_1, x_2, \dots, x_N} drawn from a data distribution represented by the probability density function P(x; \theta), where \theta is the set of parameters of the PDF. The likelihood is given in equation 5:

L(X; \theta) = \prod_{n=1}^{N} P(x_n; \theta)          (5)

This equation represents the likelihood of the parameters \theta given the data X. The goal is to find the \theta that maximizes the likelihood (equation 6):

\hat{\theta} = \arg\max_{\theta} L(X; \theta)          (6)

This maximization is usually not tractable in closed form for the likelihood itself, but the log-likelihood in equation 7 is analytically and mathematically more convenient. Equation 7 is also called the log-likelihood function:

\ell(X; \theta) = \ln L(X; \theta) = \sum_{n=1}^{N} \ln P(x_n; \theta)          (7)
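As an aside, the following Python sketch (an illustration, not the authors' MATLAB implementation) shows how a set of per-emotion GMMs built from equations 1-4 would be used in such a Bayesian decision: the GMM log-density of a feature vector is evaluated under each emotion model and the most likely class is returned. In practice the per-frame log-likelihoods of all frames of a word would be summed before the comparison.

```python
# Hedged sketch: evaluating the GMM density (eqs. 1-4) and classifying a
# feature vector by maximum log-likelihood over per-emotion models.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_logpdf(x, weights, means, covs):
    """log p(x | lambda) for one GMM given its (weights, means, covs)."""
    comp = [w * multivariate_normal.pdf(x, mean=m, cov=c)
            for w, m, c in zip(weights, means, covs)]
    return np.log(np.sum(comp) + 1e-300)   # guard against underflow

def classify(x, emotion_models):
    """emotion_models: dict mapping emotion -> (weights, means, covs)."""
    scores = {emo: gmm_logpdf(x, *params)
              for emo, params in emotion_models.items()}
    return max(scores, key=scores.get)
```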
Due to the monotonicity of the logarithm function, maximizing the log-likelihood of equation 7 yields the same solution as maximizing L(X; \theta). According to these definitions, the implementation steps of the LGMM classifier are as follows. In the first step the parameters are initialized; the expectation step then computes, on the basis of the current parameters, the posterior probabilities w_{i,k} for i = 1, \dots, n and k = 1, \dots, K:

w_{i,k} = \frac{ c_k^{(t)} \, \phi(x_i; \mu_k^{(t)}, \Sigma_k^{(t)}) }{ \sum_{j=1}^{K} c_j^{(t)} \, \phi(x_i; \mu_j^{(t)}, \Sigma_j^{(t)}) }          (8)

The maximization step then updates the parameters so as to increase the likelihood:

c_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} w_{i,k}          (9)

\mu_k^{(t+1)} = \frac{ \sum_{i=1}^{n} w_{i,k} \, x_i }{ \sum_{i=1}^{n} w_{i,k} }          (10)

\Sigma_k^{(t+1)} = \frac{ \sum_{i=1}^{n} w_{i,k} \, (x_i - \mu_k^{(t+1)}) (x_i - \mu_k^{(t+1)})^{Tr} }{ \sum_{i=1}^{n} w_{i,k} }          (11)
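A compact NumPy sketch of one EM iteration implementing equations 8-11 is given below; it is an illustrative re-implementation under the usual EM assumptions, not the authors' code.

```python
# Hedged sketch of one EM iteration for a GMM (equations 8-11), using NumPy.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """X: (n, D) data; weights: (K,); means: (K, D); covs: (K, D, D) arrays."""
    n, _ = X.shape
    K = len(weights)
    # E-step (eq. 8): responsibilities w[i, k].
    dens = np.column_stack([
        weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
        for k in range(K)
    ])
    w = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
    # M-step (eqs. 9-11): update weights, means and covariances.
    nk = w.sum(axis=0)                          # effective counts per component
    new_weights = nk / n                        # eq. 9
    new_means = (w.T @ X) / nk[:, None]         # eq. 10
    new_covs = np.empty_like(covs)
    for k in range(K):                          # eq. 11
        d = X - new_means[k]
        new_covs[k] = (w[:, k, None] * d).T @ d / nk[k]
    return new_weights, new_means, new_covs
```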
Figure 4: Sample diagram of the speaker's emotional states during the speech
This trend shows how the feelings of the speaker change while delivering a talk or speech. By applying the proposed approach, the system can measure the changes in the emotional states of the utterance. Figure 4 illustrates the trend for a sample speech: it shows how the emotional states and moods of the speaker change while speaking, and it is apparent that the speaker's mood moves between anger, fear, disgust, neutral and sadness during the speech.
The expectation and maximization steps are repeated iteratively until convergence is reached. The features obtained by the MFCC and SFS methods form a 12-dimensional space. We begin with one Gaussian component for each emotional class and calculate its parameters; this phase of the proposed approach is the training phase, in which the learning takes place. Each component is then split into two parts and each part is retrained; these splits and retrainings are repeated until the required final number of components is reached (a sketch of this loop follows below). The iterative algorithm used to find the best parameters for this training phase is Expectation Maximization (EM), an extended version of the Baum-Welch algorithm [18]. The EM algorithm has also been used to model the probability density function (PDF) of emotional speech prosody features in [19, 20]. Using this method, the optimal Gaussian components are obtained after repeated iterations, and the training of the LGMM is completed successfully. Furthermore, the samples in each of the mentioned classes were randomly partitioned into 10 subsets of approximately equal size; each validation trial takes nine subsets from every class for training, while the remaining subset is kept invisible to the training phase and used only for testing. The training phase is performed only once, when the application starts. At this stage of emotional classification, all the previously mentioned steps are performed on the feature vector acquired for each single spoken word in the uttered speech signal, and the emotion of the utterance while expressing that particular word is recognized. At the final stage of this classification level, one emotional-class label is obtained per uttered word in the whole speech. Finally, we can show the trend of changes in the emotional states of the utterance during the speech.
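The split-and-retrain scheme described above can be sketched as follows. The way a component is split (perturbing its mean along the component's standard deviations) is an assumption made here for illustration, since the paper does not specify it; em_step refers to the EM iteration sketched earlier.

```python
# Hedged sketch of the split-and-retrain LGMM training loop described above.
# The mean-perturbation split is an assumed scheme, not taken from the paper.
# Relies on em_step() from the previous sketch.
import numpy as np

def train_lgmm(X, target_components, em_iters=20, eps=0.1):
    """Train a GMM on X (n, D) by repeatedly splitting and retraining components."""
    n, D = X.shape
    weights = np.array([1.0])
    means = X.mean(axis=0, keepdims=True)
    covs = np.cov(X.T).reshape(1, D, D)
    while True:
        for _ in range(em_iters):               # retrain after each split
            weights, means, covs = em_step(X, weights, means, covs)
        if len(weights) >= target_components:
            break
        # Split every component in two by perturbing its mean (assumed scheme).
        weights = np.concatenate([weights, weights]) / 2.0
        shift = eps * np.sqrt(np.stack([np.diag(c) for c in covs]))
        means = np.concatenate([means - shift, means + shift])
        covs = np.concatenate([covs, covs], axis=0)
    return weights, means, covs
```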
C. Changes of Emotions

The emotional information embedded in the speech signal is derived from the expressed input speech, or from parts of it; this uttered emotional information describes the emotional state of the speaker and its changes [10]. Using the proposed method, we automatically obtain a diagram of the emotional changes during the speech.

IV. EXPERIMENTAL RESULTS

In this paper, we propose a method for recognizing the emotions of utterances during a speech or conversation using features extracted from the speech signal. The developed method was applied to and tested on the data in EmoDB, the results were investigated using cross-validation, and the evaluation is reported as recognition accuracy. Recognition accuracy is provided for each of the seven emotional classes. The results are based on the performance of the proposed method on the Berlin emotional speech corpus (EmoDB); the recognition accuracy rates are shown in table 1.

Table 1: Recognition accuracy rate on EmoDB
Emotion     Classification accuracy
Happy       81.53 %
Angry       84.12 %
Boredom     86.82 %
Disgust     85.46 %
Fear        81.28 %
Neutral     86.37 %
Sadness     88.16 %
Consequently, we calculated analytical and statistical parameters from the obtained results. The standard deviation of the accuracy rates over the seven emotional states is 2.44, which indicates that our approach offers a high degree of stability in emotion recognition from speech. We also obtained 5.97, 0.0288, 6.88 and 84.82 for the variance, dispersion coefficient, range of variation and mean of the accuracy rates, respectively (these values can be reproduced with the short check below). The dispersion coefficient likewise underlines the stability of the system. The comparison of recognition rates is clearly depicted in figure 5. Consistent with table 1, the highest recognition rates are achieved for the emotions of Sadness and Boredom, and the lowest for Fear and Happiness. This outcome is expected and predictable, given the distinctive acoustic attributes of Sadness and Boredom and the less distinctive attributes of Fear and Happiness.
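As a quick check, the summary statistics quoted above can be reproduced from the per-class accuracies of Table 1 with a few lines of Python; population variance and the coefficient of variation defined as std/mean are assumed here.

```python
# Reproducing the summary statistics of Table 1 (population statistics and
# coefficient of variation = std/mean are assumed).
import numpy as np

acc = np.array([81.53, 84.12, 86.82, 85.46, 81.28, 86.37, 88.16])
print(round(acc.mean(), 2))              # 84.82  (mean)
print(round(acc.var(), 2))               # 5.97   (variance)
print(round(acc.std(), 2))               # 2.44   (standard deviation)
print(round(acc.std() / acc.mean(), 4))  # 0.0288 (dispersion coefficient)
print(round(acc.max() - acc.min(), 2))   # 6.88   (range of variation)
```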
Figure 5: Comparison chart of recognition rates obtained for each emotion
Accordingly, the acoustic attributes extracted from the speech signal for these emotions are similar and follow the same pattern; this similarity of patterns can also be seen clearly in the diagram.
V. CONCLUSIONS

We have demonstrated an approach for speech emotion recognition (SER) that uses a statistical, probability-based classification method to obtain the trend of changes in the emotional states of the speaker. To this end, we applied a modified version of the GMM as the basis of the emotion classification, which we have named the Learning Gaussian Mixture Model (LGMM), and obtained admissible recognition accuracy rates with the proposed method. The main motivation of this research is to recognize the trend of changes in the feelings and emotions of a speaker during speech. A prominent advantage of this method is that it depicts a view of the emotional behavior of the speaker regardless of the speech content, instantaneous events or factitious behaviors during the speech or conversation.
REFERENCES

[1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, "Emotion recognition in human-computer interaction", IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32-80, January 2001.
[2] S. Wu, T. H. Falk, and W. Chan, "Automatic speech emotion recognition using modulation spectral features", Speech Communication, vol. 53, pp. 768-785, May 2011.
[3] X. Anguera, S. Bozonnet, N. Evans, and C. Fredouille, "Speaker diarization: a review of recent research", IEEE Transactions on Audio, Speech, and Language Processing. DOI: 10.1109/TASL.2011.2125954.
[4] B. Fergani, M. Davy, and A. Houacine, "Speaker diarization using one-class support vector machines", Speech Communication, vol. 50, pp. 355-365, 2008. DOI: 10.1016/j.specom.2007.11.006.
[5] F. Valente, "Variational Bayesian methods for audio indexing", PhD dissertation, Université de Nice-Sophia Antipolis, France, 2005. DOI: 10.1007/11677482_27.
[6] P. Kenny, D. Reynolds, and F. Castaldo, "Diarization of telephone conversations using factor analysis", IEEE Journal of Selected Topics in Signal Processing, vol. 4, pp. 1059-1070, 2010.
[7] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech", INTERSPEECH, pp. 1517-1520, 2005.
[8] B. Yang and M. Lugger, "Emotion recognition from speech signals using new harmony features", Signal Processing (Special Section on Statistical Signal & Array Processing), vol. 90, no. 5, pp. 1415-1423, May 2010. DOI: 10.1016/j.sigpro.2009.09.009.
[9] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 357-366, 1980.
[10] W. Kowalczyk and C. N. van der Wal, "Detecting changing emotions in natural speech", Applied Intelligence, vol. 39, pp. 675-691, 2013. DOI: 10.1007/s10489-013-0449-1.
[11] S. Motamed and S. Setayeshi, "Speech emotion recognition based on learning automata in fuzzy Petri-net", Journal of Mathematics and Computer Science, vol. 12, pp. 173-185, August 2014.
[12] R. B. Lanjewar, S. Mathurkar, and N. Patel, "Implementation and comparison of speech emotion recognition system using Gaussian Mixture Model (GMM) and K-Nearest Neighbor (K-NN) techniques", Procedia Computer Science, vol. 49, pp. 50-57, 2015.
[13] J. Kittler, "Feature set search algorithms", Pattern Recognition and Signal Processing, pp. 41-60, 1978.
[14] R. Ashrafidoost and S. Setayeshi, "A method for modelling and simulation the changes trend of emotions in human speech", in Proc. 9th EUROSIM Congress on Modelling and Simulation, Sep. 2016, pp. 444-450. DOI: 10.1109/EUROSIM.2016.30.
[15] J. H. Wolfe, "Pattern clustering by multivariate analysis", Multivariate Behavioral Research, vol. 5, pp. 329-359, 1970.
[16] D. Ververidis and C. Kotropoulos, "Emotional speech classification using Gaussian mixture models and the sequential floating forward selection algorithm", IEEE International Conference on Multimedia and Expo, Amsterdam, 2005. DOI: 10.1109/ICME.2005.1521717.
[17] H. Farsaie Alaie, L. Abou-Abbas, and C. Tadj, "Cry-based infant pathology classification using GMMs", Speech Communication, 2015. DOI: 10.1016/j.specom.2015.12.001.
[18] L. R. Welch, "Hidden Markov models and the Baum-Welch algorithm", IEEE Information Theory Society Newsletter, vol. 53, no. 1, pp. 10-13, Dec. 2003.
[19] B. Schuller, G. Rigoll, and M. Lang, "Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine - belief network architecture", in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, May 2004, vol. 1, pp. 577-580.
[20] C. M. Lee and S. Narayanan, "Towards detecting emotion in spoken dialogs", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, pp. 293-303, 2005.