Towards Automatic Recognition of Emotion in Speech

Aishah A.M. Razak, Mohd Hafizuddin Mohd Yusof, Prof. Ryoichi Komiya
Faculty of Information Technology, Multimedia University, 63100 Cyberjaya, Selangor, Malaysia
[email protected], [email protected], [email protected]

Abstract-This paper discusses an approach towards automatic recognition of emotion in speech using a computer. First, a design for the emotion recognizer is proposed. An LP analysis algorithm has been used for the speech emotion parameter extraction. A total of 22 speech features have been selected to represent each emotion. A database consisting of emotional Malay and English voice samples has been developed for training and recognition purposes. A fuzzy concept has been applied to recognize the emotion of the selected voice sample. The result from computer recognition is compared to the human recognition rate to confirm the reliability of the result and also to explore how well people and computers can recognize emotion in speech. It is found that computer recognition of emotion is possible and that the average recognition rate of 66% is satisfactory based on the comparison with human perception. According to the confusion matrix tables for both human and computer recognition, the way humans interpret emotion is different from the computer.

Keywords-speech processing; emotion recognizer; emotion parameter; LP analysis; fuzzy concept


I. INTRODUCTION

The importance of emotion recognition from human speech has increased in recent years to improve both the naturalness and efficiency of human-machine interactions. The speech interface, which is used in the human-machine interface system, has the advantage of simple usability. Possible applications of automatic emotion recognition are the camera-less mobile videophone [1], interactive movies [2], automatic dialog systems in call centers [3], and intelligent communication systems with talking head images [4]. In the case of the camera-less mobile videophone shown in Fig. 1, the intention is to reconstruct facial expressions at the receiving end based on the emotions extracted from the received voice tone.

Figure 1. Basic block diagram of the camera-less mobile videophone system

The emotion recognizer system identifies the emotional state of the input speech signal and displays the corresponding facial expression for that particular emotion. There is no need to transmit an actual video signal from the transmitter side; therefore, the existing 2G mobile phone infrastructure can be used.

II. THE DESIGN OF EMOTION RECOGNIZER

The design of the emotion recognizer basically involves two stages, illustrated in Fig. 2: emotion parameter extraction and recognition.


Figure 2. Design of emotion recognizer


A. Emotion Parameter Extraction

This stage, which involves the speech processor, deals with emotion feature selection, speech preprocessing, and the extraction algorithm. Determining the emotion features is a crucial issue in the emotion recognizer design, because the recognition result depends strongly on the emotional features used to represent the emotion. Based on many speech studies [5][6][7], it is agreed that the prosodic components, composed of the pitch structure, temporal structure, and amplitude structure, contribute most to the expression of emotion in speech. Among the prosodic features, pitch, or fundamental frequency (F0), has been widely recognized as a very important parameter for identifying the emotional state of speech. It is reported that the contour of F0 versus time (the pitch contour) is the aspect of the speech signal providing the clearest indication of the emotional state of the speaker [8]. Other acoustic features are vocal energy, frequency, spectral features, formants (usually only the first one or two formants, F1 and F2, are considered), and temporal features (speech rate and pausing) [8]. Another approach to feature extraction is to enrich the set of features by considering derivative features such as the LPCC (linear predictive coding cepstrum) parameters of the signal [9], or features of the smoothed pitch contour and its derivatives [10].

A detailed analysis has been done on selected emotion parameters [11][12]. Based on these analyses, a total of 22 features have been chosen to represent the emotion features. The 22 features are pitch (pt), jitter (jt), the first three formants (F1, F2, F3), speech duration (d), speech energy (e), zero-crossing rate (zcr), and 14 LP coefficients (a1-a14). The LP coefficients are included because we intended to use LP analysis for the extraction algorithm; besides, they represent the phonetic features of speech that are often used in speech recognition [2]. Once the parameters had been identified, we proceeded to the extraction algorithm studies.
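To make the feature set concrete, the sketch below groups the 22 features into a small container. Python is used purely for illustration (the paper's own implementation is in MATLAB), and all field names are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EmotionFeatures:
    """22-dimensional feature vector for one voice sample (names are illustrative)."""
    pitch: float      # pt: average fundamental frequency over the speech period
    jitter: float     # jt: average first-order pitch-period perturbation
    f1: float         # first formant frequency
    f2: float         # second formant frequency
    f3: float         # third formant frequency
    duration: float   # d: length of the detected speech period
    energy: float     # e: average frame energy
    zcr: float        # zero-crossing rate
    lpc: List[float] = field(default_factory=list)  # a1..a14: 14 LP coefficients

    def as_vector(self) -> List[float]:
        # Flatten to the 22 values used for training and recognition.
        assert len(self.lpc) == 14, "expected 14 LP coefficients"
        return [self.pitch, self.jitter, self.f1, self.f2, self.f3,
                self.duration, self.energy, self.zcr] + list(self.lpc)
```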

Figure 3. Basic block diagram of the speech processor (input: a 1-3 s digital speech signal in ASCII format, 10 kHz sampling rate, 16-bit accuracy; the speech processing and analysis stages output the 22 features e, a1-a14, d, pt, jt, F1, F2, F3 and zcr)

The speech processor, as shown in Fig. 3, has three stages:

1) Preprocessing: The digitized speech sample is first normalized by its maximum amplitude, and the d.c. component is removed. Next, the sample is segmented into 25 ms frames with a 5 ms overlap [13]. Then it is filtered with a zero-phase filter to remove any low-frequency drift.
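A minimal sketch of this preprocessing step, assuming the 10 kHz sampling rate given in Fig. 3. The high-pass cutoff and filter order are assumptions, since the paper only states that a zero-phase filter removes low-frequency drift; here that is realized with SciPy's forward-backward filtfilt.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(speech, fs=10_000, frame_ms=25, overlap_ms=5):
    """Normalize, remove d.c., high-pass with a zero-phase filter, and frame the signal."""
    x = np.asarray(speech, dtype=float)
    x = x / np.max(np.abs(x))            # amplitude normalization
    x = x - np.mean(x)                   # remove the d.c. component

    # Low-frequency drift removal; filtfilt applies the filter forward and backward,
    # giving zero phase. The 50 Hz cutoff and order 2 are assumptions.
    b, a = butter(2, 50 / (fs / 2), btype="highpass")
    x = filtfilt(b, a, x)

    frame_len = int(fs * frame_ms / 1000)          # 250 samples at 10 kHz
    hop = frame_len - int(fs * overlap_ms / 1000)  # 5 ms overlap between frames
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return frames
```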

2) Linear Predictive Analysis: In this study, we have used a 13th-order LP model for the analysis, and an orthogonal covariance method is used to calculate the LP coefficients [14]. All the calculations for the LP analysis are implemented as a MATLAB function whose input is the segmented speech signal and whose outputs are the 14 LP coefficients, the first reflection coefficient, the energy of the underlying speech segment, and the energy of the prediction error. These outputs are then used to determine the other parameters.
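A sketch of the LP analysis step for one frame. The paper uses an orthogonal covariance method [14] implemented in MATLAB; the version below substitutes the standard autocorrelation method with Levinson-Durbin recursion, purely to keep the example self-contained, and returns the same four outputs listed above.

```python
import numpy as np

def lp_analysis(frame, order=13):
    """LP analysis of one speech frame via the autocorrelation method and
    Levinson-Durbin recursion (the paper's orthogonal covariance method [14]
    is replaced here as an assumption, to keep the sketch self-contained)."""
    x = np.asarray(frame, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    frame_energy = r[0]

    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12          # prediction-error energy (epsilon guards silent frames)
    first_reflection = 0.0
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        if i == 1:
            first_reflection = k
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]      # update predictor coefficients
        err *= (1.0 - k * k)

    # 'a' holds 14 values for a 13th-order model: a0 = 1 plus 13 predictor coefficients.
    return a, first_reflection, frame_energy, err
```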

3) Speech period extraction; pitch, jitter, zero-crossing rate and formant calculation [13]: Each speech segment is classified as voiced or unvoiced from the prediction error signal by simply setting a threshold. The first segment classified as voiced marks the beginning of the speech period. After the beginning of the speech period, if the segment is classified as unvoiced for a few consecutive frames, the speech is decided to have ended. The length of the speech period is calculated to obtain the speech duration (d).

The pitch period for each frame that lies within the speech period is calculated using the cepstrum of the voiced prediction error signal. The method is summarized below:

• One segment of the prediction error waveform, e(n), is low-pass filtered and denoted as e_L(n).
• The cepstrum-like sequence C_e(n) is calculated as

  C_e(n) = IFFT( |FFT( e_L(n) )| ),  1 ≤ n ≤ N,   (1)

  where N is the frame size, FFT is the Fast Fourier Transform, and IFFT is the inverse FFT.
• Search for the index m where C_e(m) is the maximum amplitude in the subset {C_e(i) | 25 ≤ i ≤ N}.
• Search for the index k where C_e(k) is the maximum amplitude in the subset {C_e(i) | 25 ≤ i ≤ m - 25}. If C_e(k) > 0.7 C_e(m), k is the estimated pitch period; otherwise m is the estimated pitch period.
• If an abrupt change in the pitch period is observed compared to the previous pitch periods, the contour is low-pass filtered (or median filtered) to smooth the abrupt change.

The pitch (pt) is then estimated from the pitch period. The perturbation of the pitch period, known as jitter (jt), of order one is calculated by taking backward and forward differences of lower-order functions, given as

  j_i = p_i - p_{i-1},  i = 2, ..., N.   (2)

The pitch and jitter contours are smoothed with a 5-point median filter. Then we determine the first three formant frequencies of the speech signal in that frame from the LP coefficients; this is done by solving for the roots of the LP polynomial. The formants, referred to as F1, F2 and F3, represent the resonance frequencies of the vocal tract. The zero-crossing rate (zcr) is counted to estimate the frequency of voicing; it is calculated by counting the number of sign changes in the speech signal. All 22 features are calculated for each frame that lies within the speech period. Thus, if there are m frames within the speech period, the extracted features amount to 22 x m values. The final value used as the emotion data for each sample is taken as the average of each feature over the whole voice sample.
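The following sketch covers the frame-level measurements just described: the cepstrum-based pitch period of Eq. (1), formants from the roots of the LP polynomial, and the zero-crossing count. The 25-sample lower bound and the 0.7 threshold come from the steps above; the low-pass cutoff applied to the prediction error is an assumption, since the paper does not give one.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def pitch_period_cepstrum(pred_error, fs=10_000):
    """Estimate the pitch period (in samples) from one frame of LP prediction error,
    following Eq. (1): C_e(n) = IFFT(|FFT(e_L(n))|). Assumes a full 25 ms frame."""
    b, a = butter(4, 900 / (fs / 2))                       # low-pass cutoff is an assumption
    e_l = filtfilt(b, a, np.asarray(pred_error, dtype=float))
    c = np.real(np.fft.ifft(np.abs(np.fft.fft(e_l))))      # cepstrum-like sequence
    m = 25 + int(np.argmax(c[25:]))                        # peak in {C_e(i) | 25 <= i <= N}
    if m >= 50:                                            # subset {C_e(i) | 25 <= i <= m-25} non-empty
        k = 25 + int(np.argmax(c[25:m - 24]))
        if c[k] > 0.7 * c[m]:
            return k
    return m

def formants_from_lpc(a, fs=10_000, n_formants=3):
    """First formant frequencies from the roots of the LP polynomial
    (bandwidth-based pruning of spurious roots is omitted in this sketch)."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                      # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[:n_formants]

def zero_crossing_rate(frame):
    """Number of sign changes in the frame."""
    s = np.sign(np.asarray(frame, dtype=float))
    s[s == 0] = 1
    return int(np.sum(s[:-1] != s[1:]))
```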

B. Recognition

This stage involves the reference pattern for training and the comparator for recognition. For the purpose of a preliminary evaluation, we have adopted the fuzzy modeling technique discussed in [15] for the training and recognition process.

1) Fuzzy model: The concept of fuzzy sets is as follows. If there are n possible features for each emotion and there are m such samples, then a particular feature from each of the samples forms a fuzzy set. Thus, for each particular emotion, the resultant matrix formed is m x n. The reference emotion data set is obtained from the training voice samples. The means and variances for a particular emotion are computed for each of the 22 features and stored in the Knowledge Base (KB). The procedure is repeated for all the training data. Therefore, there are 22 means and variances corresponding to the 22 features of each of the 6 emotions, namely happiness, sadness, anger, disgust, fear and surprise.

Given a very large number of samples, by choosing the fuzzification function, the membership function of each feature value in the fuzzy set can be determined. However, we need to compute the membership functions for the features of the unknown emotion, not the reference emotion. The unknown emotion features are matched against all reference emotion features stored in the KB. It is possible to compute the membership functions by associating the features of the input emotion with the fuzzy sets. The KB consists of the means m_i and variances σ_i² for each of the 22 fuzzy sets, calculated using equations (3) and (4):

  m_i = (1/N_i) Σ_{j=1}^{N_i} x_{ij},   (3)

  σ_i² = (1/N_i) Σ_{j=1}^{N_i} (x_{ij} - m_i)²,   (4)

where N_i is the number of samples in the ith fuzzy set and x_{ij} stands for the jth feature value of the reference emotion in the ith fuzzy set, with i = 1, 2, ..., 22.

For an unknown input emotion x, the 22 features are extracted using the LP analysis. The membership function is given by

  (5)

where x_i is the ith feature of the unknown emotion. Equation (5) presumes that the unknown features are governed by the known statistics, namely the means and variances of the fuzzy sets stored in the KB. If all the x_i are close to the m_i, which represent the known statistics of a reference emotion, then the unknown emotion is identified with this known emotion, because all the membership functions are close to 1 and hence the average membership function is almost 1, as explained below. Let m_i(r), σ_i²(r) belong to the rth reference emotion, with r = 0 (happiness), 1 (sadness), 2 (anger), 3 (disgust), 4 (fear), 5 (surprise). We then calculate the average membership as

  μ_avg(r) = (1/22) Σ_{i=1}^{22} μ_i(r).   (6)

Then x ∈ r if μ_avg(r) is the maximum for r = 0, 1, ..., 5. However, the recognition of emotions by (6) using the fuzzification function (5) does not perform well. This is because, for the case of emotion recognition, we have to consider the fact that there is no standard way of expressing emotion, and some particular emotions can be expressed in more ways than others. As a result, some of the fuzzy sets have a very small variance while others have a large variance. Thus, in order to represent the possible deviations from these statistics, we introduce two structural parameters s and t into the membership function (5). The modified membership function is now given by

  (7)

where s and t are the structural parameters of the membership function. Thus, the structural parameters s and t model the variations in the mean and variance over all 22 emotion features. The choice of these parameters has implicit reasoning in the sense that if s = 1 and t = -1, it yields the original membership function (5). If the values of s and t are perturbed around these values, this reflects the changes taking place in the means and variances. Hence (7) is a generalized form of (5). In our first approach to introducing the structural parameters s and t, we have fixed the value of s to 1.2 and t to 1 for all emotions. These values were estimated by applying a gradient ascent method for s and t with a delta value of 0.05.

2) Data selection for training and recognition: First, a voice database consisting of four short sentences frequently used in everyday communication was built. The sentences are: "Itu kereta saya" (in Malay), "That is my car" (in English), "Sekarang pukul satu" (in Malay), and "Now is one o'clock" (in English), spoken in the 6 basic emotions, namely happiness, sadness, fear, anger, surprise and disgust. A total of 200 samples was collected for each emotion, which results in 1200 samples (6 x 200) for the whole database.

From these 1200 samples, we randomly chose 240 samples (40 per emotion) to form the knowledge base for the fuzzy model during training, while for recognition purposes another 360 samples (60 per emotion) were randomly selected from the database. The recognition system is designed so as to optimize the structural parameters in the fuzzification function, in order to enhance the recognition rate.
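A sketch of the fuzzy training and recognition procedure follows. Since equations (5) and (7) are not reproduced above, the membership function here is an assumed form, mu_i = (1 + s((x_i - m_i)^2 / sigma_i^2))^t, chosen because it reduces to 1/(1 + ((x_i - m_i)/sigma_i)^2) when s = 1 and t = -1, as the text requires; the paper's exact function and the sign of its reported t value may differ. The value s = 1.2 follows the text, while the default t = -1 is an assumption.

```python
import numpy as np

EMOTIONS = ["happiness", "sadness", "anger", "disgust", "fear", "surprise"]

def build_knowledge_base(training):
    """training: dict mapping an emotion name to an (n_samples, 22) array of feature
    vectors. Stores the mean and variance of each of the 22 fuzzy sets (Eqs. (3), (4))."""
    kb = {}
    for emotion, samples in training.items():
        x = np.asarray(samples, dtype=float)
        kb[emotion] = (x.mean(axis=0), x.var(axis=0) + 1e-12)  # epsilon avoids division by zero
    return kb

def membership(x, mean, var, s=1.0, t=-1.0):
    """Assumed membership form: (1 + s * (x - m)^2 / var)^t.
    With s = 1, t = -1 this is 1 / (1 + ((x - m) / sigma)^2); the paper's Eqs. (5) and (7)
    are not reproduced in the text, so this exact form is an assumption."""
    return (1.0 + s * (x - mean) ** 2 / var) ** t

def recognize(x, kb, s=1.2, t=-1.0):
    """Average the 22 membership values per reference emotion (Eq. (6)) and pick the maximum."""
    x = np.asarray(x, dtype=float)
    scores = {emo: float(np.mean(membership(x, m, v, s, t))) for emo, (m, v) in kb.items()}
    return max(scores, key=scores.get), scores
```

For example, after building kb from the 240 training vectors with build_knowledge_base, calling recognize on an unknown 22-feature vector returns the best-matching emotion label together with the per-emotion average memberships.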


III. HUMAN EMOTION RECOGNITION

Recognizing emotion in speech is a difficult task even for humans. In order to establish what recognition rate would be satisfactory for our computer recognition system, we conducted a listening test with human subjects. The 360 samples used in the computer recognition test were randomized and divided into six sets. We then asked the assessors to listen to the voice samples and choose the intended emotion. A total of 60 answer sets were collected from the assessors.

IV. RESULTS AND DISCUSSION

Table 1 shows the confusion matrix for the recognition rates achieved by the computer, and Table 2 shows the confusion matrix for the recognition rates achieved by humans.

TABLE 1: CONFUSION MATRIX FOR COMPUTER RECOGNITION

TABLE 2: CONFUSION MATRIX FOR HUMAN RECOGNITION

For computer recognition, the emotion of happiness has the highest recognition rate and anger has the lowest. In the case of human recognition, the emotion of disgust has the highest rate, and anger is again the lowest. Based on the poor recognition rate for anger in both computer and human recognition, we assume that the samples used for anger are not satisfactory and need to be recollected. Besides, anger has numerous variants (for example, hot anger, cold anger, etc.), which can introduce variability into the acoustic features and dramatically influence the accuracy of recognition [16]. The emotions of happiness, sadness and surprise are confused most with fear in both computer and human recognition, while fear samples are most confused with happiness in computer recognition and with disgust in human recognition. Comparing the confusion matrices for computer and human recognition in general, we can see that the human recognition is more scattered. This is because the human way of interpreting emotion is complex and differs from one individual to another, whereas computer recognition is restricted to the reference data used for training. In this case, the better the reference data, the better the recognition rate achieved. The overall human emotion recognition rate, 62.35%, is in line with the rates of around 65% achieved by other studies [2][16]. Based on this rate, our computer recognition rate, which is slightly higher (68.59%), is considered satisfactory for an emotion recognizer system.

V. CONCLUSION AND FUTURE WORKS

This study serves as a preliminary approach towards automatic emotion recognition using a computer. The results show that a computer recognition rate of 68.59% is sufficient for emotion recognition, based on the recognition rate achieved in the human evaluation. Even though the overall recognition rate for the computer is acceptable, the difference between individual recognition rates is large. This issue requires further study to identify the optimal values of the structural parameters in order to obtain the optimum recognition rate and to minimize the differences between individual recognition rates. Besides that, since the computer recognition depends on the reference data, improving the data is expected to improve the recognition rate. The results of our experiments are limited to the recognition of human voice inputs whose texts are identical to the texts stored in the voice database. Therefore, emotion recognizer systems for natural human voice inputs, which are independent of the texts stored in the voice database, require further study.

VI. REFERENCES

[1] Aishah, A.R. and Komiya, R., "A Preliminary Study of Emotion Extraction from Voice," National Conference on Computer Graphics and Multimedia (CoGRAMM'02), Malacca.
[2] Nakatsu, R., Nicholson, J. and Tosa, N., "Emotion Recognition and Its Application to Computer Agents with Spontaneous Interactive Capabilities," International Congress of Phonetic Sciences.
[3] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz and J.G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18(1), pp. 32-80, Jan. 2001.
[4] Morishima, S. and Harashima, H., "A Media Conversion from Speech to Facial Image for Intelligent Man-Machine Interface," IEEE Journal on Selected Areas in Communications, 1991.
[5] Carlson, R., Granstrom, B. and Nord, L. (1992), "Experiments with emotive speech: acted utterances and synthesized replicas," Proceedings ICSLP 92, Banff, Alberta, Canada, pp. 671-674.
[6] Fairbanks, G. and Pronovost, W. (1939), "An Experimental Study of the Pitch Characteristics of the Voice During the Expression of Emotions," Speech Monographs, 6, 87-104.
[7] Cosmides, L. (1983), "Invariances in the acoustic expression of emotion during speech," Journal of Experimental Psychology: Human Perception and Performance, 9, 864-881.
[8] Banse, R. and Scherer, K.R. (1996), "Acoustic profiles in vocal emotion expression," Journal of Personality and Social Psychology, 70, 614-636.
[9] Tosa, N. and Nakatsu, R. (1996), "Life-like communication agent: emotion sensing character MIC and feeling session character MUSE," Proceedings of the IEEE Conference on Multimedia 1996, pp. 12-19.
[10] Dellaert, F., Polzin, T. and Waibel, A. (1996), "Recognizing emotions in speech," ICSLP 96.
[11] Aishah, A.R., Mohamad Izani, Z.A. and Komiya, R., "Pitch Variation Analysis on Malay and English Voice Samples," to appear in APCC 2003.
[12] Aishah, A.R., Mohamad Izani, Z.A. and Komiya, R., "A Preliminary Speech Analysis for Emotion Recognition," to appear in the IEEE Student Conference on Research and Development (SCOReD 2003).
[13] Childers, D.G. (1999), Speech Processing and Synthesis Toolboxes, John Wiley & Sons, NY.
[14] Ning, T. and Whiting, S. (1990), "Power spectrum estimation via orthogonal transformation," Proc. IEEE Conf. Acoust., Speech, Signal Process., pp. 2523-2526.
[15] M. Hanmandlu, M.H.M. Yusof and Vamsi K. Madasu (2003), "Fuzzy based approach to the recognition of multi-font numerals," 2nd National Conference on Document Analysis and Recognition (NCDAR 2003), Mandya, India.
[16] Valery A. Petrushin (1999), "Emotion in speech: Recognition and application to call centers," Proceedings of the 1999 Conference on Artificial Neural Networks in Engineering (ANNIE '99).
