SPEECH EMOTION VERIFICATION SYSTEM (SEVS) BASED ON MFCC FOR REAL TIME APPLICATIONS
Norhaslinda Kamaruddin and Abdul Wahab
School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798
Tel: 65 6790-4948, email: [email protected]
Keywords: Speech emotion, MFCC, MLP, ANFIS, GenSoFNN.
Abstract
Humans recognize speech emotions by extracting features from the speech signals received through the cochlea and then passing the information on for processing. In this paper we propose the use of the Mel-Frequency Cepstral Coefficient (MFCC) to extract speech emotion information, providing both frequency and time domain information for analysis. Since features extracted using the MFCC simulate the function of the human cochlea, a neural network (NN) and two fuzzy neural network algorithms, namely the Multi-Layer Perceptron (MLP), the Adaptive Network-based Fuzzy Inference System (ANFIS) and the Generic Self-organizing Fuzzy Neural Network (GenSoFNN), were used to verify the different emotions. Experimental results show the potential of using these techniques to detect and distinguish three basic emotions from speech for real-time applications based on features extracted using MFCC.
1 Introduction
Over the decades, much research has been carried out to understand speech emotion from multidisciplinary perspectives. Researchers in neuroscience study how the brain perceives input stimuli by processing the raw data and applying stored knowledge to produce the various responses. Computational intelligence researchers provide tools that express the knowledge gained by neuroscientists as mathematical solutions, while linguists view speech as a source of emotional knowledge conveyed through its semantics and syntax. This combination of interdisciplinary research has shown tremendous potential for understanding speech emotion. Scherer provides a review of the various research paradigms in the vocal communication of emotion based on a modified version of Brunswik's functional lens model of perception [1]. Many other studies have looked into identifying and recognizing emotions or stress based on either a simple spectral analysis or the hidden Markov model [2,3,4]. Features extracted from the pitch, formants, short-term energy and many more have been utilized for emotion or stress recognition. Most of these works either classify the subject's behavior simply as stressed or neutral, or offer limited emotion understanding.
Humans convey emotional information and are able to recognize it. From this understanding, we know that emotion plays a great role in our conversation. Every day, many communication gadgets are developed and launched, and yet emotion verification systems are still not in use. We try to fill this gap by introducing a Speech Emotion Verification System (SEVS) to be embedded in real-time applications such as mobile phones and PDAs. SEVS can help us determine another speaker's emotion and gives us a better understanding of the information the speaker tries to convey. This can also help to improve speech recognition systems, as emotion plays a role in speech and pronunciation. In a real-time system, especially a mobile application, it is crucial to have a very fast processing time or, alternatively, an algorithm that requires little computation. The user would expect the system to detect the emotion of the speaker almost automatically, with a fairly reliable and accurate result. Thus, the processing time must be no more than several seconds with accuracy of up to 80%. Verification offers such fast computation for recognizing a specific emotion. This can be applied particularly in call centers that handle a massive number of calls. In this paper, the Mel-Frequency Cepstral Coefficient (MFCC) feature extraction method is used. One artificial neural network (ANN) based on the multi-layer perceptron (MLP) and two fuzzy neural network (FNN) systems based on the Adaptive Network-based Fuzzy Inference System (ANFIS) and the Generic Self-organizing Fuzzy Neural Network (GenSoFNN) were used as the SEVS. Experimental results show the potential of using the MFCC with the FNN systems for emotion verification with a reasonable processing time and good accuracy.
2 Emotion classification
ANN has emerged as an alternative and practical soft computing tool in recent years, particularly in the field of pattern recognition and classification [5]. Some commonly used ANNs include the Multi-Layer Perceptron (MLP) [5] and Adaptive Resonance Theory (ART) [6]. Two limitations associated with most artificial neural networks are their long training process and the difficulty of determining an optimal boundary when handling real-life data, owing to the ambiguous nature of such data. The rules obtained from a neural network are also hard to comprehend, as the data are treated as if in a black box without knowing what the outcome of the process will be. Therefore, the
In the next section, the concept of the short-time Fourier transform is introduced, followed by feature extraction based on the MFCC. Section 3 will discuss the different classification methods based on the MLP and two FNN algorithms, namely ANFIS and GenSoFNN.
3 Feature Extraction
Figure 1 shows the block diagram of the SEVS. The speech data received is first divided into frames (between 128 and 1024 samples per frame) and windowed using the Hamming window. Each frame of the speech data is then Fourier transformed. To match the filter response of the cochlea, the MFCC is then derived. Features extracted from each frame of the MFCC are then used by the SEVS to recognize and verify the emotions. These features mimic the way the human cochlea extracts features from speech and provide both a representation of the time-varying properties and the cepstral information of the speech data.
3.1. Mel Frequency Cepstral Co-efficient (MFCC)
The Mel-Frequency Cepstral Coefficient (MFCC) is derived from the FFT of the speech signal, where the frequency bands are positioned logarithmically on the Mel scale. This approximates the response of the human auditory system more closely than the linearly spaced frequency bands of the Fourier transform. The cochlea is the organ that translates sound energy into nerve impulses for use by the brain. It does not act only as an input device: the cochlea filters the sound, extracts the features and passes only the needed features (feature selection) to the brain. People with a cochlear impairment cannot differentiate between background noise and voice even with the help of a hearing aid, because they have lost the cochlea's ability to filter the signal [15].
In order to obtain more information, the speech data will be broken up into frames where the windows are usually allowed to overlap in time, typically by 25-80% [16]. The processed speech sequence s(t) is given by the following equation.

$$
s(t) = \cdots + [\,s_{(n-1)N+1+\lambda},\ s_{(n-1)N+2+\lambda},\ \ldots,\ s_{nN},\ s_{nN+1},\ \ldots,\ s_{nN+\lambda}\,]^{T} + \cdots + [\,s_{nN+1},\ \ldots,\ s_{(n+1)N}\,]^{T} + \cdots + [\,s_{(n+1)N-\lambda+1},\ \ldots,\ s_{(n+1)N},\ s_{(n+1)N+1},\ \ldots,\ s_{(n+2)N-\lambda}\,]^{T} + \cdots \tag{1}
$$
where N is the number of samples per frame of data being processed, n is the frame number, λ is the number of samples that overlap between the present and the previous frame of data, and T denotes the transpose. The first set of data in equation (1) represents the previous frame of data, followed by the present frame and lastly the next frame. It can be seen from equation (1) that when there is no overlap (i.e. λ = 0) the string of data s(t) is just a data sequence with N samples per frame. There are many different ways to handle the overlapping data sequence. One of the simplest is the averaging method, where the overlapping sequences from the previous and the present frames are smoothed by taking the average of the two. The Fourier transform of the processed data sequence can reflect the time-varying nature of the signal's spectrum [17]. After each window, the complex result is added to a matrix which records magnitude and phase for each point in time and frequency [18].
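The framing, windowing and Fourier transform steps described above can be sketched as follows. This is a minimal illustration using NumPy, not the authors' implementation: the function names `frame_signal` and `short_time_spectrum`, the frame length of 256 samples and the 50% overlap are assumptions chosen from within the ranges quoted in the text, and only the magnitude of the spectrum is kept.

```python
import numpy as np

def frame_signal(signal, frame_len=256, overlap=0.5):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    hop = int(round(frame_len * (1.0 - overlap)))   # samples between frame starts
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def short_time_spectrum(signal, frame_len=256, overlap=0.5):
    """Hamming-window every frame and return the magnitude of its FFT."""
    frames = frame_signal(signal, frame_len, overlap)
    windowed = frames * np.hamming(frame_len)
    return np.abs(np.fft.rfft(windowed, axis=1))

# Illustrative input: one second of a 1 kHz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)
spec = short_time_spectrum(tone)
print(spec.shape)
```

Each row of `spec` is the magnitude spectrum of one frame, so the matrix records how the spectrum changes over time.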
[Figure 1 block diagram: emotional speech x(t) -> frame and window -> FFT -> cochlear filters 1 to N -> MFCC 1 to N -> emotion classification system]
repeatability of the same experiment is questionable. Thus, fuzzy logic was introduced as an approach to handling vagueness and uncertainty [7]. Fuzzy logic builds its rules using linguistic terms that are easy to understand and provide a basis for repeatability. To obtain the best characteristics of these two concepts, fuzzy neural hybrid systems combine them by applying the learning techniques of neural networks to the learning and identification of fuzzy model parameters, namely the fuzzy sets representing fuzzy concepts or semantics, and the fuzzy rules that link the fuzzy concepts of the input space with the fuzzy concepts of the output space. These systems offer strong generalization ability and fast learning capability from large amounts of data. Fuzzy neural systems such as the Evolving Fuzzy Neural Network (EFuNN), the Generic Self-organizing Fuzzy Neural Network (GenSoFNN) [8] and the Adaptive Network-based Fuzzy Inference System (ANFIS) [9] have been applied in several recognition systems with a high degree of accuracy [10,11,12,13,14]. In this research, MLP, ANFIS and GenSoFNN will be used for the speech emotion verification system (SEVS) based on features extracted in the short-time cepstral domain.
[Figure 1 block labels: microphone (m1); speech emotion learning system; emotions recognized/verified]
Figure 1: The proposed speech emotion verification system.

Following the mechanics of the cochlea, MFCC attempts to capture the perceptually most important parts of the spectral envelope of audio signals. The coefficients are calculated in the following way:
1. Take the Fourier transform of the data sequence and calculate the frequency spectrum.
2. Filter the magnitude spectrum into a number of bands (40 bands for speech), mapping the spectrum obtained above onto the Mel scale with triangular overlapping windows so that low frequencies are given more weight than high frequencies, and sum the frequency contents of each band.
3. Take the logarithm of each band.
4. Compute the Discrete Cosine Transform of the list of Mel log-amplitudes.
5. The MFCCs are the amplitudes of the resulting cepstrum.

The first step reflects the fact that the ear is fairly insensitive to phase information. The averaging in the second step reflects the frequency selectivity of the human ear, and the third step simulates the perception of loudness. Unlike the other steps, the fourth step is not directly related to human sound perception, since its function is to decorrelate the inputs and reduce dimensionality. The last step is the outcome of applying MFCC to the audio signal.

There are numerous implementations of MFCC by researchers such as Davis and Mermelstein [19], Slaney [20], Young et al. [21] and Skowronski and Harris [22], driven by the quest for better speech parameterization. These approaches differ in the number of filters, the shape of the filters, their bandwidth and the manner in which the spectrum is warped, the frequency range of interest, the selection of the actual subset, and the number of MFCC coefficients employed in the classification. Based on the study by Ganchev et al. [23], the approach by Slaney gives slightly better performance than the others. Thus, for the purpose of this study, we adopt the approach of Slaney's Auditory Toolbox for Matlab [20]. Some settings were chosen to conform to the Auditory Toolbox requirements. The sampling frequency used is 8 kHz with a frequency range of 66 Hz to 3427 Hz. The filter bank consists of 40 triangular filters. The centre frequencies of the first 13 filters are linearly spaced in the range [100, 500] Hz with a step of 33 Hz, and those of the next 27 filters are logarithmically spaced in the range [535, 3200] Hz with a step logStep computed by the following equation:

$$
\mathrm{logStep} = \exp\left( \ln\left( \frac{f_{c_{40}}}{500} \right) \Big/ \mathrm{numLogFilt} \right) \tag{2}
$$

where $f_{c_{40}} = 3200$ Hz is the centre frequency of the last of the logarithmically spaced filters and numLogFilt = 27 is the number of logarithmically spaced filters. Each one of these equal-area triangular filters is defined as:

$$
H_i(k) =
\begin{cases}
0 & \text{for } k < f_{b_{i-1}} \\
\dfrac{2\,(k - f_{b_{i-1}})}{(f_{b_i} - f_{b_{i-1}})\,(f_{b_{i+1}} - f_{b_{i-1}})} & \text{for } f_{b_{i-1}} \le k \le f_{b_i} \\
\dfrac{2\,(f_{b_{i+1}} - k)}{(f_{b_{i+1}} - f_{b_i})\,(f_{b_{i+1}} - f_{b_{i-1}})} & \text{for } f_{b_i} \le k \le f_{b_{i+1}} \\
0 & \text{for } k > f_{b_{i+1}}
\end{cases}
\tag{3}
$$
The boundary points $f_{b_i}$ are expressed in terms of position, as specified above. The key to equalization of the area below the filters lies in the term:

$$
\frac{2}{f_{b_{i+1}} - f_{b_{i-1}}} \tag{4}
$$
Owing to the term in (4), the filter bank is normalized in such a way that the sum of the coefficients of every filter equals one. Thus, the i-th filter satisfies:

$$
\sum_{k=1}^{N} H_i(k) = 1 \quad \text{for } i = 1, 2, \ldots, M \tag{5}
$$
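The filter bank defined by equations (2) to (5) can be sketched numerically as follows. This is an illustrative reconstruction from the values quoted in the text, not Slaney's toolbox code: the helper names `slaney_style_boundaries` and `equal_area_filterbank` are hypothetical, the 42 boundary points are assumed to be the 40 centre frequencies plus one extra point at each end, and a 4096-point DFT is assumed so that the discrete sum in (5) is close to one.

```python
import numpy as np

def slaney_style_boundaries(n_linear=13, n_log=27, first_centre=100.0,
                            last_linear_centre=500.0, last_centre=3200.0):
    """Return the M + 2 boundary frequencies in Hz and the log step of equation (2)."""
    lin_step = (last_linear_centre - first_centre) / (n_linear - 1)      # about 33.3 Hz
    log_step = np.exp(np.log(last_centre / last_linear_centre) / n_log)  # equation (2)
    linear = first_centre + lin_step * np.arange(-1, n_linear)           # lower edge + 13 centres
    log = last_linear_centre * log_step ** np.arange(1, n_log + 2)       # 27 centres + upper edge
    return np.concatenate([linear, log]), log_step

def equal_area_filterbank(boundaries_hz, n_fft, fs):
    """Evaluate the triangular filters of equation (3) on the DFT bins 0..n_fft/2."""
    pos = boundaries_hz * n_fft / fs           # boundary points as DFT positions
    k = np.arange(n_fft // 2 + 1)[None, :]     # DFT bin index
    lo, mid, hi = pos[:-2, None], pos[1:-1, None], pos[2:, None]
    rising = 2.0 * (k - lo) / ((mid - lo) * (hi - lo))
    falling = 2.0 * (hi - k) / ((hi - mid) * (hi - lo))
    # Outside [lo, hi] one of the two branches is negative, so clipping gives zero.
    return np.clip(np.minimum(rising, falling), 0.0, None)

boundaries, log_step = slaney_style_boundaries()
H = equal_area_filterbank(boundaries, n_fft=4096, fs=8000)
row_sums = H.sum(axis=1)   # equation (5): each should be close to one
print(len(boundaries), H.shape, round(float(log_step), 5))
```

With these values the lowest and highest boundary points fall near 66 Hz and 3427 Hz, which agrees with the frequency range stated above. With a short DFT the lowest filters span only a few bins, so the discrete sums deviate further from one.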
Next, the equal-area filter bank is employed in the computation of the log-energy output using the following equation:

$$
X_i = \log_{10}\left( \sum_{k=0}^{N-1} X(k) \cdot H_i(k) \right) \quad \text{for } i = 1, 2, \ldots, M \tag{6}
$$

Finally, the DCT in equation (7) provides Slaney's MFCC parameters.

$$
C_j = \sum_{i=1}^{M} X_i \cdot \cos\left( j \cdot (i - 0.5) \cdot \frac{\pi}{M} \right) \quad \text{for } j = 1, 2, \ldots, J \tag{7}
$$
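Equations (6) and (7) can be checked with a short numerical sketch. The function names `log_energies` and `dct_cepstrum` are hypothetical, X(k) is assumed to be the magnitude spectrum of one frame, and the toy filter bank below is made up purely to show the arithmetic.

```python
import numpy as np

def log_energies(magnitude_spectrum, filterbank):
    """Equation (6): log10 of the filter-weighted sum of the spectrum, one value per filter."""
    return np.log10(filterbank @ magnitude_spectrum)

def dct_cepstrum(log_e, n_coeff):
    """Equation (7): C_j = sum_i X_i * cos(j * (i - 0.5) * pi / M) for j = 1..n_coeff."""
    M = len(log_e)
    i = np.arange(1, M + 1)                   # filter index 1..M
    j = np.arange(1, n_coeff + 1)[:, None]    # cepstral index 1..J
    return (np.cos(j * (i - 0.5) * np.pi / M) * log_e).sum(axis=1)

# Toy example: two filters over four bins.
toy_bank = np.array([[0.5, 0.5, 0.0, 0.0],
                     [0.0, 0.0, 5.0, 5.0]])
toy_log = log_energies(np.ones(4), toy_bank)   # log10 of [1, 10]

# A flat log-spectrum carries no shape information, so all C_j with j >= 1 vanish.
flat = dct_cepstrum(np.ones(40), 12)

# A log-spectrum equal to the first cosine basis function excites only C_1.
basis = np.cos((np.arange(1, 41) - 0.5) * np.pi / 40)
single = dct_cepstrum(basis, 12)
print(toy_log, single[0])
```

The second and third examples illustrate the decorrelating role of the DCT mentioned in the text: each coefficient responds to one cosine-shaped pattern across the 40 log-energies.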
4 Classification algorithms
Once the features have been extracted from the speech using the MFCC, the speech emotion is recognized using the MLP [5] in initial experiments to determine the baseline accuracy of the SEVS. The MLP combines several layers of perceptrons that are interconnected and exhibit a high degree of connectivity determined by the synapses of the network. It consists of three main layers, known as the input layer, the hidden layer and the output layer. The data is presented to the network at the input layer, so the number of input neurons must equal the number of features in the data. Each input is weighted by the network and passed to the hidden layer, where the nonlinear calculation is carried out with the activation function. Consider an MLP network with d inputs and m outputs. The input and output vectors are respectively:

$$
x = [x_1, x_2, \ldots, x_d]^{T}, \qquad o = [o_1, o_2, \ldots, o_m]^{T} \tag{8}
$$

If we denote $w_{ij}$ as the weight that connects output neuron i with input neuron j, the activation value for output neuron i can be shown as (9)
In equation (3), $H_i(k) = 0$ for $k > f_{b_{i+1}}$, where $i = 1, 2, \ldots, M$ stands for the i-th filter, $f_{b_i}$ are the $M + 2$ boundary points that specify the $M$ filters, and $k = 1, 2, \ldots, N$ corresponds to the k-th coefficient of the N-point DFT.
$$
v_i = \sum_{j=1}^{d} w_{ij} x_j \tag{9}
$$

After receiving the activation, the neurons process the activation signal. The output $o_i$ is simply the result of applying a nonlinear activation function $\varphi$ to $v_i$, as in (10)

$$
o_i = \varphi(v_i) = \varphi\left( \sum_{j=1}^{d} w_{ij} x_j \right) \tag{10}
$$
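Equations (9) and (10) amount to a matrix-vector product followed by an element-wise nonlinearity. The sketch below is illustrative only: the weights and inputs are made-up values, the function names are hypothetical, and the logarithmic sigmoid is used for φ because the experiments later in the paper use it in the hidden layers.

```python
import numpy as np

def logsig(v):
    """Logarithmic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-v))

def layer_output(W, x, phi=logsig):
    """Equation (9) followed by equation (10) for one layer."""
    v = W @ x        # v_i = sum_j w_ij * x_j
    return phi(v)    # o_i = phi(v_i)

# Illustrative values: d = 2 inputs, m = 2 output neurons.
W = np.array([[0.5, -0.5],
              [1.0,  1.0]])
x = np.array([1.0, 1.0])
o = layer_output(W, x)   # activations v = [0, 2]
print(o)
```

Stacking such layers, with the output of one layer used as the input of the next, gives the forward pass of the MLP.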
The network is tuned to convergence using the backpropagation algorithm, with the mean square error used to evaluate performance.
4.1. ANFIS
The Adaptive Network-based Fuzzy Inference System (ANFIS) [9] combines a neural network and a fuzzy system to optimize the parameters of a given fuzzy inference system by applying a learning procedure to the training dataset. It uses a hybrid algorithm: a mixture of least mean square estimation to determine the consequent parameters and backpropagation to learn the premise parameters.
that has the highest density measure. Then, the density measure of each data point is revised and reduced repeatedly until a stopping criterion is met. ANFIS uses the resulting clusters to approximate the input values instead of taking the crisp values.
4.2. GenSoFNN
The Generic Self-organizing Fuzzy Neural Network (GenSoFNN) [8] was developed to extract a set of high-level semantic IF-THEN fuzzy rules from the numeric training data presented to it. It employs an incremental and dynamic approach to the identification of the knowledge (fuzzy rules) that defines its connectionist structure. ANFIS offers three types of clustering, namely grid partitioning, subtractive clustering and fuzzy c-means clustering, all of which need some fixed information to be supplied before clustering can be done. In contrast, GenSoFNN does not require prior knowledge of the number of clusters present in the training data set. By employing the Discrete Incremental Clustering (DIC) technique, separate clusters are created for noisy or spurious data that correlate poorly with the genuine data, which increases noise robustness. It also offers dynamic clustering, in which data points can move from one cluster to another under varying conditions, rather than static clustering, in which a point cannot change cluster once it has been committed to one, as in the early stage of ANFIS.
Each step of the learning process has two passes. In the first pass, the training data is brought to the inputs, the premise parameters are assumed to be fixed and the optimal consequent parameters are estimated by an iterative least mean squares procedure. In the second pass, the patterns are propagated again, but this time the consequent parameters are assumed to be fixed and backpropagation is used to modify the premise parameters. This enables fast convergence and guarantees that the consequent parameters are at the global minimum in the consequent parameter space.
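The first pass can be illustrated with a small sketch. With the premise parameters fixed, the output of a first-order Sugeno system is linear in the consequent parameters, so they can be found by least squares. The code below is a simplified illustration and not the ANFIS implementation used in the experiments: it assumes Gaussian membership functions, product firing strengths and two rules, it uses a single batch least-squares solve (`numpy.linalg.lstsq`) in place of the iterative procedure mentioned above, and all names and parameter values are hypothetical.

```python
import numpy as np

def normalized_firing(x, y, premise):
    """Normalized firing strength of every rule; premise holds (centre_x, centre_y, sigma)."""
    w = np.stack([np.exp(-0.5 * ((x - cx) / s) ** 2) * np.exp(-0.5 * ((y - cy) / s) ** 2)
                  for cx, cy, s in premise], axis=1)
    return w / w.sum(axis=1, keepdims=True)

def fit_consequents(x, y, target, premise):
    """First pass: premise fixed, solve for the (p, q, r) of every rule by least squares."""
    wn = normalized_firing(x, y, premise)                # shape (n_samples, n_rules)
    inputs = np.stack([x, y, np.ones_like(x)], axis=1)   # columns x, y, 1
    A = (wn[:, :, None] * inputs[:, None, :]).reshape(len(x), -1)
    theta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return theta.reshape(len(premise), 3)

# Synthetic check: generate targets from known consequents and recover them.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 3.0, 200)
y = rng.uniform(-1.0, 3.0, 200)
premise = [(0.0, 0.0, 1.0), (2.0, 2.0, 1.0)]
true_params = np.array([[1.0, 1.0, 0.0],
                        [2.0, 0.0, 1.0]])
wn = normalized_firing(x, y, premise)
rule_out = true_params[:, 0] * x[:, None] + true_params[:, 1] * y[:, None] + true_params[:, 2]
target = (wn * rule_out).sum(axis=1)
estimated = fit_consequents(x, y, target, premise)
print(np.round(estimated, 3))
```

Because the problem is linear in the consequent parameters, the least-squares solution is the global minimum for the given premise parameters, which is the property the paragraph above refers to.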
GenSoFNN also adopts Mamdani's fuzzy model [25], and the k-th fuzzy rule $R_k$ has the form defined in (11).
ANFIS adopts the first-order Sugeno model architecture due to its efficiency and transparency. For simplicity, we assume that the fuzzy inference system has two inputs, x and y, and one output, f. To present the ANFIS architecture, the following two fuzzy IF-THEN rules based on a first-order Sugeno model are considered:
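$$
R_k:\ \text{IF } x_1 \text{ is } IL_{(1,j)_k} \ldots \text{ and } x_i \text{ is } IL_{(i,j)_k} \ldots \text{ and } x_{n1} \text{ is } IL_{(n1,j)_k}\ \text{ THEN } y_1 \text{ is } OL_{(l,1)_k} \ldots \text{ and } y_m \text{ is } OL_{(l,m)_k} \ldots \text{ and } y_{n5} \text{ is } OL_{(l,n5)_k} \tag{11}
$$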
Rule 1: IF (x is A1) and (y is B1) THEN f1=p1x + q1y + r1
where $IL_{(i,j)_k}$ is the j-th fuzzy label of the i-th input that is connected to $R_k$.
Rule 2: IF (x is A2) and (y is B2) THEN f2=p2x + q2y + r2
where x and y are the inputs, Ai and Bi are the fuzzy sets, fi are the outputs within the fuzzy region specified by the fuzzy rule, and pi, qi and ri are the design parameters that are determined during the training process. The task of the learning algorithm for this architecture is to tune all the modifiable parameters so that the ANFIS output matches the training data.
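The two rules above can be evaluated with a few lines of code. The sketch below is only an illustration under stated assumptions: Gaussian membership functions for A1, A2, B1 and B2, the product operator for "and", and a weighted average of the rule outputs. The membership and consequent parameters are made-up values and the function names are hypothetical.

```python
import numpy as np

def gauss(z, centre, sigma):
    """Gaussian membership function (an assumed shape for the fuzzy sets Ai and Bi)."""
    return np.exp(-0.5 * ((z - centre) / sigma) ** 2)

def sugeno_two_rules(x, y, premise, consequent):
    """Weighted average of the rule outputs f_i = p_i*x + q_i*y + r_i."""
    w = np.array([gauss(x, *a) * gauss(y, *b) for a, b in premise])   # firing strengths
    f = np.array([p * x + q * y + r for p, q, r in consequent])       # rule outputs
    return float(np.dot(w, f) / w.sum())

# Premise: ((centre, sigma) of Ai, (centre, sigma) of Bi) for each rule.
premise = [((0.0, 1.0), (0.0, 1.0)),
           ((2.0, 1.0), (2.0, 1.0))]
# Consequent: (p_i, q_i, r_i) for each rule.
consequent = [(1.0, 1.0, 0.0),
              (2.0, 0.0, 1.0)]

midpoint = sugeno_two_rules(1.0, 1.0, premise, consequent)   # both rules fire equally
near_rule1 = sugeno_two_rules(0.0, 0.0, premise, consequent) # rule 1 dominates
print(midpoint, near_rule1)
```

At the point (1, 1) both rules fire with the same strength, so the output is the plain average of f1 = 2 and f2 = 3. Close to (0, 0) the output is pulled towards f1.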
Generally, when the number of rules is larger than 3, it becomes very difficult to describe the rules manually and reach the precision needed with the minimum number of membership functions [9]. Therefore, subtractive clustering [24] is used to estimate the number of fuzzy clusters and the cluster centers. Each data point is considered as a potential cluster center, and the clusters are formed based on the density measure within a defined radius of neighborhood (the cluster radius). A large radius means that a small number of clusters will be formed, and vice versa. The first cluster center is chosen to be the data point
In rule $R_k$ of (11), $OL_{(l,m)_k}$ is the l-th fuzzy label of the m-th output to which $R_k$ is connected.
The use of the Mamdani fuzzy model provides an intuitive and easy-to-understand rule base compared with the Takagi-Sugeno-Kang model adopted in ANFIS.
5 Experimental results and discussions 5.1. Data Collection
Data was collected from video clips of movies and television sitcoms obtained from the Internet. The six emotions were chosen following the work of Cornelius [26], which established a list of basic emotions consisting of happy, angry, disgust, surprise, sad and neutral. The speakers use English as the medium of conversation. The emotions portrayed by the speakers were analyzed and identified based on the speech semantics, the facial expression of the speaker and a basic understanding of the situation in which the audio clip occurs. The video clips were converted to mp3 audio files sampled at 8,000 samples per second, mono, with amplitude normalized to 1 V. Listening comprehension surveys were then conducted to test the emotion that humans perceive in the audio clips. Eleven subjects, 9 male and 2 female, aged between 20 and 50 years, took part in the questionnaire. Each subject listened to each audio clip without any prior information about it. The audio clip file names were designed to give little information to the subjects in order to avoid bias. The subject selected the emotion perceived, and the overall identification accuracy was then calculated. Based on the results of the listening survey, the five most accurately identified clips that matched the researchers' classification were selected for each emotion for further exploration. Table 1 shows that most of the subjects could not differentiate the Disgust and Surprise emotions well. In fact, only 22% of the subjects perceived the Disgust emotion correctly, while 29% perceived it as Angry. A similar situation occurs with the Surprise emotion, which only 56% of the subjects perceived correctly. For this reason, Disgust and Surprise are excluded from our experiments. Thus, only Happy, Angry and Sad are used in our experiments, with Neutral serving as the emotionless state.
activation function is the logarithmic sigmoid function, whereas the output layer uses a pure linear function.

Table 2: Local Validation Result on Training Time for the Emotions using MLP.

                     Training time (s)
Number of neurons    One layer: Sad, Angry, Happy    Two layers: Sad, Angry, Happy
10                   270.92   247.07   245.58   5.828    7.751    9.828
15                   323.75   1056.3   586.26   15.882   14.048   39.673
20                   338.42   229.42   859.19   37.188   28.315   67.516
25                   59.81    87.43    178.42   81.313
30                   42.846   85.938   71.11    96.579
35                   78.235   108.55   156.6    250.53

A preliminary experiment using the Sad dataset against the complementary Sad dataset gives a rough idea of the best number of neurons and layers to use for the Happy and Angry emotions and their complementary datasets, with emphasis on 10 to 35 neurons for the one-layer architecture and 10 to 20 neurons for the two-layer architecture. Both testing and training use the same dataset, which gives a local validation result indicating whether the network can classify the dataset correctly. The results of the preliminary experiments in Table 2 indicate that the MLP architecture with 10 neurons in two hidden layers trains faster than the MLP with 30 neurons in one hidden layer. The experiments also show that an accuracy of up to 99% can be achieved. Thus, the MLP structure with two hidden layers and 10 neurons will be used for further experiments.
Table 1: Confusion Matrix of User Survey Result

                      Perceived emotion (%)
Emotion audio clip    Happy    Angry    Disgust    Surprise    Sad     Neutral
Happy                 76.4     0        0          9.1         0       14.5
Angry                 0        89.1     5.5        0           3.6     3.0
Disgust               3.6      29.0     21.8       14.5        1.8     20.0
Surprise              5.5      5.5      1.8        56.4        1.8     23.6
Sad                   0        0        0          0           98.2    1.8
Neutral               0        0        0          1.2         0       98.8
Table 3: K-fold Global Training Time and Accuracy Validation Result for Different Emotions Using MLP.

                 Training time (s)           Accuracy (%)
K-fold number    Angry    Sad      Happy     Angry    Sad      Happy
400              5.88     9.64     11.84     87.00    87.75    76.75
800              7.75     9.38     11.78     88.50    88.25    75.00
1200             6.63     10.24    9.28      88.00    80.75    79.50
1600             8.27     7.63     12.02     89.50    83.75    79.00
2000             4.19     8.80     10.08     88.25    91.25    76.96
Average          6.54     9.14     11.00     88.25    86.35    77.45
5.2. Emotion Verification Using MLP
Selecting the number of layers and the number of neurons for the MLP architecture is required to ensure optimum performance. Experiments were carried out using one and two hidden layers, with the number of neurons in each layer ranging from 5 to 40. The convergence goal was set to a mean square error of 0.01 with a learning rate of 0.001. A high learning rate may cause the network output to be unstable and unreliable, while too small a learning rate may lead to an excessively long computation time. We also set the number of epochs to 1000, as the network needs an appropriate number of iterations to ensure that it can converge. The network has 40 input neurons, matching the number of features extracted using MFCC, and one output. The
In order to gain a better understanding of the speech emotion verification system (SEVS), we use K-fold validation for our global validation. The dataset and its desired results are randomized and split into 5 folds. This is required to eliminate any bias in the data. The splitting process produces different training and testing datasets. Each dataset consists of 2000 instances, of which 1600 (80%) are used for training and the remaining 400 (20%) for testing. From the results shown in Table 3, the average training time ranges from 7 to 11 seconds, with a testing time of less than 0.05 second, a mean square error of 0.01 and an accuracy of no less than 75%. Table 3 indicates the potential of using MLP for emotion verification, with accuracy ranging from 77% to 88% within 7 to 11 seconds of training time.
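The 5-fold split described above can be sketched as follows. This is a generic illustration rather than the authors' code: the function name `k_fold_indices` and the fixed random seed are assumptions, and only the index bookkeeping is shown, not the training itself.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and yield (train, test) index arrays for each fold."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)    # randomize the dataset order
    folds = np.array_split(order, k)      # k folds of (nearly) equal size
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

splits = list(k_fold_indices(2000, k=5))
for train, test in splits:
    print(len(train), len(test))
```

With 2000 instances and 5 folds, every split uses 1600 instances for training and 400 for testing, and each instance appears in exactly one test fold.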
5.3. Verification Using ANFIS
5.4. Verification using GenSoFNN
ANFIS offers three approaches to identifying clusters, namely grid partitioning, subtractive clustering and fuzzy c-means clustering. The grid partitioning approach is useful if the number of features is no more than 6 or 7. If the number of features is too high, this method cannot be used, as the available memory will be insufficient when using MATLAB. For the fuzzy c-means method, the number of clusters in the dataset needs to be specified. Since no prior knowledge of the number of clusters is available, subtractive clustering is the ideal choice. The radii of the clusters are used as the basis of cluster formation.
Similar experiments were also carried out using the GenSoFNN architecture based on the Compositional Rule of Inference (CRI) fuzzy inference scheme proposed by Zadeh [27]; the results are shown in Table 5. Several values were set for initialization, such as a Root Mean Square Error (RMSE) of 0.01 and backpropagation as the training algorithm.
A range of radii from 1.7 to 0.7 was chosen in order to produce the optimum number of clusters for the initial fuzzy inference system (FIS). The initial FIS was then fed to ANFIS for refinement, so that supervised learning iteratively tunes it and generates more detailed rules. In the experiments, an RMSE of 0.2 with a step size of 0.5 was used.
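The effect of the cluster radius can be illustrated with a simplified subtractive clustering sketch. This is not the MATLAB routine used in the experiments: the function name, the squash factor of 1.5, the stopping ratio of 0.15 and the synthetic data are all assumptions, and the data is used as-is without normalization, so the radii below are not comparable to the 0.7 to 1.7 range quoted above.

```python
import numpy as np

def subtractive_clustering(X, radius, squash=1.5, stop_ratio=0.15, max_centres=20):
    """Pick cluster centres by repeatedly taking the densest point and reducing nearby densities."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared pairwise distances
    density = np.exp(-d2 / (radius / 2.0) ** 2).sum(axis=1)   # density measure of every point
    first_peak = density.max()
    centres = []
    while len(centres) < max_centres:
        c = int(density.argmax())
        if density[c] < stop_ratio * first_peak:              # stopping criterion
            break
        centres.append(X[c])
        # Revise the densities: points close to the new centre are reduced the most.
        density = density - density[c] * np.exp(-d2[c] / (squash * radius / 2.0) ** 2)
    return np.array(centres)

# Two small, well separated groups of points around (0, 0) and (1, 0).
offsets = np.array([[dx, dy] for dx in (-0.05, 0.0, 0.05) for dy in (-0.05, 0.0, 0.05)])
X = np.vstack([offsets, offsets + np.array([1.0, 0.0])])

small_radius_centres = subtractive_clustering(X, radius=0.5)
large_radius_centres = subtractive_clustering(X, radius=5.0)
print(len(small_radius_centres), len(large_radius_centres))
```

On this toy data the small radius separates the two groups, while the large radius merges them into a single cluster, which matches the statement that a larger radius yields fewer clusters.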
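[Figure 2 plot: classification accuracy (%), from 88 to 95, against cluster radii from 0.77 to 1.57, with one curve for each fold: angry400, angry800, angry1200, angry1600 and angry2000]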
Table 5: K-fold Global Training Time and Accuracy Validation Result for Different Emotions Using GenSoFNN.

Table 5 clearly shows the potential of using GenSoFNN, as the accuracy can reach up to 95%. Unfortunately, the training time required is almost three minutes per emotion, with RMSE values ranging from 0.02 to 0.13. Since training is only required once, GenSoFNN can still be used for real-time applications.
6 Summary, conclusions and future work
Figure 2: Performance of ANFIS Using Different Value of Radii in Angry Dataset

In comparison to Table 3, Table 4 shows the global training time and accuracy of the SEVS using the ANFIS architecture, with the number of clusters shown in brackets. From Table 4, the best classification accuracy is obtained with datasets using 2 to 7 clusters. It was also observed that ANFIS can achieve an accuracy of up to 93% with a computation time of between 12 and 17 seconds.

Table 4: K-fold Global Training Time, Number of Cluster and Accuracy Validation Result for Different Emotion Using ANFIS.

                 Training time (s) and number of clusters    Accuracy (%)
K-fold number    Angry       Sad         Happy               Angry    Sad     Happy
400              14.85(5)    14.27(4)    14.20(5)            92.5     89.3    87.0
800              14.20(5)    14.24(4)    6.70(2)             93.0     89.8    85.3
1200             10.19(4)    14.30(4)    24.90(7)            92.8     87.8    83.8
1600             32.36(6)    14.27(4)    6.74(3)             92.0     89.5    87.5
2000             14.20(4)    6.83(3)     6.81(3)             94.5     89.8    82.8
Average          17.16(5)    12.78(4)    11.87(4)            93.0     89.0    85.3
Experimental results using MLP, ANFIS and GenSoFNN have shown the great potential of using MFCC for speech emotion verification. In fact, for all three networks, a verification accuracy of more than 77% can be achieved using all 40 features obtained with the MFCC feature extraction method. These results indicate the practicality of using MFCC coupled with either an NN or an FNN for a speech emotion verification system in real-time mobile applications such as PDAs and mobile phones.
                 Training time (s)               Accuracy (%)
K-fold number    Angry     Sad        Happy      Angry    Sad     Happy
400              185.16    183.67     162.23     95.7     91.0    90.3
800              137.42    165.72     156.44     94.5     100     90.5
1200             153.74    137.14     182.33     94.0     92.3    89.3
1600             157.05    168.28     158.78     94.5     94.3    88.0
2000             168.86    182.83     180.59     94.8     91.0    88.7
Average          160.45    167.528    168.07     94.7     93.7    89.4
[Figure 3 plot: verification accuracy (%), from 70 to 100, for the Angry, Sad and Happy emotions, comparing MLP, ANFIS and GenSoFNN]
Figure 3: Overall Accuracy of Verification by MLP and ANFIS using 40 features ST-MFCC

Figure 3 summarizes the results obtained from the experiments. The GenSoFNN architecture clearly produces the best accuracy of the three. This agrees with our understanding that a fuzzy neural system is better at handling vague and non-linear information, much like the way our brain verifies emotion. However, the drawback of GenSoFNN is that its training time is about 10 times longer than that of ANFIS, because of the DIC technique adopted for clustering the input data. In this paper, we use the 40 features extracted from the MFCC for the experiments, which contributes to the high training time needed for better accuracy. To reduce the computation, a subset of the features may be used to cut down the training time while retaining reasonable accuracy for a specific application. Methods for feature ranking and selection that reduce the number of features used for SEVS can be investigated in future work to balance training time against accuracy.
References
[1] K. R. Scherer. Vocal Communication of Emotion: A Review of Research Paradigms. Speech Communication, vol. 40, pp. 227-256, 2003.
[2] D. Ververidis and C. Kotropoulos. Emotional Speech Recognition: Resources, Features and Methods. Speech Communication, vol. 48, pp. 1162-1181, 2006.
[3] B. D. Womack and J. H. L. Hansen. N-channel Hidden Markov Model for Combined Stressed Speech Classification and Recognition. IEEE Trans. Speech Audio Processing, 7 (6), pp. 668-677, 1999.
[4] G. Zhou, J. H. L. Hansen and J. F. Kaiser. Non-Linear Feature based Classification of Speech under Stress. IEEE Trans. Speech Audio Processing, 9 (3), pp. 201-216, 2001.
[5] C. Bishop. Neural Networks for Pattern Recognition. Oxford: Clarendon Press, 1997.
[6] G. A. Carpenter and S. Grossberg. The ART of Adaptive Pattern Recognition by a Self-organizational Neural Network. Computer, 21 (3), pp. 77-88, 1988.
[7] L. A. Zadeh. Fuzzy Logic. Computer, Vol. 21, No. 4, pp. 83-93, 1988.
[8] W. L. Tung and C. Quek. GenSoFNN: A Generic Self-Organizing Fuzzy Neural Network. IEEE Trans. Neural Networks, Vol. 13, pp. 1075-1086, 2002.
[9] J. S. Jang. ANFIS: Adaptive-network-based Fuzzy Inference Systems. IEEE Trans. Systems, Man, Cybernetics, Vol. 23, pp. 665-685, 1993.
[10] G. S. Ng, D. M. Shi and W. Abdul. Application of EFuNN for the Classification of Handwritten Digits. International Journal of Computers, Systems and Signals, 2005.
[11] L. Zhang. Associative Memory and Fuzzy Neural Network for Speaker Recognition. Honor Year Project Report, School of Computer Engineering, Nanyang Technological University, 2004.
[12] H. Hui, J. H. Li, F. J. Song and J. Widjaja. ANFIS-based Fingerprint Matching Algorithm. Optical Engineering, SPIE-International Society for Optical Engine, Volume 43, Issue 8, pp. 1814-1819, 2004.
[13] W. L. Tung, C. Quek and P. Y. K. Cheng. GenSo-EWS: A Novel Neural-Fuzzy based Early Warning System for Predicting Bank Failures. Neural Networks, Vol. 17, pp. 567-587, 2004.
[14] W. L. Tung and C. Quek. GenSo-FDSS: A Neural-fuzzy Decision Support System for Pediatric ALL Cancer Subtype Identification Using Gene Expression Data. Artificial Intelligence in Medicine, Vol. 33, pp. 61-88, 2005.
[15] Cochlea. Hearing Loss Association of North Carolina. Accessed August 2007.
[16] J. B. Allen and L. R. Rabiner. A Unified Approach to Short-time Fourier Analysis and Synthesis. Proc. IEEE, vol. 65, pp. 1558-1564, 1977.
[17] S. Qian and D. Chen. Joint Time-Frequency Analysis: Methods and Applications. Prentice Hall, 2006.
[18] J. W. Cooley and J. W. Tukey. An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation, Vol. 19, pp. 297-301, 1965.
[19] S. B. Davis and P. Mermelstein. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 28 (4), pp. 357-366, 1980.
[20] M. Slaney. Auditory Toolbox: Version 2. Technical Report #1998-010, Interval Research Corporation, 1998.
[21] S. J. Young, J. Odell, D. Ollason, V. Valtchev and P. Woodland. The HTK Book, Version 2.1. Department of Engineering, Cambridge University, UK, 1995.
[22] M. D. Skowronski and J. G. Harris. Exploiting Independent Filter Bandwidth of Human Factor Cepstral Coefficients in Automatic Speech Recognition. Journal of the Acoustical Society of America, 116 (3), pp. 1774-1780, 2004.
[23] T. Ganchev, N. Fakotakis and G. Kokkinakis. Comparative Evaluation of Various MFCC Implementations on the Speaker Verification Task. 10th International Conference on Speech and Computer (SPECOM 2005), vol. 1, pp. 191-194, 2005.
[24] R. Yager and D. Filev. Generation of Fuzzy Rules by Mountain Clustering. Journal of Intelligent Fuzzy System, vol. 2, pp. 209-219, 1994.
[25] E. H. Mamdani. Application of Fuzzy Logic to Approximate Reasoning Using Linguistic Systems. IEEE Trans. Computing, Vol. C-26, pp. 1182-1191, 1977.
[26] R. R. Cornelius. The Science of Emotion: Research and Tradition. The Psychology of Emotion. Prentice-Hall, Upper Saddle River, NJ, 1996.
[27] L. A. Zadeh. Calculus for Fuzzy Restriction. In Fuzzy Sets and Their Application to Cognitive and Decision Processes. New York: Academic, pp. 1-39, 1975.