Emotional Affect Estimation Using Video and EEG Data in Deep Neural Networks

Arvid Frydenlund and Frank Rudzicz

Department of Computer Science, University of Toronto, Toronto, ON, Canada
Toronto Rehabilitation Institute-UHN, Toronto, ON, Canada
[email protected], [email protected]
Abstract. We present a multimodal system for subject-independent affect recognition using deep neural networks. Using the DEAP data set, features are extracted from EEG and other physiological signals, as well as videos of participant faces. We introduce both a novel way of extracting video features using sum-product networks and a unique method of creating extra training examples from data that would otherwise have been lost in downsampling. Deep neural networks are used for estimating the emotional dimensions of arousal, valence, and dominance, along with favourability and familiarity. This work lays the foundation for future work in estimating emotional responses from physiological measurements.

Keywords: Affect recognition · Deep neural networks · Emotional analysis · EEG
1 Introduction
Emotion- or affect-recognition is the task of estimating emotional responses to some stimulus. Humans typically recognize emotion by integrating multiple modalities, whether from vocal intonation, perspiration, or facial expressions. Different modalities also offer different benefits and challenges for affect recognition, both in terms of their descriptive power and in terms of how practical or economical they are to capture. Electroencephalographs (EEGs) measure activity in the brain by amplifying electrical brain waves created in psycho-physiological processes such as experiencing an emotion. One motivation for using EEG for affect recognition is its use in augmented human-computer interfaces, especially in instances where physical impairments prevent the use of traditional interfaces. In this work we present a multimodal system for subject-independent affect recognition using deep neural networks and the DEAP data set [9]. Our novel approach is significantly more accurate than two baselines on this task.
1.1 The DEAP Dataset
This work uses the DEAP data set introduced by Koelstra et al. [9]. DEAP is a multi-modal dataset created for affect recognition based on emotional responses
by participants to music videos. The modalities recorded include 32-channel EEG and various other physiological signals including galvanic skin response, electromyogram, and electrooculography. Part of the data set also includes video of participant faces during stimuli, along with links to the music video stimuli. Emotional response to each music video is quantified by the participants themselves after watching it. This quantification uses the three-dimensional affect space described by Russell [13]. Each dimension ranges from negative to positive and, in DEAP, is represented on a continuous scale of 1 to 9. The first dimension is valence, which is a measure of intrinsic qualities of attractiveness/repulsion or pleasantness/unpleasantness, and ranges from 'sad' (1) to 'happy' (9). The second dimension is arousal, which is a measure of activity or excitement, ranging from 'bored' (1) to 'excited/alert' (9). Finally, the third dimension is dominance, which is a measure of the feeling of control one has and ranges from 'helpless' (1) to 'empowered' (9). Specific emotions are then areas or points in this affect space; for example, sadness would have low valence, low arousal, and low dominance, while fear would have low valence, high arousal, and low dominance. Participants were also asked to rate their favourability, or how much they 'liked' the music video, on a 1 to 9 scale and their familiarity with the video on a 1 to 5 scale.
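To make the annotation scheme concrete, the following is a minimal Python sketch (ours, not part of DEAP or the paper) of how a trial's self-assessment ratings could be represented; the class and field names are our own illustration, and the example points simply encode the sadness and fear descriptions above.

```python
# A minimal sketch of a DEAP self-assessment record; names are illustrative only.
from dataclasses import dataclass


@dataclass
class TrialRatings:
    valence: float      # 1 (sad) to 9 (happy)
    arousal: float      # 1 (bored) to 9 (excited/alert)
    dominance: float    # 1 (helpless) to 9 (empowered)
    liking: float       # favourability, 1 to 9
    familiarity: float  # 1 to 5

    def __post_init__(self):
        # Check that every rating falls inside its allowed range.
        for name, lo, hi in [("valence", 1, 9), ("arousal", 1, 9),
                             ("dominance", 1, 9), ("liking", 1, 9),
                             ("familiarity", 1, 5)]:
            value = getattr(self, name)
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} outside [{lo}, {hi}]")


# Discrete emotions as points/regions of the affect space, e.g. sadness vs. fear:
sadness = TrialRatings(valence=2, arousal=2, dominance=2, liking=3, familiarity=1)
fear = TrialRatings(valence=2, arousal=8, dominance=2, liking=3, familiarity=1)
```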
1.2 Related Work
Much of the previous work using DEAP involves classification, where classes are formed by partitioning the affect space into categorized regions. Works that use only EEG and the other non-video signals include [4,7,8,19,20]. Wichakam and Vateekul used separate binary classes for valence and arousal [20], where each dimension was bisected and classified separately. Chung and Yoon used both binary and ternary classes and reported 66.6% and 66.4% accuracy for valence and arousal, respectively, on the binary-class problem and 53.4% and 51.0% on the three-class problem [4]. Jie et al. partitioned the valence and arousal space into quadrants [7]. They reported 95% accuracy for subject-dependent classification and 70% for subject-independent classification, using sequential backwards-selection for feature reduction and pseudo-inverse linear discriminant analysis for classification. Jirayucharoensak et al. used separate ternary classes and a deep-learning model based on stacked autoencoders [8], and Wang and Shang used separate binary classes and deep-belief networks to obtain accuracies of 60.9% and 51.2% for arousal and valence, respectively [19]. Torres-Valencia et al. used videos of the participants' faces [17], extracting the mean shape of the face using fiducial points; they found that no features derived from the videos survived their feature selection. Both Koelstra et al. [9] and Acar [1] included acoustic features of the music videos themselves, and Acar et al. would later also add video features from those data [2]. In both of Acar's experiments, using features extracted from the music videos improved classifier performance. Favourability was classified by Wichakam and Vateekul [20] and by Wang and Shang [19], with accuracies of 66.8% and 68.4%, respectively.
Relatively little work has been done using DEAP for regression tasks. Garcia et al. measured the autoregression of the signals and passed those as observations into hidden Markov models, which were mapped by Fisher kernels into a new space where support vector regression was used to perform inference [6]. They performed subject-independent cross-validation and regressed six output variables: negative and positive valence/arousal, pleasant and unpleasant valence, and active and passive arousal. They reported values of approximately 0.68 root mean squared error (RMSE). Similarly, Torres et al. used structured support vector regression to jointly model valence and arousal [18]. They used recursive feature elimination and reported RMSEs of around 0.24 and 0.25 for arousal and valence. MAHNOB-HCI [16] is another multi-modal affect recognition data set, used by Koelstra and Patras [10] in a regression task with both EEG and video features. They converted arousal and valence to the range [0-1] and reported mean squared errors (not RMSE) as low as 0.063 for arousal and 0.069 for valence.
2 Preprocessing and Feature Extraction
In our experiments, we preprocess features extracted from EEG and other physiological signals in addition to the associated facial videos. The DEAP dataset consists of three parts, only two of which are used here: 1) the EEG and other physiological signals; 2) the videos of the participants' faces; and 3) the music videos the participants viewed, which are not used in this work. While DEAP includes 32 subjects, facial video data exist for only 22 of them, and these 22 constitute the subset used in this work. Details can be found in [9].
2.1 Physiological Signal Preprocessing
The first 25 seconds and the last second of data were removed. For each subject, each trial was z-score normalized and then test and training partitions were created. A leave-two-out partitioning was employed, with one of the partitions used as a validation set. It is common to down-sample the data to around 1/4 of the original sampling rate [4,6], which unfortunately discards a large amount of data. Instead, we create new trials from the samples that would otherwise have been discarded. Specifically, let 1/n be the fraction of data one wishes to retain per trial. For simplicity, assume each trial consists of only a single signal, s, which is a vector of N samples, and that n evenly divides N. The trial for s is then replaced by n new trials, s1, ..., sn, formed by taking every nth sample at offsets 0, ..., (n - 1). For example, down-sampling s by 1/2 gives two sequences using every second sample, the first starting at s[0] and the other at s[1]. This process increases the number of trials available to train a prediction model, where each resulting trial appears as a down-sampled but unique version of some true trial. This may prevent or decrease over-fitting by decreasing data sparseness. This is similar to a method used in machine vision where multiple instances of an image that have been rotated and/or rescaled are included in training in order to make the model
scale- or rotation-invariant. Each image in that case acts as a noisy version of some true image in the correct scale and position. Our method can be seen as adding slight temporal invariance, which might be important if subjects have similar reactions to the music videos but at slightly different times. Using this scheme, the data were down-sampled to 1/5 of the original rate. For each partition, independent component analysis (ICA) was applied to the set of EEG channels, and ocular artifacts were removed using EEGLAB [5]. A bandpass filter was applied to the EEG channels to obtain the theta (4-8 Hz), slow alpha (8-10 Hz), alpha (8-12 Hz), beta (12-30 Hz), and gamma (30-100 Hz) bands [9]. Galvanic skin response (GSR) was smoothed using a 256-point moving average, as in [9]. The non-EEG physiological channels were bandpass filtered on [0.1-0.2] Hz for skin temperature, on the multiple ranges of [0.01-0.08] Hz, [0.08-0.15] Hz, and [0.15-0.5] Hz for blood pressure, and on [0.05-0.25] Hz and [0.25-5] Hz for the respiration channel. All the original channels, along with the bandpass-filtered channels, were used for feature extraction. For each channel, each trial was split into non-overlapping, roughly equal time segments and the following features were extracted for each segment: the average, standard deviation, and range of the signal and of its first two derivatives. For GSR, the average of the negative values of the first derivative and the proportion of negative to positive derivative values were also computed, as in [9]. The non-overlapping time segments were set empirically to the whole trial, 1/4, 1/10, and 1/20 of the trial.
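As an illustration, the following is a minimal Python/NumPy sketch of the trial-splitting augmentation described above; the function name, array layout, and example values are our own and not taken from the authors' implementation.

```python
import numpy as np


def split_downsample(trial, n):
    """Replace one trial with n offset down-sampled trials.

    trial : array of shape (channels, N), where n evenly divides N
    n     : keep every nth sample, so each new trial has N // n samples
    Returns an array of shape (n, channels, N // n).
    """
    channels, N = trial.shape
    assert N % n == 0, "sketch assumes n evenly divides the number of samples"
    # Offset k keeps samples k, k+n, k+2n, ...; no sample is discarded overall.
    return np.stack([trial[:, k::n] for k in range(n)])


# Example: down-sampling by 1/2 turns one trial into two half-rate trials,
# one starting at sample 0 and the other at sample 1 (the paper uses 1/5).
trial = np.arange(20, dtype=float).reshape(1, 20)   # 1 channel, 20 samples
augmented = split_downsample(trial, 2)
print(augmented.shape)       # (2, 1, 10)
print(augmented[0, 0, :3])   # [0. 2. 4.]
print(augmented[1, 0, :3])   # [1. 3. 5.]
```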
2.2 Video Preprocessing
Videos were preprocessed by finding the bounding square containing the face in each frame, converting to grey-scale, and resizing the image to 64 × 64 pixels. Images were then z-scored over each trial independently. For each partition described above, the training set was divided into classes based on the quadrants formed by splitting valence into high and low and arousal into high and low. Figures 1 and 2 show these class boundaries as red lines. Arousal was split at 4.5 and valence was split at 5.5. These class boundaries were chosen so that the classes represent roughly equal areas of affect space while minimizing the distance of sample points to the class boundaries. Dominance was not considered, as it would have created more classes, thereby decreasing the amount of training data per class. These classes are used to produce four video-based features for each image frame, i.e., the log-likelihood of the image belonging to each of those classes, as determined by trained sum-product networks (SPNs). SPNs, introduced by Poon and Domingos [12], are graphical models in which exact and tractable inference is achieved by placing limitations on the model's structure. In this paper, each SPN represents a distribution over faces in a different quadrant of the two-dimensional affect space. This is done by training four SPN models, one for each of the valence-arousal classes in Figure 1, using the same graph structure as in previous work [12]. The average and standard deviations of the log-likelihood values for each of the four SPNs, over the previously
described time segments, are then used as features in a larger neural-network-based regression engine.

Fig. 1. Class boundaries of the four SPN models, shown as red lines (scatter plot of arousal versus valence across trials).

Fig. 2. The density of arousal and valence, indicating that much of the data falls on integer values in the range [1-9] for both.
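To make the class construction and SPN-derived features more concrete, here is a minimal sketch under our own assumptions: the thresholds come from the text above, but the SPN log-likelihoods are random stand-ins (training SPNs is outside the scope of this snippet), and the single n_segments argument simplifies the multiple segment lengths used in the paper.

```python
import numpy as np


def quadrant_label(valence, arousal, v_split=5.5, a_split=4.5):
    """Map a (valence, arousal) rating to one of the four SPN training classes."""
    return 2 * int(valence >= v_split) + int(arousal >= a_split)  # 0..3


def spn_segment_features(loglik, n_segments):
    """Summarise per-frame SPN log-likelihoods over non-overlapping segments.

    loglik : array of shape (4, frames) -- log-likelihood of each frame under
             each of the four class-conditional SPNs.
    Returns the mean and std per SPN per segment, flattened into one vector.
    """
    feats = []
    for seg in np.array_split(loglik, n_segments, axis=1):
        feats.extend([seg.mean(axis=1), seg.std(axis=1)])
    return np.concatenate(feats)


# Example with random stand-ins for the SPN outputs on a 150-frame trial.
rng = np.random.default_rng(0)
loglik = rng.normal(size=(4, 150))
features = spn_segment_features(loglik, n_segments=4)
print(quadrant_label(valence=7.0, arousal=3.0))  # 2
print(features.shape)                            # (32,) = 4 SPNs * 2 stats * 4 segments
```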
Other features are extracted from the facial videos using principal component analysis (PCA). To extract these features, we calculate the standard deviation of the images within the time segments defined above, after projecting them onto the 1st through 20th principal components. Finally, feature selection is run over each channel by correlating the channel with the variable we wish to estimate (e.g., valence) and retaining the most highly correlated features. Figure 3 shows an overview of the data processing pipeline for inference.
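Below is a minimal sketch of these two steps, assuming the principal components have already been computed elsewhere; the function names, the use of Pearson correlation, and the example sizes are our assumptions rather than details given in the paper.

```python
import numpy as np


def pca_segment_std(frames, components, n_segments):
    """Project flattened frames onto the top principal components and take the
    standard deviation of the projections within each time segment.

    frames     : (T, 64*64) z-scored, flattened grey-scale frames
    components : (64*64, 20) columns are the 1st..20th principal components
    """
    proj = frames @ components                      # (T, 20)
    return np.concatenate([seg.std(axis=0)
                           for seg in np.array_split(proj, n_segments)])


def select_by_correlation(X, y, k):
    """Keep the k features whose absolute Pearson correlation with y is highest."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    r = (Xc.T @ yc) / denom
    keep = np.argsort(-np.abs(r))[:k]
    return keep, X[:, keep]


# Example with random stand-ins for the data.
rng = np.random.default_rng(1)
frames = rng.normal(size=(150, 64 * 64))      # 150 preprocessed video frames
components = rng.normal(size=(64 * 64, 20))   # placeholder principal components
vid_feats = pca_segment_std(frames, components, n_segments=4)
print(vid_feats.shape)                        # (80,)

X = rng.normal(size=(100, 700))               # 100 trials, 700 candidate features
y = rng.uniform(1, 9, size=100)               # e.g. valence ratings
keep, X_sel = select_by_correlation(X, y, k=200)
print(X_sel.shape)                            # (100, 200)
```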
3 Experiments
The affect and favourability dimensions are rounded to whole numbers, since much of the data already falls on integer values, as shown in Figure 2. The RMSE introduced between the rounded values and the original values over all the data (on a 1 to 9 scale) is 0.1849 for valence, 0.1799 for arousal, 0.1763 for dominance, and 0.1798 for favourability. Familiarity is already on a discrete scale of 1 to 5. Linear regression and deep neural networks (DNNs) are used for the regression tasks. For every test partition, a linear regression and a neural network model are created for each affect dimension, favourability, and familiarity. The neural network consists of two hidden layers with sigmoidal activation functions for the interior layers and a softmax output layer. The number of features ranges empirically from 200 to 700, depending on the regressed variable. A combination of Spearmint [15] and hand-tuning is used to set the neural network's hyper-parameters. The same data are used to train both the neural networks that produce our output regressions and the SPNs.
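For illustration, the following is a small NumPy sketch of such a network's forward pass. The hidden-layer sizes, the input dimension of 300, and the reading of the softmax output as a distribution over the nine rounded rating values are our assumptions; the paper does not specify these details.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def forward(x, params):
    """Two hidden sigmoid layers followed by a softmax output layer.
    Here the softmax is read as a distribution over the nine rounded rating
    values (our interpretation); its expectation gives a continuous estimate."""
    h1 = sigmoid(x @ params["W1"] + params["b1"])
    h2 = sigmoid(h1 @ params["W2"] + params["b2"])
    return softmax(h2 @ params["W3"] + params["b3"])


# Example: 300 selected features -> two hidden layers -> 9 output bins (ratings 1..9).
rng = np.random.default_rng(2)
sizes = [300, 128, 64, 9]                      # hidden sizes are placeholders
params = {}
for i, (m, n) in enumerate(zip(sizes[:-1], sizes[1:]), start=1):
    params[f"W{i}"] = rng.normal(scale=0.1, size=(m, n))
    params[f"b{i}"] = np.zeros(n)

x = rng.normal(size=(5, 300))                  # 5 trials
p = forward(x, params)                         # (5, 9) class probabilities
estimate = p @ np.arange(1, 10)                # expected rating on the 1-9 scale
print(p.shape, estimate.shape)                 # (5, 9) (5,)
```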
Fig. 3. Flowchart of the inference pipeline: test data undergoes physiological signal preprocessing and video preprocessing; the video data pass through the four SPNs and PCA; feature reduction is applied; and the result is fed to the neural net.
Baselines for the models are formed by plurality classification. Table 1 presents RMSE means ± standard deviations across the test partitions, in both the [1-9] and the [0-1] ranges, for each affect dimension. Both ranges are given because RMSE depends on the scale of the values, and this allows potential comparison with other work, such as that of Koelstra and Patras [10]. Table 2 shows the RMSE for the favourability and familiarity variables. At α = 0.05, one-sided t-tests were conducted between the neural network results and the baseline, as shown in Table 3.
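As a small worked sketch of these comparisons: rescaling an RMSE from the [1-9] scale to [0-1] divides by the width of the range (8), and a one-sided p-value can be derived from SciPy's two-sided t-test. The per-partition scores below are illustrative placeholders, not the paper's results, and whether the original comparison was paired or unpaired is our assumption.

```python
import numpy as np
from scipy import stats

# Rescaling an RMSE from the [1-9] rating scale to [0-1] divides by the width
# of the range (8); e.g. the baseline arousal RMSE 1.9062 / 8 = 0.2383.
print(round(1.9062 / 8, 4))  # 0.2383

# One-sided test of whether the DNN's per-partition RMSEs are lower than the
# baseline's.  These scores are illustrative placeholders only.
dnn_rmse = np.array([1.45, 1.62, 1.58, 1.70, 1.50, 1.66])
baseline_rmse = np.array([1.85, 1.95, 1.90, 1.92, 1.88, 1.97])
res = stats.ttest_ind(dnn_rmse, baseline_rmse)
t, p_two_sided = res.statistic, res.pvalue
# Convert the two-sided p-value to a one-sided one for H1: DNN < baseline.
p_one_sided = p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2
print(round(float(t), 3), round(float(p_one_sided), 5))
```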
4 Discussion
Our proposed model significantly out-performs the baselines for arousal, dominance, and favourability, and is more accurate, though not significantly so, for valence. Our model is not significantly different from the baseline for familiarity. Linear regression failed to outperform the plurality baseline in all cases. This work acts as a feasibility study for our 'downsampling extension' scheme, and further work is needed to examine its effect. Feature selection is another possible line of inquiry, since we selected features based on linear relations, whereas non-linear relations may be more appropriate for DNNs. Emotion is an essential part of communication and, as computer systems become increasingly sophisticated, emotion recognition could be incorporated into natural language applications. For example, effective speech recognition for emergency response and medical consultation may be improved by accommodating emotional speech [14]. Affect recognition from physiological signals will also be important for applications such as recommendation systems [3], complementing ongoing applications of sentiment analysis. This work provides a multimodal step in that direction.
Table 1. RMSE ± standard deviation in both a [1-9] range of possible values and a [0-1] range across the test partitions for each of the regressed variables in the affect space

Model               | Scale | Arousal         | Valence         | Dominance
Baseline            | [1-9] | 1.9062 ± 0.4051 | 1.7795 ± 0.2462 | 2.6591 ± 0.2557
Baseline            | [0-1] | 0.2383 ± 0.0506 | 0.2224 ± 0.0308 | 0.3324 ± 0.0320
Logistic Regression | [1-9] | 3.8275 ± 0.6686 | 3.5465 ± 0.8793 | 3.9718 ± 0.6367
Logistic Regression | [0-1] | 0.4784 ± 0.0836 | 0.4433 ± 0.1099 | 0.4964 ± 0.0796
Deep Neural Net     | [1-9] | 1.5778 ± 0.5301 | 1.5396 ± 0.8797 | 2.0181 ± 0.4489
Deep Neural Net     | [0-1] | 0.1972 ± 0.0663 | 0.1925 ± 0.1100 | 0.2523 ± 0.0561
Table 2. RMSE ± standard deviation in both a [1-9] range and a [0-1] range across the test partitions for favourability and familiarity

Model               | Scale | Favourability   | Familiarity
Baseline            | [1-9] | 2.7618 ± 0.3995 | 2.0072 ± 0.4065
Baseline            | [0-1] | 0.3452 ± 0.0499 | 0.5018 ± 0.1016
Logistic Regression | [1-9] | 4.0557 ± 0.5482 | 2.3883 ± 0.31994
Logistic Regression | [0-1] | 0.5070 ± 0.0685 | 0.5971 ± 0.0800
Deep Neural Net     | [1-9] | 2.2366 ± 0.5975 | 2.0432 ± 0.4010
Deep Neural Net     | [0-1] | 0.2796 ± 0.0747 | 0.5108 ± 0.1002
Table 3. Results of one-sided t-tests conducted between the neural network results and the baseline, specifically the t-statistic, p-value, and confidence interval

      | Arousal  | Valence  | Dominance | Favourability | Familiarity
t(40) | -2.2557  | -1.2034  | -5.6866   | -3.3486       | 0.2887
P     | 0.014814 | 0.11795  | 0.000007  | 0.0008896     | 0.61285
CI    | -0.08326 | 0.09578  | -0.45122  | -0.26111      | 0.24577
Acknowledgments. This paper makes use of the following software: Deep Learning Toolbox [11], Spearmint [15], EEGLAB [5], and modified SPN code from [12].
References

1. Acar, E.: Learning representations for affective video understanding. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 1055–1058. ACM (2013)
2. Acar, E., Hopfgartner, F., Albayrak, S.: Understanding affective content of music videos through learned representations. In: Gurrin, C., Hopfgartner, F., Hurst, W., Johansen, H., Lee, H., O'Connor, N. (eds.) MMM 2014, Part I. LNCS, vol. 8325, pp. 303–314. Springer, Heidelberg (2014)
3. Chen, Y.A., Wang, J.C., Yang, Y.H., Chen, H.: Linear regression-based adaptation of music emotion recognition models for personalization. In: Acoustics, Speech and Signal Processing (ICASSP) (2014)
4. Chung, S.Y., Yoon, H.J.: Affective classification using Bayesian classifier and supervised learning. In: 2012 12th International Conference on Control, Automation and Systems (ICCAS), pp. 1768–1771. IEEE (2012)
5. Delorme, A., Makeig, S.: EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods 134(1), 9–21 (2004)
6. Garcia, H.F., Orozco, A.A., Alvarez, M.A.: Dynamic physiological signal analysis based on Fisher kernels for emotion recognition. In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 4322–4325. IEEE (2013)
7. Jie, X., Cao, R., Li, L.: Emotion recognition based on the sample entropy of EEG. Bio-Medical Materials and Engineering 24(1), 1185–1192 (2014)
8. Jirayucharoensak, S., Pan-Ngum, S., Israsena, P.: EEG-based emotion recognition using deep learning network with principal component based covariate shift adaptation. The Scientific World Journal 2014 (2014)
9. Koelstra, S., Muhl, C., Soleymani, M., Lee, J.S., Yazdani, A., Ebrahimi, T., Pun, T., Nijholt, A., Patras, I.: DEAP: A database for emotion analysis using physiological signals. IEEE Transactions on Affective Computing 3(1), 18–31 (2012)
10. Koelstra, S., Patras, I.: Fusion of facial expressions and EEG for implicit affective tagging. Image and Vision Computing 31(2), 164–174 (2013)
11. Palm, R.B.: Prediction as a candidate for learning deep hierarchical models of data. Technical University of Denmark (2012)
12. Poon, H., Domingos, P.: Sum-product networks: a new deep architecture. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 689–690. IEEE (2011)
13. Russell, J.A.: A circumplex model of affect. Journal of Personality and Social Psychology 39(6), 1161 (1980)
14. Shneiderman, B.: The limits of speech recognition. Communications of the ACM 43(9), 63–65 (2000)
15. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)
16. Soleymani, M., Lichtenauer, J., Pun, T., Pantic, M.: A multimodal database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing 3(1), 42–55 (2012)
17. Torres, C.A., Orozco, A.A., Alvarez, M.A.: Feature selection for multimodal emotion recognition in the arousal-valence space. In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 4330–4333. IEEE (2013)
18. Torres-Valencia, C.A., Alvarez, M.A., Orozco-Gutierrez, A.A.: Multiple-output support vector machine regression with feature selection for arousal/valence space emotion assessment. In: 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 970–973. IEEE (2014)
19. Wang, D., Shang, Y.: Modeling physiological data with deep belief networks. International Journal of Information and Education Technology 3 (2013)
20. Wichakam, I., Vateekul, P.: An evaluation of feature extraction in EEG-based emotion prediction with support vector machines. In: 2014 11th International Joint Conference on Computer Science and Software Engineering (JCSSE), pp. 106–110. IEEE (2014)