
Electroencephalography Based Fusion Two-Dimensional (2D)-Convolution Neural Networks (CNN) Model for Emotion Recognition System

Yea-Hoon Kwon, Sae-Byuk Shin and Shin-Dug Kim *

Department of Computer Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Korea; [email protected] (Y.-H.K.); [email protected] (S.-B.S.)
* Correspondence: [email protected]; Tel.: +82-02-2123-2718

Received: 20 March 2018; Accepted: 26 April 2018; Published: 30 April 2018

 

Abstract: The purpose of this study is to improve human emotional classification accuracy using a convolution neural network (CNN) model and to suggest an overall method to classify emotion based on multimodal data. We improved classification performance by combining electroencephalogram (EEG) and galvanic skin response (GSR) signals. GSR signals are preprocessed using the zero-crossing rate. Sufficient EEG feature extraction can be obtained through CNN. Therefore, we propose a suitable CNN model for feature extraction by tuning hyperparameters in the convolution filters. The EEG signal is preprocessed prior to convolution by a wavelet transform while considering time and frequency simultaneously. We use the Database for Emotion Analysis using Physiological Signals open dataset to verify the proposed process, achieving 73.4% accuracy and showing significant performance improvement over the current best practice models.

Keywords: EEG; GSR; emotion recognition; deep learning; convolution neural networks; pattern recognition; hybrid neural network

1. Introduction

Multimodal human and computer interaction (HCI) has been actively researched over the last few years. One outstanding issue is affective computing: designing devices that communicate with humans by interpreting emotions. Emotion recognition has been attracting attention as a next-generation technology in many fields, from the development of humanistic robots to consumer analysis and safe driving.

Most previous research has classified emotions using only facial expressions. However, facial expressions represent only part of the overall human emotional response, and emotion discriminators can sometimes make significant mistakes, for example classifying an athlete's image as displaying happy emotion when the smiling athlete is actually nervous prior to an important game [1]. On the other hand, biological signals from the central (CNS) and peripheral (PNS) nervous systems are hard for humans to control mentally, and can accurately represent emotions. Previous studies have shown that changes in skin signals (i.e., galvanic skin response (GSR)) are closely related to changes in peripheral nerves accompanying emotional changes [2], and electroencephalogram (EEG) signals from the frontal lobe are strongly related to emotional changes [3,4]. Therefore, the current study classified emotions using biological signals, including EEG and GSR.

Electroencephalogram signals are used in brain computer interface research, measuring brain electrical activity using electrodes attached to the scalp. However, reduced accuracy due to EEG signal instability remains a major problem, and EEG signals can be untrustworthy even when expensive and reliable equipment is employed. The solution is to use as many heterogeneous sensors as possible to provide reliable multimodal data.


Therefore, we designed a data adaptive CNN model to improve the emotion classification accuracy, reducing current model instabilities, using EEG and GSR data. We also implemented effective spectrogram feature extraction and designed a multimodal classifier that takes two features as input at the first layer of a fully connected network.

This paper is organized as follows. Section 2 discusses previous research methodologies and results. Section 3 discusses the current paper's main contributions, including the details of label processing, EEG signal transformation, and GSR data feature extraction, and introduces the proposed CNN model architecture and training strategy. Section 4 analyzes the results and compares them with the current best practice models. Finally, Section 5 summarizes and concludes the paper.

2. Related Work

The related research fields of emotion classification and EEG preprocessing have achieved remarkable results. In general, preprocessing EEG data consists of selecting data while considering the frequency and the location within the brain. The fast Fourier transform (FFT) is the most common frequency analysis method for raw EEG data [5–9], and it was adopted here to extract EEG features. However, FFT cannot reflect temporal information in the frequency data, requiring additional methods to recognize emotions over time. Therefore, the short time Fourier transform (STFT), which can express frequency content over time [10–13], was also used to analyze EEG signals.

Classifying EEG features by frequency is the most common method to differentiate alpha, beta, theta, and gamma waves. Liu et al. [14] presented a table of emotions by frequency and electrode location within the brain region.
Figure thepresented location ofathe electrodes that are to the using location within the brain region. Figure 1 shows the location of the electrodes that are attached to the the 10-20 system, which is the international standard. Electrodes F3 and F4 distinguish between negative scalppositive using the 10-20 system, is and the international standard. and F4surrounding distinguish and emotional states, which and AF3 AF4 distinguish positiveElectrodes emotions F3 from the between negative and positive emotional states, and AF3 andAF4 distinguish positive emotions emotions. Wavelet analysis is one of the best ways to express frequency and time, and has also from been the surrounding emotions. Wavelet analysis is one of the best ways to express frequency and time, employed in EEG classification [15–18]. and has also been employed in EEG classification [15–18].

Figure 1. The 10-20 international standard system and the locations of the electrodes.

Various previous studies considered emotional classification methods. Mollahosseini et al. [19] designed a CNN based face recognition module. Pons et al. [20] enhanced facial image classification performance by supervised hierarchical learning. Ding et al. [21] performed deep face recognition based on a two-step model. Poria et al. [22] implemented multimodal visual and audio data analysis beyond the focus on text-based emotional analysis. They also succeeded in feature fusion through deep learning based heterogeneous data dimension reduction.


The Database for Emotion Analysis using Physiological signals (DEAP) dataset has been widely employed for emotion classification models using biomedical signals. Koelstra et al. [23] used the DEAP data set to classify PNS and CNS sensor data, and measured the emotional classification performance. Liu and Sourina [24] studied EEG valence levels for real-time applications. Naser et al. [25] predicted emotions extracted from music videos. Chen et al. [26] applied ontology and data mining techniques for EEG based emotion analysis. Bayesian networks, unsupervised deep learning, and deep belief networks have also been applied [27–29].

3. Methods

3.1. Multiple Label Classification

A label was constructed using the self-assessment values provided in the DEAP dataset, including valence, arousal, dominance, liking, and familiarity. Emotional states are typically evaluated using arousal and valence, and are divided into four sections: high arousal, high valence (HAHV); high arousal, low valence (HALV); low arousal, low valence (LALV); and low arousal, high valence (LAHV) [30], as shown in Figure 2. Thus, emotional states can be classified according to arousal and valence levels.

Figure 2. Arousal-valence two-dimensional plane.

Labeling was based on a threshold value for the two-dimensional (2D) plane. We implemented k-means clustering on the self-assessed arousal and valence levels to find the most appropriate threshold. Previous studies have employed one-hot encoding for labeling as a 2D vector, i.e., [HV, LV] and [HA, LA], using k-means clustering with k = 2 [31]. Therefore, we performed independent valence and arousal classifications in order to compare with previous models.

However, independent classification fails to consider arousal and valence correlations, and since the data are arousal and valence levels rather than emotion labels, it cannot be used for end-to-end learning, because the output must still be mapped onto the two-dimensional (2D) plane (Figure 2) for emotion judgment.


Therefore, we propose k-means clustering with k = 4 to provide a four-dimensional (4D) label vector. Figure 3 compares clustering for k = 2 and k = 4. Point (5, 5) is the approximate cluster center mean for both k = 2 and k = 4; hence, we use (5, 5) as the threshold. Thus, labeling included 2D and 4D vectors through one-hot encoding for learning.

Figure 3. K-means clustering results of arousal-valence self-assessment data: (a) clustered arousal-valence data when k = 2; (b) clustered arousal-valence data when k = 4.
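For illustration, the following sketch (not from the original paper) reproduces this labeling step with scikit-learn and NumPy. The array `ratings` and the random values filling it are hypothetical placeholders for the DEAP self-assessment scores, and the fixed threshold of 5 follows the discussion above.

```python
import numpy as np
from sklearn.cluster import KMeans

# ratings: (n_trials, 2) array of self-assessed [valence, arousal] scores in [1, 9]
# (hypothetical array; in practice these come from the DEAP self-assessment files)
ratings = np.random.uniform(1, 9, size=(1280, 2))

# k-means with k = 2 and k = 4 to locate the natural split of the rating plane
centers_k2 = KMeans(n_clusters=2, random_state=0).fit(ratings).cluster_centers_
centers_k4 = KMeans(n_clusters=4, random_state=0).fit(ratings).cluster_centers_

# Both clusterings place the boundary near (5, 5), so a fixed threshold of 5
# is used for the final labels, as described in the text.
threshold = 5.0
high_valence = ratings[:, 0] >= threshold
high_arousal = ratings[:, 1] >= threshold

# 2D one-hot labels for the two independent binary problems: [HV, LV] and [HA, LA]
valence_labels = np.stack([high_valence, ~high_valence], axis=1).astype(np.float32)
arousal_labels = np.stack([high_arousal, ~high_arousal], axis=1).astype(np.float32)

# 4D one-hot labels for the joint problem: [HAHV, HALV, LAHV, LALV]
quadrant = (~high_arousal) * 2 + (~high_valence)   # 0: HAHV, 1: HALV, 2: LAHV, 3: LALV
four_class_labels = np.eye(4, dtype=np.float32)[quadrant]
```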

3.2. EEG Signal Transformation to Time and Frequency Axes

The data was preprocessed to reflect EEG temporal and frequency characteristics. Since the EEG data measuring human emotions are time series data, time information must be reflected in the frequency data. Although the STFT has been widely used to add time information to frequency data [10–13], it has disadvantages for time-frequency analysis, in that temporal resolution decreases as the window size increases, and frequency resolution decreases as the window size decreases.

Therefore, we propose using a wavelet transform to represent the frequency axis, using the open toolbox EEGLAB. The extracted spectrogram was 200 × 42 pixels (width × height), where the width (200 pixels) represents time and the height (42 pixels) represents EEG sensor frequency (4.0–45 Hz), as shown in Figure 4. The total transformed data comprise 40,960 spectrograms. The batch size for training is 32 spectrograms, corresponding to the 32 electrodes derived from one stimulus. Therefore, the total amount of data used in this study is 1280, with data labels as shown in Tables 1 and 2.

Figure 4. Wavelet transformed spectrogram for each electrode.
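The paper performs the wavelet transform with the EEGLAB toolbox; as a rough Python stand-in (an assumption, not the authors' code), the sketch below builds a 42 × 200 power spectrogram for a single electrode with PyWavelets, using a complex Morlet wavelet and the 4.0–45 Hz range quoted above. The wavelet name, the time-axis averaging scheme, and the random test signal are illustrative choices only.

```python
import numpy as np
import pywt

fs = 128.0                                   # DEAP preprocessed sampling rate (Hz)
freqs = np.linspace(4.0, 45.0, 42)           # 42 frequency bins, 4-45 Hz
dt = 1.0 / fs

# Map the target frequencies to CWT scales for a complex Morlet wavelet.
wavelet = 'cmor1.5-1.0'
central = pywt.central_frequency(wavelet)
scales = central / (freqs * dt)

def eeg_to_spectrogram(channel_signal, n_time_bins=200):
    """Wavelet power spectrogram for one electrode, reduced to 42 x n_time_bins."""
    coeffs, _ = pywt.cwt(channel_signal, scales, wavelet, sampling_period=dt)
    power = np.abs(coeffs) ** 2                          # shape (42, n_samples)
    # Average the time axis down to the fixed width used as CNN input.
    idx = np.array_split(np.arange(power.shape[1]), n_time_bins)
    return np.stack([power[:, i].mean(axis=1) for i in idx], axis=1)

# Example: one 60 s trial of one electrode (hypothetical random data).
trial = np.random.randn(int(60 * fs))
spectrogram = eeg_to_spectrogram(trial)                  # shape (42, 200)
```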

Conventional EEG based emotion classification analyzes the degree of activity in a specific area of the brain (e.g., the frontal lobe), using electrodes attached to the head close to the frontal lobe and some other lobes (e.g., AF3, AF4, P7). Frequency bands for specific electrodes were typically subdivided into alpha, beta, gamma, etc. waves to allow for simple and shallow classification models, such as support vector machines (SVMs).


However, sensor selection and subdivision ignores emotion related signal changes in other brain regions. Recent advanced deep learning techniques can improve emotional analysis accuracy by incorporating all sensor data for each experiment.

Table 1. The number of extracted wavelet transformed data for the four types of labels.

Label 1      Data Quantity
HAHV         458
HALV         294
LAHV         255
LALV         273
Total        1280

1 H: high, L: low, V: valence, A: arousal, e.g., HAHV: high arousal, high valence.

Table 2. The number of extracted wavelet transformed data for the two types of labels.

Label 1      Data Quantity      Label 2      Data Quantity
HA           752                HV           713
LA           528                LV           567
Total        1280               Total        1280

1, 2 H: high, L: low, V: valence, A: arousal, e.g., HA: high arousal.

3.3. GSR Preprocessing Using Short Time Zero Crossing Rate

To extract the feature, we divide the GSR waveform into defined windows and calculate the short time zero crossing rate (STZCR), i.e., the number of times the signal crosses zero within a given window. That is, we intend to use the change in amplitude of the GSR as the input feature vector for deep learning. STZCR indicates the rate of signal change,

STZCR = (1/N) Σ_n [ |sig{s(n)} − sig{s(n − 1)}| / 2 ] · w(m − n)    (1)

where N is the length of the sampled signal and w represents the window. We highlighted features using the extracted zero crossing rate vector with threshold

T = Σ GSR_stzcr / N_stzcr    (2)

where GSR_stzcr is a vector column and N_stzcr is the number of vectors. If the data is greater than the threshold, it outputs the maximum value; otherwise, it outputs zero. GSR amplitude is generally sensitive to arousal changes and less sensitive to valence changes; hence, it can positively affect the EEG features to focus on arousal in the classifier model.
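A minimal NumPy sketch of Equations (1) and (2) is shown below. It assumes the GSR trace has already been detrended so that it actually crosses zero, and the window size of 128 samples is a hypothetical choice, not a value taken from the paper.

```python
import numpy as np

def short_time_zero_crossing_rate(gsr, window_size):
    """Short time zero crossing rate (Equation (1)) over non-overlapping windows."""
    # Sign changes between consecutive samples: |sgn(s(n)) - sgn(s(n-1))| / 2
    crossings = np.abs(np.diff(np.sign(gsr))) / 2.0
    n_windows = len(crossings) // window_size
    crossings = crossings[:n_windows * window_size].reshape(n_windows, window_size)
    return crossings.sum(axis=1) / window_size          # rate per window

def threshold_feature(stzcr):
    """Highlight the STZCR vector with the mean threshold of Equation (2)."""
    t = stzcr.sum() / len(stzcr)
    return np.where(stzcr > t, stzcr.max(), 0.0)         # max if above threshold, else 0

# Example on a hypothetical detrended GSR trace sampled at 128 Hz for 60 s.
gsr = np.random.randn(128 * 60)
feature = threshold_feature(short_time_zero_crossing_rate(gsr, window_size=128))
```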

3.4. Fusion Convolution Neural Network Model for EEG Spectrograms and GSR Features

Many neural networks have been developed for classification in recent studies. The first thing to consider when designing a CNN is the data characteristics. Therefore, we designed the CNN to use the spectrogram images from the wavelet transformation of all the channels. Tabar and Halici [32] considered CNN classification problems using EEG spectrograms, and designed a single layer CNN using one-dimensional filtering to provide good classification performance based on motor imagery EEG signals. However, a single filtering through the single convolutional layer does not efficiently extract features for emotion classification, since it is not deep enough to extract emotion data.


Therefore, we propose a neural network based on the extracted data described above, which allows for deep convolution layers while also reflecting temporal effects, as shown in Figure 5. We first normalized the data, making the cost function a spherical contour and helping to increase the learning rate. We then designed a deep convolution layer that reflects time, using a 3 × 2 filter rather than conventional square filters, such as 2 × 2 or 3 × 3. The spectrogram frequency content over time can be reflected by increasing the filter height. Since the filter is a feature identifier that extracts information from the manifold state, the shape of the filter is related to the content of the feature to be extracted from the receptive field. Our proposed filter can identify data in a region that is vertically longer than a square filter. Thus, the data containing the vertical meaning is repeatedly transmitted to the input of the next layer. As a result, the frequency content over time of the spectrogram image can be learned in the CNN. Setting stride = [2, 1] with no padding, the filter can be extracted based only on the image time base. We used a fully connected layer for the final classification. The classifier is trained on the spectrogram features of the 32 electrodes extracted through the CNN. In continuous training, the classifier learns similar patterns extracted from the 32 individual electrodes, which can be classified as a label through the last softmax layer. The entire model consists of four convolutional layers and seven fully connected layers.
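The following Keras sketch illustrates the kind of architecture described here: four convolutional layers with tall 3 × 2 filters and stride [2, 1], a final 2 × 2 pooling layer, concatenation with the GSR feature, and a seven-layer fully connected stack ending in softmax. The filter counts, fully connected widths, GSR feature length, and the unit stride of the last convolution are assumptions, since the paper does not list these values.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fusion_model(n_classes=4, gsr_dim=128):
    """Sketch of the fusion CNN: four convolution layers over the EEG spectrogram,
    GSR features concatenated before the fully connected stack (sizes are assumed)."""
    eeg_in = layers.Input(shape=(42, 200, 1), name='eeg_spectrogram')  # frequency x time
    gsr_in = layers.Input(shape=(gsr_dim,), name='gsr_feature')

    x = eeg_in
    # Assumed filter counts; the last convolution keeps stride 1 so a 2 x 2 pooling fits.
    conv_cfg = [(32, (2, 1)), (64, (2, 1)), (128, (2, 1)), (256, (1, 1))]
    for i, (n_filters, strides) in enumerate(conv_cfg):
        # Tall 3 x 2 filters: 3 steps along frequency, 2 along time.
        x = layers.Conv2D(n_filters, kernel_size=(3, 2), strides=strides,
                          padding='valid')(x)
        if i < len(conv_cfg) - 1:           # batch normalization except after the last conv
            x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)    # final 2 x 2 pooling
    x = layers.Flatten()(x)

    x = layers.Concatenate()([x, gsr_in])           # fuse EEG features with the GSR vector
    for units in (1024, 512, 256, 128, 64, 32):     # assumed widths: six hidden FC layers
        x = layers.Dense(units, activation='relu')(x)
    out = layers.Dense(n_classes, activation='softmax')(x)  # seventh FC layer: 2 or 4 classes
    return tf.keras.Model([eeg_in, gsr_in], out)

model = build_fusion_model()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```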

Figure 5. Proposed convolution neural network combining electroencephalogram (EEG) and wavelet transformed galvanic skin response (GSR).

Batch normalization [33] was performed before each value was passed to the activation function, except for the last convolutional layer, in order to prevent the model gradient vanishing during training. It has the effect of preventing internal covariate shift by reducing the activation function variation caused by the previous layer's variation. Batch normalization was implemented as follows.

(1) Normalize the batch data using the batch mean, μ_β, and variance, σ_β²,

x̂_i = (x_i − μ_β) / √(σ_β² + ε)    (3)

(2) Use the γ and β values for scale and shift operations,

y_i = γ x̂_i + β    (4)
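As a compact illustration of Equations (3) and (4), the NumPy sketch below computes the batch normalization forward pass; the running averages used at test time and the backward pass are omitted, and the batch and feature sizes are arbitrary.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Equations (3)-(4): normalize a batch, then scale and shift with learned gamma, beta."""
    mu = x.mean(axis=0)                      # batch mean, mu_beta
    var = x.var(axis=0)                      # batch variance, sigma_beta^2
    x_hat = (x - mu) / np.sqrt(var + eps)    # Equation (3)
    return gamma * x_hat + beta              # Equation (4)

# Example: a batch of 32 activations with 8 features (hypothetical sizes).
x = np.random.randn(32, 8)
y = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
```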


Updating γ and β by training allows the CNN to better reflect the model characteristics in the normalized variables, rather than using simple normalization, such as whitening. Testing uses the average γ and β obtained during training. Feature maps are generated as the image passes through each convolution layer. The layer activation function is a rectified linear unit (ReLU), which sets the part of the linear function y = x where x < 0 to zero,

ReLU(u) = max(u, 0) = { u, if u > 0; 0, otherwise }    (5)

The ReLU function is computationally efficient because its activation is not restricted to [−1, 1], as for the hyperbolic tangent function, but is used as it is. Therefore, training speed for large spectrogram images is increased, and outputting zero prevents overfitting due to training many weights, hence a training regularization effect can be expected. After passing through the final 2 × 2 pooling layer, the image is flattened and combined with the GSR feature. To positively influence the EEG classification performance, the GSR data uses the data average as the threshold to remove noise. It also reduces the computation burden for training the fully connected network by transmitting a zero value to each neuron's perceptron. The final layer returns the model's probability distribution using softmax, and performs classification by changing the number of output neurons according to the experimental environment, such as [HV, LV], [HA, LA], or [HAHV, HALV, LALV, LAHV].

3.5. Training Strategy

We use maximum likelihood estimation (MLE) in order to train the proposed CNN model. MLE maximizes P(Y | X; θ) by optimizing θ in the probability model for a given data point X and label Y. Cross entropy is the most commonly used MLE loss function, and it calculates the difference between two probability distributions. Let p(x) be the actual and q(x) the predicted probability distribution for the label. Then, the cross entropy, L(p, q), is

L(p, q) = −∫ p(x) · ln q(x) dx    (6)

CNN training proceeds by back propagation using gradient descent. We update the weights using the partial derivative of the cross entropy loss L with respect to the weight matrix W,

W = W − γ · ∂L/∂W    (7)

where z_j = Σ_i w_ij o_i + b is the sum of inner products, and we calculate the gradient as

∂L/∂w_ij = (∂L/∂p(z_j)) · (∂p(z_j)/∂z_j) · (∂z_j/∂w_ij)    (8)

where ∂L/∂p(z_j) is the magnitude of the influence of the function p on L, and p(z_j) is the softmax result.

Generally, to find the optimal training point, we find the bias-variance trade-off point using the validation loss, as shown in Figure 6 for the four class case.
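In practice this optimal-point search can be automated by monitoring validation loss during training. The hedged sketch below reuses the `build_fusion_model` sketch above and stands in for the paper's procedure with Keras' categorical cross entropy loss and an early stopping callback; all data arrays, the optimizer settings, and the patience value are placeholders, not values from the paper.

```python
import numpy as np
import tensorflow as tf

# Hypothetical placeholder arrays standing in for the wavelet spectrograms,
# thresholded GSR vectors, and one-hot labels produced by the earlier steps.
train_eeg = np.random.randn(256, 42, 200, 1).astype('float32')
train_gsr = np.random.randn(256, 128).astype('float32')
train_labels = np.eye(4, dtype='float32')[np.random.randint(0, 4, 256)]
val_eeg, val_gsr, val_labels = train_eeg[:64], train_gsr[:64], train_labels[:64]

model = build_fusion_model(n_classes=4)              # from the earlier sketch
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              loss='categorical_crossentropy',       # cross entropy, Equation (6)
              metrics=['accuracy'])

# Stop once validation loss stops improving, mirroring the Figure 6 search for the
# optimal training point, and keep the best weights for testing.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                              restore_best_weights=True)
history = model.fit([train_eeg, train_gsr], train_labels,
                    validation_data=([val_eeg, val_gsr], val_labels),
                    batch_size=32, epochs=100, callbacks=[early_stop])
```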

Sensors 18, x FOR PEER REVIEW Sensors 2018,2018, 18, 1383

8 of 13 8 of 13

Figure 6. Four class loss to find the optimal training point.

After 400 iterations, the validation loss increases, whereas the training loss continues to decrease. Thus, we can conclude that the model becomes over-fitted beyond 400 iterations, giving the optimal training point. Test data should be applied at this level of iteration to measure model accuracy.

4. Results and Discussion

4.1. Experiment Environment

Table 3 shows the hardware and framework specifications for the experiment.

Table 3. Hardware specifications.

CPU           Intel Core i5-6600
GPU           NVIDIA GeForce GTX 1070, 8 GBytes
RAM           DDR4, 16 GBytes
OS            Ubuntu 16.04
Frameworks    TensorFlow 1.3; MATLAB / EEG toolbox

4.2. Dataset

The DEAP dataset [23] was used to provide bio-signal data, containing CNS and PNS data. PNS data comprised GSR, skin temperature, respiration, blood volume (by plethysmograph), and electrooculogram (EOG). GSR data was the skin resistance of the middle finger and forefinger; skin temperature and breathing change with emotion, including body tension and irritating fear. The plethysmograph measured blood flow changes in the finger. The EOG signal was measured by eye blinking, which is related to anxiety. CNS data was the EEG signal.

Data were collected from 32 subjects for 1 min for each of 40 selected music videos. Data were recorded on 48 channels with a 512 Hz sampling frequency. We used the preprocessed data version in MATLAB and numpy formats provided by the DEAP dataset, down-sampled to 128 Hz with a 4.0–45 Hz band pass filter applied.

4.3. Performance Analysis

In this section, we analyzed the performance of the model in two ways. In the first evaluation, we analyzed the classification performance for each label using hold-out validation. To construct the hold-out validation set, test, verification, and learning datasets were created in a 1:1:9 ratio for each label, with batch size = 32 to reflect the data from one stimulus. Second, for the LOOCV, we constructed the dataset so that the data measured for one video was the test set and the remaining video data was the training set. The DEAP dataset consists of a data set for 40 videos per participant; in other words, the second evaluation was performed with 39 video stimuli as the training dataset, and the data extracted from the remaining stimulus was used as the test dataset.
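A minimal sketch of the leave-one-video-out split described above is given below; `video_ids` and the way it is built are hypothetical stand-ins for however the stimulus index is stored alongside the extracted features.

```python
import numpy as np

def leave_one_video_out_splits(n_videos=40):
    """Yield (training video ids, held-out video id): 39 stimuli train, 1 stimulus tests."""
    for test_video in range(n_videos):
        train_videos = [v for v in range(n_videos) if v != test_video]
        yield train_videos, test_video

# Hypothetical per-sample stimulus index aligned with the spectrogram/GSR/label arrays.
video_ids = np.repeat(np.arange(40), 32)

for train_videos, test_video in leave_one_video_out_splits():
    train_mask = np.isin(video_ids, train_videos)
    test_mask = video_ids == test_video
    # train_mask / test_mask select the rows used to train and evaluate a fresh model.
    print(f'video {test_video}: {int(train_mask.sum())} train / {int(test_mask.sum())} test samples')
```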


The desired ideal model would accurately distinguish data patterns and generalize them even when testing data are considered; i.e., we want to find a model between over- and under-fitting. The proposed model does not apply L2 regularization to prevent overfitting, because there is a batch normalization layer. In addition, cross entropy loss was measured for each iteration to find the optimal training point, as shown in Figure 6. Table 4 shows the predicted accuracy for label based and video based classification using each validation method.

Table 4. Emotion classification accuracy.

                          Results A 1                             Results B 2
Classification Method     Arousal 3   Valence 3   Four Class 4    Arousal 3   Valence 3   Four Class 4
Proposed fusion model     0.7812      0.8125      0.7500          0.7656      0.8046      0.7343

1 Results A: label based classification using hold-out validation. 2 Results B: video based classification using leave-one-out cross validation. 3 Two class classification accuracy; Arousal: HA, LA; Valence: HV, LV. 4 Four class classification accuracy: HAHV, HALV, LALV, LAHV.

4.4. Comparison with Existing Models

We used the two class labels that were commonly adopted in previous studies to compare performance, as measured by arousal and valence classification accuracy for the DEAP dataset. Table 5 shows the performance compared with the existing models measured using the same dataset. The performance of our model is shown by the result of LOOCV in Section 4.3, to validate the generalized performance of the model.

Table 5. Two class classification performance.

Model                                        Arousal Accuracy    Valence Accuracy
CNS feature based single modality [23]       0.6200              0.5760
PNS feature based single modality [23]       0.5700              0.6270
Liu and Sourina [24]                         0.7651              0.5080
Naser and Saha [25]                          0.6620              0.6430
Chen et al. [26]                             0.6909              0.6789
Yoon and Chung [27]                          0.7010              0.7090
Li et al. [29]                               0.6420              0.5840
Wang and Shang [28]                          0.5120              0.6090
Proposed fusion CNN model                    0.7656              0.8046

The considered methods used a variety of approaches: Koelstra et al. [23] used CNS and PNS sensors; Liu and Sourina [24] used a fractal algorithm to reflect signal complexity based on a threshold value; Naser and Saha [25] extracted features using a dual-tree complex wavelet transform and used SVM for classification; Chen et al. [26] used decision trees; Yoon and Chung [27] used Bayesian and perceptron convergence; and Wang and Shang [28] and Li et al. [29] used deep belief networks to automatically extract features and to classify them. Figure 7 shows that the proposed model has better performance than all of the compared models.


Figure 7. Arousal and valence classification accuracy.

Although EEG data is easier to classify into two classes [34], increasing the number of classes not only enables end-to-end learning, but also includes the correlations between arousal and valence. Therefore, we compared the proposed model performance against previous four class models. Generally, when data quantity is limited, model accuracy decreases as the number of labels to be classified increases. The performance of our model is shown by the result of LOOCV in Section 4.3, in order to validate the generalized performance of the model. Table 6 shows that the proposed model has high performance when compared to the current models.

Table 6. Four class classification performance.

Model                             Accuracy
Zubair and Yoon [35]              0.4540
Jadhav et al. [36]                0.4625
Hatamikia et al. [37]             0.5515
Martínez-Rodrigo et al. [38]      0.7250
Zhang et al. [39]                 0.7162
Mei et al. [40]                   0.7310
Proposed fusion CNN model         0.7343

A variety of approaches were employed in the comparison models: Zubair and Yoon [35] used a discrete wavelet transform, and also applied the mRMR algorithm to enhance the feature correlations; Jadhav et al. [36] extracted EEG features using the gray level co-occurrence matrix, and classified emotion using K-nearest neighbor; Hatamikia et al. [37] used nonlinear feature extraction and self-organizing map (SOM) classification; Martínez-Rodrigo et al. [38] extracted biological signal features using quadratic sample entropy, performed feature selection, and classified the extracted features by SVM; Zhang et al. [39] used wavelet feature extraction based on a smoothed pseudo Wigner-Ville distribution and classification using SVM; and Mei et al. [40] extracted features by constructing a connection matrix of the brain structure, with subsequent classification using CNN. Figure 8 shows a bar graph indicating that the proposed model has higher performance than the current models.



Figure 8. Four class classification accuracy.

5. Conclusions

This study devised data labeling according to emotion criteria, and proposed a data preprocessing methodology to increase emotional classification performance. Emotion classification was performed using single and multiple sensor based models. Particular focus was overall CNN filter analysis and design according to input data characteristics, and noise removal for data processing.

Feature extraction performance was remarkably improved through the proposed filter design, providing significantly improved classification performance when compared with previous models. This study paves the way for combining data and designing corresponding deep learning models. Future research directions will investigate further changes to the emotion analysis framework, such as combining multiple neural networks. One approach would be to improve the concatenation of simple convolution layers. It may be possible to construct convolution layers for each data characteristic and improve the classification performance using multiple convolution layers.

Author Contributions: Yea-Hoon Kwon and Shin-Dug Kim conceived and designed the experiments; Yea-Hoon Kwon performed the experiments; Yea-Hoon Kwon and Sae-Byuk Shin analyzed the data; Yea-Hoon Kwon and Sae-Byuk Shin wrote the paper.

Funding: This work was supported by The Institute for Information & Communications Technology Promotion funded by the Korean Government (MSIP) (R0124-16-0002, Emotional Intelligence Technology to Infer Human Emotion and Carry on Dialogue Accordingly).

Conflicts of Interest: The authors declare no conflict of interest.

References


1. Gunadi, I.G.A.; Harjoko, A.; Wardoyo, R.; Ramdhani, N. Fake smile detection using linear support vector machine. In Proceedings of the IEEE International Conference on Data and Software Engineering (ICoDSE), Yogyakarta, Indonesia, 25–26 November 2015; pp. 103–107.
2. Wu, G.; Liu, G.; Hao, M. The analysis of emotion recognition from GSR based on PSO. In Proceedings of the IEEE International Symposium on Intelligence Information Processing and Trusted Computing (IPTC), Huanggang, China, 28–29 October 2010; pp. 360–363.
3. Allen, J.J.B.; Coan, J.A.; Nazarian, M. Issues and assumptions on the road from raw signals to metrics of frontal EEG asymmetry in emotion. Biol. Psychol. 2004, 67, 183–218.
4. Schmidt, L.A.; Trainor, L.J. Frontal brain electrical activity (EEG) distinguishes valence and intensity of musical emotions. Cognit. Emot. 2001, 15, 487–500.


5. Khalili, Z.; Moradi, M.H. Emotion detection using brain and peripheral signals. In Proceedings of the Biomedical Engineering Conference, Cairo, Egypt, 18–20 December 2008.
6. Mu, L.; Lu, B.-L. Emotion classification based on gamma-band EEG. In Proceedings of the Annual International Conference of the IEEE, Minneapolis, MN, USA, 3–6 September 2009.
7. Liu, Y.; Sourina, O. EEG based dominance level recognition for emotion-enabled interaction. In Proceedings of the IEEE International Conference on Multimedia and Expo, Melbourne, Australia, 9–13 July 2012.
8. Rozgic, V.; Vitaladevuni, S.N.; Prasad, R. Robust EEG emotion classification using segment level decision fusion. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013.
9. Yin, Z.; Wang, Y.; Liu, L.; Zhang, W.; Zhang, J. Cross-subject EEG feature selection for emotion recognition using transfer recursive feature elimination. Front. Neurorobot. 2017, 11, 19.
10. Liu, Y.J.; Yu, M.; Zhao, G.; Song, J.; Ge, Y.; Shi, Y. Real-time movie-induced discrete emotion recognition from EEG signals. IEEE Trans. Affect. Comput. 2017.
11. Chanel, G.; Karim, A.-A.; Thierry, P. Valence-arousal evaluation using physiological signals in an emotion recall paradigm. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Montreal, QC, Canada, 7–10 October 2007.
12. Zheng, W.L.; Dong, B.N.; Lu, B.-L. Multimodal emotion recognition using EEG and eye tracking data. In Proceedings of the 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Chicago, IL, USA, 26–30 August 2014.
13. Nie, D.; Wang, X.-W.; Shi, L.-C.; Lu, B.-L. EEG based emotion recognition during watching movies. In Proceedings of the 5th International IEEE/EMBS Conference on Neural Engineering, Cancun, Mexico, 27 April–1 May 2011; pp. 667–670.
14. Inuso, G.; La Foresta, F.; Mammone, N.; Morabito, F.C. Brain activity investigation by EEG processing: Wavelet analysis, kurtosis and Renyi's entropy for artifact detection. In Proceedings of the International Conference on Information Acquisition, Seogwipo-si, Korea, 8–11 July 2007; pp. 195–200.
15. Rossetti, F.; Rodrigues, M.C.A.; de Oliveira, J.A.C.; Garcia-Cairasco, N. EEG wavelet analyses of the striatum-substantia nigra pars reticulata-superior colliculus circuitry: Audiogenic seizures and anticonvulsant drug administration in Wistar audiogenic rats (WAR strain). Epilepsy Res. 2006, 72, 192–208.
16. Akin, M.; Arserim, M.A.; Kiymik, M.K.; Turkoglu, I. A new approach for diagnosing epilepsy by using wavelet transform and neural networks. In Proceedings of the IEEE 23rd Annual International Conference on Engineering in Medicine and Biology, Istanbul, Turkey, 25–28 October 2001; pp. 1596–1599.
17. Adeli, H.; Zhou, Z.; Dadmehr, N. Analysis of EEG records in an epileptic patient using wavelet transform. J. Neurosci. Methods 2003, 123, 69–87.
18. Adolphs, R.; Tranel, D.; Hamann, S.; Young, A.W.; Calder, A.J.; Phelps, E.A.; Anderson, A.; Lee, G.P.; Damasio, A.R. Recognition of facial emotion in nine individuals with bilateral amygdala damage. Neuropsychologia 1999, 37, 1111–1117.
19. Mollahosseini, A.; Chan, D.; Mahoor, M.H. Going deeper in facial expression recognition using deep neural networks. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10.
20. Pons, G.; Masip, D. Supervised Committee of Convolutional Neural Networks in automated facial expression analysis. IEEE Trans. Affect. Comput. 2017.
21. Ding, H.; Zhou, S.K.; Chellappa, R. Facenet2expnet: Regularizing a deep face recognition net for expression recognition. arXiv 2016.
22. Poria, S.; Chaturvedi, I.; Cambria, E.; Hussain, A. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining, Barcelona, Spain, 12–15 December 2016; pp. 439–448.
23. Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. DEAP: A database for emotion analysis using physiological signals. IEEE Trans. Affect. Comput. 2012, 3, 18–31.
24. Liu, Y.; Sourina, O. EEG based valence level recognition for real-time applications. In Proceedings of the IEEE International Conference on Cyberworlds, Darmstadt, Germany, 25–27 September 2012; pp. 53–60.


25. Naser, D.S.; Saha, G. Recognition of emotions induced by music videos using DT-CWPT. In Proceedings of the IEEE Indian Conference on Medical Informatics and Telemedicine (ICMIT), Kharagpur, India, 28–30 March 2013; pp. 53–57.
26. Chen, J.; Hu, B.; Moore, P.; Zhang, X.; Ma, X. Electroencephalogram based emotion assessment system using ontology and data mining techniques. Appl. Soft Comput. 2015, 30, 663–674.
27. Yoon, H.J.; Chung, S.Y. EEG based emotion estimation using Bayesian weighted-log-posterior function and perceptron convergence algorithm. Comput. Biol. Med. 2013, 43, 2230–2237.
28. Wang, D.; Shang, Y. Modeling physiological data with deep belief networks. Int. J. Inf. Educ. Technol. 2013, 3, 505.
29. Li, X.; Zhang, P.; Song, D.; Yu, G.; Hou, Y.; Hu, B. EEG based emotion identification using unsupervised deep feature learning. In Proceedings of the SIGIR2015 Workshop on Neuro-Physiological Methods in IR Research, Santiago, Chile, 13 August 2015.
30. Yin, Z.; Zhao, M.; Wang, Y.; Yang, J.; Zhang, J. Recognition of emotions using multimodal physiological signals and an ensemble deep learning model. Comput. Methods Programs Biomed. 2017, 140, 93–110.
31. Christie, I.C.; Friedman, B.H. Autonomic specificity of discrete emotion and dimensions of affective space: A multivariate approach. Int. J. Psychophysiol. 2004, 51, 143–153.
32. Tabar, Y.R.; Halici, U. A novel deep learning approach for classification of EEG motor imagery signals. J. Neural Eng. 2016, 14, 016003.
33. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456.
34. Karyana, D.N.; Wisesty, U.N.; Nasri, J. Klasifikasi EEG menggunakan Deep Neural Network dengan Stacked Denoising Autoencoder. Proc. Eng. 2016, 3, 5296–5303.
35. Zubair, M.; Yoon, C. EEG Based Classification of Human Emotions Using Discrete Wavelet Transform. In Proceedings of the Conference on IT Convergence and Security 2017, Seoul, Korea, 25–28 September 2017; pp. 21–28.
36. Jadhav, N.; Manthalkar, R.; Joshi, Y. Electroencephalography based emotion recognition using gray-level co-occurrence matrix features. In Proceedings of the International Conference on Computer Vision and Image Processing, Roorkee, India, 26–28 February 2016; pp. 335–343.
37. Hatamikia, S.; Nasrabadi, A.M. Recognition of emotional states induced by music videos based on nonlinear feature extraction and SOM classification. In Proceedings of the IEEE 21st Iranian Conference on Biomedical Engineering (ICBME), Tehran, Iran, 26–28 November 2014; pp. 333–337.
38. Martínez-Rodrigo, A.; García-Martínez, B.; Alcaraz, R.; Fernández-Caballero, A.; González, P. Study of Electroencephalographic Signal Regularity for Automatic Emotion Recognition. In Proceedings of the International Conference on Ubiquitous Computing and Ambient Intelligence, Philadelphia, PA, USA, 7–10 November 2017; pp. 766–777.
39. Zhang, X.-Y.; Wang, W.-R.; Shen, C.Y.; Sun, Y.; Huang, L.-X. Extraction of EEG Components Based on Time-Frequency Blind Source Separation. In Proceedings of the International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Matsue, Japan, 12–15 August 2017; pp. 3–10.
40. Mei, H.; Xu, X. EEG based emotion classification using convolutional neural network. In Proceedings of the IEEE International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), Shenzhen, China, 15–17 December 2017; pp. 130–135.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).