Micro-Expression Recognition with Expression-State Constrained Spatio-Temporal Feature Representations

Dae Hoe Kim, Wissam J. Baddar and Yong Man Ro*
Image and Video Systems Lab, KAIST, Daejeon, Korea
[email protected]

* Corresponding author

MM '16, October 15-19, 2016, Amsterdam, Netherlands
DOI: http://dx.doi.org/10.1145/2964284.2967247

ABSTRACT
Recognizing spontaneous micro-expressions in video sequences is a challenging problem. In this paper, we propose a new method of small-scale spatio-temporal feature learning. The proposed learning method consists of two parts. First, the spatial features of micro-expressions at different expression-states (i.e., onset, onset-to-apex transition, apex, apex-to-offset transition and offset) are encoded using convolutional neural networks (CNN). The expression-states are taken into account in the objective functions to improve the expression class separability of the learned feature representation. Next, the learned spatial features with expression-state constraints are transferred to learn temporal features of the micro-expression. The temporal feature learning encodes the temporal characteristics of the different states of the micro-expression using long short-term memory (LSTM) recurrent neural networks. Extensive and comprehensive experiments have been conducted on the publicly available CASME II micro-expression dataset. The experimental results showed that the proposed method outperformed state-of-the-art micro-expression recognition methods in terms of recognition accuracy.

Keywords
Micro-Expression Recognition; Recurrent Neural Networks; Long Short-Term Memory

1. INTRODUCTION
Micro-expressions are spontaneous, brief and subtle facial movements that appear involuntarily, revealing the genuine emotion of the subject [18]. These brief micro-expressions are cumbersome to recognize because they span a number of frames without significant motion [9]. Moreover, the subtlety of micro-expressions can result in spatial features that are insufficient for recognizing the expression, even at the apex frames.

Relatively few attempts at recognizing micro-expressions have been reported [6; 7; 15; 16]. These methods tried to improve micro-expression recognition accuracy by extending hand-crafted spatial texture features (e.g., the local binary pattern (LBP) and the local quantized pattern (LQP)) to spatio-temporal features, such as LBP on three orthogonal planes (LBP-TOP), LBP mean orthogonal planes (LBP-MOP), LBP with six intersection points (LBP-SIP) and the spatio-temporal completed LQP (STCLQP). However, the small scale of facial motion in micro-expressions results in feature vectors with low discriminative power. To improve the discriminative power of such descriptors, an integral projection technique [6] was proposed to preserve the properties of micro-expressions and thereby enhance their discrimination. The authors in [13] reported adaptive motion magnification techniques to emphasize micro-expression motion. Although motion magnification improves expression class separability, the magnification parameters, such as frequency bands and magnification factors, must be tuned carefully to achieve good performance. Moreover, existing hand-crafted features are limited in the sense that they rely on prior knowledge and heuristics [2].

In this paper, we propose a new micro-expression feature representation that is learned with expression-states. The proposed learning method consists of two parts. First, the spatial features of micro-expressions at different expression-states (i.e., onset, onset-to-apex transition, apex, apex-to-offset transition and offset) are encoded using convolutional neural networks (CNN). By adopting expression-states in the objective function of the spatial feature learning, the expression class separability of the learned micro-expression features is improved. After the micro-expression spatial feature representation is learned, the learned model is transferred to the second part to extract time-scale-dependent features from the micro-expression sequence. The temporal feature learning encodes the temporal characteristics of the different expression-states of the micro-expression using long short-term memory (LSTM) recurrent neural networks. Extensive and comprehensive experiments have been conducted on the CASME II micro-expression dataset [17]. The experimental results showed that the proposed method outperformed state-of-the-art micro-expression recognition methods in terms of recognition accuracy.

The remainder of this paper is organized as follows. Section 2 details the proposed micro-expression recognition with expression-state constrained spatio-temporal feature representations. Section 3 presents the experimental setup and results. Conclusions are drawn in Section 4.

2. PROPOSED METHOD
Figure 1 shows an overview of the proposed expression-state constrained spatio-temporal feature representation learning for micro-expression recognition. As shown in the figure, the proposed method is composed of two parts. In the first part, the spatial feature representation of the micro-expression is learned using a CNN. In this part, expression-states are taken into account in learning the CNN to differentiate expression classes, i.e., expression-state dependent objective functions are devised in the learning. With the model learned in the first part, spatial features of all input frames are extracted and fed to the second learning part. In the second part, the temporal features of the micro-expression are encoded using an LSTM. The details of each part are described in the following subsections.

Figure 1. Overall process of the proposed expression-state constrained spatio-temporal feature representation learning for micro-expression recognition.

2.1 Learning Micro-Expression Spatial Features with Expression-State Constraints

In learning the spatial features, we consider expression-states including the onset, apex and offset of expressions. We sample representative frames from the video sequences that represent the different expression-states of each expression. Specifically, 5 frames are selected for the learning: the onset frame, the middle frame between onset and apex, the apex frame, the middle frame between apex and offset, and the offset frame. To learn the spatial features of the micro-expression sequence frames, a CNN is utilized. The recognition performance of spatial features learned by a CNN can vary depending on the objective function [14]. Thus, designing the objective function based on prior knowledge is important for the performance of the CNN. In this paper, we propose a new objective function consisting of multiple objective terms to learn discriminative micro-expression spatial features that can encode subtle micro-expressions regardless of subject variations and expression-state variations. Figure 2 shows a visualization of each objective term in the proposed objective function.
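For illustration, given the onset, apex and offset frame indexes provided with each sequence (see Section 3.1), the five representative frames can be selected as in the following minimal sketch (the function name is ours):

```python
def select_state_frames(onset, apex, offset):
    """Frame indexes of the five expression-states: onset, onset-to-apex
    transition, apex, apex-to-offset transition, and offset."""
    return [onset, (onset + apex) // 2, apex, (apex + offset) // 2, offset]

# Example: select_state_frames(0, 12, 30) -> [0, 6, 12, 21, 30]
```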

The first term (E1) of the objective function is responsible for recognizing the expression by minimizing the expression classification error. To that end, a softmax loss function is utilized:

E_1 = -\sum_{i}\sum_{c} t_i^c \log \hat{t}_i^c ,   (1)

where c is the expression class index, t_i^c is the expression ground truth of the i-th sample (1 if c is the correct class and 0 otherwise), and \hat{t}_i^c is the predicted probability, calculated at the last L-th layer, that the sample belongs to class c.

To mitigate intra-class variations due to subject appearance variations, we devise the term E2 as:

E_2 = \frac{1}{2}\sum_{c,p,i} g\left( \| y_{c,p,i} - \bar{y}_{c} \|_2^2 - ( d_{\min}^{c} )^2 \right) ,   (2)

where y_{c,p,i} is the spatial feature vector of the i-th training sample (x_{c,p,i}) of class c and the p-th expression-state, extracted at the (L-1)-th layer (i.e., the fully connected layer), \bar{y}_{c} is the mean of the spatial feature vectors of the training samples in class c, calculated at the beginning of every epoch, and d_{\min}^{c} is half the minimum distance between \bar{y}_{c} and \bar{y}_{j} for j ≠ c. The function g(·) is a smoothed approximation of [·]_+ = max(0, ·), defined as g(x) = log(1 + exp(βx))/β, where β is a sharpness parameter [10]. By minimizing E2, the spatial features of each expression class are forced to gather within a multidimensional sphere whose radius and center are d_{\min}^{c} and \bar{y}_{c}, respectively. As a result, this term of the objective function reduces the within-class scatter of the learned spatial features and improves expression classification.
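For concreteness, the smoothed hinge g(·) and the clustering term E2 can be written compactly as in the following illustrative sketch (NumPy; function and variable names are ours, the sharpness value is an arbitrary assumption, and in practice the class means and d^c_min are recomputed at the beginning of every epoch as described above):

```python
import numpy as np

def g(x, beta=10.0):
    """Smoothed approximation of max(0, x) [10]: log(1 + exp(beta * x)) / beta."""
    return np.logaddexp(0.0, beta * x) / beta

def e2_term(features, class_ids, class_means, d_min):
    """Intra-class clustering term E2 of Eq. (2).

    features    : (N, D) spatial feature vectors y_{c,p,i} from the (L-1)-th layer
    class_ids   : (N,)   expression class index of each sample
    class_means : (C, D) per-class mean feature vectors (recomputed every epoch)
    d_min       : (C,)   half the minimum distance from each class mean to the others
    """
    diffs = features - class_means[class_ids]      # y_{c,p,i} minus the mean of class c
    sq_dist = np.sum(diffs ** 2, axis=1)           # squared Euclidean norm
    return 0.5 * np.sum(g(sq_dist - d_min[class_ids] ** 2))
```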

Frames in a micro-expression sequence appear similar due to the subtlety of the expression motion. Such similarity between frames can lead to similar feature representations along the sequence, making temporal changes harder to model. To increase the distances between the learned features of different expression-states, we devise three further objective terms (E3, E4 and E5). E3 is added to distinguish the expression-states by minimizing the expression-state classification error:

E_3 = -\sum_{i}\sum_{p} t_i^p \log \hat{t}_i^p ,   (3)

where p is the expression-state index, t_i^p is the expression-state ground truth of the i-th sample (1 if p is the correct expression-state and 0 otherwise), and \hat{t}_i^p is the predicted probability, at the last L-th layer, that the sample belongs to expression-state p. Similar to E1, E3 is a softmax objective function that helps identify the expression-state of a frame.

To minimize the variations within each expression-state, the term E4 is devised as:

E_4 = \frac{1}{2}\sum_{c,p,i} g\left( \| y_{c,p,i} - \bar{y}_{c,p} \|_2^2 - ( \alpha d_{\min}^{c} )^2 \right) ,   (4)

where \bar{y}_{c,p} is the mean of the spatial feature vectors of the training samples of the c-th class and the p-th expression-state, calculated at the beginning of every epoch, and α is a parameter determining the range of the expression-state distribution. This term of the objective function clusters the features of the same expression according to their expression-states, which minimizes the effect of expression-state variations on the micro-expression classification task.

The terms E3 and E4 improve the separability between expression-states. However, the feature representations of adjacent frames may then no longer be close to each other in the feature space, which could degrade video classification performance in the subsequent temporal learning (i.e., the temporal feature learning with the LSTM). Thus, the last term, E5, is devised to regulate expression-state continuity, so that features extracted from frames lying between two expression-states used during the learning can reside between those expression-states in the feature space. E5 is calculated as:

E_5 = \frac{1}{2}\sum_{c}\sum_{p,\, p \neq 0} g\left( \| \bar{y}_{c,q} - \bar{y}_{c,p} \|_2^2 - ( 2\alpha d_{\min}^{c} )^2 \right), \quad q = \begin{cases} p+1, & \text{if } p < 0 \\ p-1, & \text{if } p > 0, \end{cases}   (5)

where q represents the index of the expression-state adjacent to p in the learning stage. It should be noted that, to enforce expression-state continuity, the expression-state indexes p and q are defined as (-2, -1, 0, 1, 2) for the onset, onset-to-apex transition, apex, apex-to-offset transition and offset expression-states, respectively.
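To make the state-indexing convention of Eq. (5) concrete, the five sampled frames can be mapped to indexes p ∈ {-2, ..., 2} and paired with their adjacent state q as in this small sketch (names are ours):

```python
# Expression-state indexes used in E5:
# onset = -2, onset-to-apex = -1, apex = 0, apex-to-offset = +1, offset = +2.
STATE_INDEX = {"onset": -2, "onset_to_apex": -1, "apex": 0, "apex_to_offset": 1, "offset": 2}

def adjacent_state(p):
    """Return the expression-state q adjacent to p (toward the apex), as used in E5."""
    if p == 0:
        raise ValueError("E5 is only evaluated for p != 0")
    return p + 1 if p < 0 else p - 1
```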

Figure 2. Visualization of the objective terms used when learning the micro-expression spatial features with expression-state constraints: expression classification (E1), minimizing intra-class variation (E2), expression-state classification (E3), minimizing expression-state variation (E4) and expression-state continuity (E5).

2.2 Learning Micro-Expression Temporal Features with LSTM
In the first learning part, spatial feature representations are learned with the proposed objective function, which improves expression class separability. In the second part, the time-scale-dependent information that resides along the video sequences is subsequently learned using an LSTM. The LSTM is able to process long sequences of data efficiently [5]. Moreover, it can encode the temporally dependent characteristics of variable-length expression sequences without temporal normalization. The inputs to the LSTM are the spatial features extracted from all sequence frames using the model learned in the first part. The LSTM structure in this paper is two layers deep, and each LSTM unit contains 512 hidden states. Note that the spatio-temporal feature representation is thus learned with the CNN in the first part and the subsequent LSTM in the second part.
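A minimal sketch of this temporal learning stage in Keras [3], which we use for the LSTM implementation (see Section 3.1), is given below; the zero-padding/masking of variable-length sequences, the optimizer and the training configuration are illustrative assumptions rather than the exact settings:

```python
from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

NUM_CLASSES = 5   # happiness, disgust, repression, surprise, others
FEAT_DIM = 512    # CNN features taken from the second fully connected layer

model = Sequential()
# Masking lets zero-padded, variable-length sequences be handled without temporal normalization.
model.add(Masking(mask_value=0.0, input_shape=(None, FEAT_DIM)))
model.add(LSTM(512, return_sequences=True))   # first LSTM layer, 512 hidden states
model.add(LSTM(512))                          # second LSTM layer, 512 hidden states
model.add(Dense(NUM_CLASSES, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```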

3. EXPERIMENTS
3.1 Experimental Setup
To verify the effectiveness of the proposed micro-expression recognition, experiments were conducted on the CASME II dataset [17]. The dataset contains 246 spontaneous micro-expression sequences collected from 26 subjects at a temporal resolution of 200 fps. Each micro-expression sequence was categorized into one of five expression classes (i.e., happiness, disgust, repression, surprise and others). The dataset also provides the indexes of the onset, apex (peak) and offset frames of each sequence. In our experiments, the micro-expression sequences from onset to offset were used, and the face region was cropped and aligned according to the face region in the first frame [13].

The evaluations were conducted with leave-one-subject-out cross validation, such that the test subject was excluded from the training set. Due to the limited number of samples in the CASME II dataset, data augmentation was performed during training to avoid overfitting [8]. In particular, each training video sequence was flipped horizontally, rotated by angles in [-10º, 10º] with an increment of 5º, translated by [(0, 0), (-2, -2), (-2, +2), (+2, -2), (+2, +2)] pixels along the x and y axes, and scaled with scaling factors of 0.9, 1.0 and 1.1. As a result, 150 augmented sequences were generated from each training sequence.
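As an illustration of how this augmentation grid yields 2 × 5 × 5 × 3 = 150 variants of each training sequence, a frame-level sketch using OpenCV is shown below (helper names are ours; interpolation and border handling are assumptions):

```python
import cv2

ANGLES = [-10, -5, 0, 5, 10]                             # rotation angles in degrees
SHIFTS = [(0, 0), (-2, -2), (-2, 2), (2, -2), (2, 2)]    # translations in pixels (x, y)
SCALES = [0.9, 1.0, 1.1]                                 # scaling factors

def augment_frame(frame, flip, angle, shift, scale):
    """Apply one (flip, rotation, translation, scale) combination to a single frame."""
    h, w = frame.shape[:2]
    out = cv2.flip(frame, 1) if flip else frame
    # Rotation and scaling about the image center, followed by a translation.
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)
    m[0, 2] += shift[0]
    m[1, 2] += shift[1]
    return cv2.warpAffine(out, m, (w, h))

def augment_sequence(frames):
    """Yield the 150 augmented versions of a sequence (given as a list of frames)."""
    for flip in (False, True):
        for angle in ANGLES:
            for shift in SHIFTS:
                for scale in SCALES:
                    yield [augment_frame(f, flip, angle, shift, scale) for f in frames]
```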

To implement the proposed micro-expression spatial feature learning with the expression-state constrained CNN, the ConvNet library [4] was utilized, while the LSTM was implemented using Keras [3]. The implemented CNN structure consisted of an input layer of size 64×64 pixels with RGB color channels and three convolutional layers with 3×3 filters, where the numbers of filters were 32, 64 and 64, respectively. Each convolutional layer was followed by a max pooling layer with a 3×3 kernel and stride 2. Lastly, two fully connected layers with 512 units each were used. The rectified linear unit (ReLU) [11] was used as the activation function. The CNN was trained for 40 epochs, and the learned model was then transferred to the second part, such that the input features to the LSTM were extracted from the second fully connected layer.
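For illustration only, the CNN structure described above corresponds roughly to the following Keras-style sketch (our spatial feature learning was actually implemented with the ConvNet library [4]); the padding, the optimizer and the handling of the additional objective terms E2-E5 and the expression-state output (E3) are not specified by this sketch and would need to be added:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

cnn = Sequential()
cnn.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))
cnn.add(MaxPooling2D(pool_size=(3, 3), strides=2))
cnn.add(Conv2D(64, (3, 3), activation='relu'))
cnn.add(MaxPooling2D(pool_size=(3, 3), strides=2))
cnn.add(Conv2D(64, (3, 3), activation='relu'))
cnn.add(MaxPooling2D(pool_size=(3, 3), strides=2))
cnn.add(Flatten())
cnn.add(Dense(512, activation='relu'))
cnn.add(Dense(512, activation='relu'))   # spatial features fed to the LSTM are taken from this layer
cnn.add(Dense(5, activation='softmax'))  # expression classification output (E1 term)
```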


3.2 Performance of the Proposed Micro-Expression Spatial Feature Learning with Expression-State Constraints
In this experiment, a quantitative analysis was performed to evaluate the effect of each term of the objective function when learning the micro-expression spatial features with expression-state constraints. To that end, Fisher's discriminant ratio [1] was utilized as a class separability measure. Fisher's discriminant ratio (J) measures the ratio of the between-class scatter (S_b) to the within-class scatter (S_w):

J = \frac{\mathrm{trace}(S_b)}{\mathrm{trace}(S_w)}, \quad S_b = \sum_{i=1}^{N_c} (m_i - m)(m_i - m)^T, \quad S_w = \sum_{i=1}^{N_c}\sum_{j=1}^{N_i} (x_{ij} - m_i)(x_{ij} - m_i)^T ,   (6)

where N_c is the number of classes, N_i is the number of samples in class i, m_i is the mean of the samples in class i, m is the mean of all classes, and x_{ij} is the feature vector of the j-th sample in class i. A higher Fisher's discriminant ratio is achieved with a larger S_b and a smaller S_w, which indicates better expression class separability.

Figure 3 shows the Fisher's discriminant ratio for various combinations of objective terms in the proposed micro-expression spatial feature learning. As shown in the figure, with only the term that minimizes the expression classification error (i.e., E1), the Fisher's discriminant ratio is low. When E2, which focuses on reducing the intra-class variations of each expression, is employed, the Fisher's discriminant ratio increases. A more significant increase is observed when the expression-state related terms (i.e., E3, E4 and E5) are included in the objective function.

Figure 3. Fisher's discriminant ratio for class separability measure (training samples: 1.162, 1.364, 5.392, 6.525, 6.763; testing samples: 0.078, 0.158, 0.221, 0.251, 0.262; for E1, E1+E2, E1+E2+E3, E1+E2+E3+E4 and E1+E2+E3+E4+E5, respectively).

To further evaluate the effectiveness of the proposed objective terms, we performed expression recognition using the learned spatial features with expression-state constraints, incrementally adding terms to the objective function. As a result, five models were learned, as shown in Table 1. Please note that this evaluation was performed using the CNN with the apex frames of each sequence. From the recognition results, we can observe that E2 improves the recognition accuracy because it reduces the effect of the intra-class variations of each expression, such as person identity variations. When the expression-state related terms (E3, E4 and E5) are employed in the objective function, the micro-expression recognition accuracy improves further, owing to the improved expression class separability obtained by incorporating expression-state information.

Table 1. Micro-expression recognition accuracy (using the CNN) with different objective terms of the spatial feature learning.
Objective terms              Acc. (%)
E1                           40.65
E1+E2                        48.37
E1+E2+E3                     54.07
E1+E2+E3+E4                  54.88
All (E1+E2+E3+E4+E5)         58.54
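The separability measure of Eq. (6) is straightforward to compute from the learned features; a small sketch (NumPy, names are ours) is:

```python
import numpy as np

def fisher_discriminant_ratio(features, labels):
    """trace(S_b) / trace(S_w) as defined in Eq. (6)."""
    classes = np.unique(labels)
    class_means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    m = class_means.mean(axis=0)                  # mean over the classes
    d = features.shape[1]
    s_b = np.zeros((d, d))
    s_w = np.zeros((d, d))
    for mean_i, c in zip(class_means, classes):
        diff = (mean_i - m)[:, None]
        s_b += diff @ diff.T                      # between-class scatter
        centered = features[labels == c] - mean_i
        s_w += centered.T @ centered              # within-class scatter
    return np.trace(s_b) / np.trace(s_w)
```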

3.3 Effectiveness of the Proposed Micro-Expression Spatio-Temporal Features
Table 2 shows the performance of the proposed micro-expression recognition with expression-state constrained spatio-temporal feature representations, including both the CNN and the LSTM (see Figure 1). Comparative results with state-of-the-art micro-expression recognition methods are shown as well. In this experiment, we evaluated the micro-expression recognition accuracies using the micro-expression video sequences. As shown in the table, the proposed method clearly outperformed the existing state-of-the-art micro-expression recognition methods, as well as the CNN-only models (Table 1), on the same dataset. These results indicate that the proposed method improves micro-expression recognition accuracy by incorporating both spatial and temporal dependent information.

Table 2. Micro-expression recognition accuracy comparison with existing state-of-the-art methods.
Method                                               Acc. (%)
LBP-TOP [19]                                         44.12
LBP-TOP + adaptive motion magnification [13]         51.91
LBP-MOP [16]                                         45.75
Riesz wavelet [12]                                   46.15
Proposed method                                      60.98

4. CONCLUSION
In this paper, we presented a method for recognizing micro-expressions by learning a spatio-temporal feature representation with expression-state constraints. First, a CNN was utilized to encode the spatial characteristics of the facial expression at different expression-states. Then, the model learned with the expression-state constraints was transferred to a micro-expression temporal feature learning stage based on an LSTM. The experiments showed that employing the expression-states in the CNN objective function improves the expression class separability of the learned micro-expression features. Moreover, the subsequent temporal feature learning with the LSTM was able to encode the temporal characteristics of the different states of the micro-expression, which generated sufficient representations of the underlying micro-expression. Finally, the experimental results showed that the proposed method was superior to existing state-of-the-art micro-expression recognition methods in terms of recognition accuracy. For future work, a more thorough evaluation of the proposed method will be conducted on additional datasets of spontaneous facial expressions with various kinds of metrics (e.g., the F1 score).

5. ACKNOWLEDGMENTS
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2015R1A2A2A01005724).


6. REFERENCES
[1] Bartlett, M.S., Movellan, J.R., and Sejnowski, T.J., 2002. Face recognition by independent component analysis. IEEE Transactions on Neural Networks 13, 6, 1450-1464.
[2] Chherawala, Y., Roy, P.P., and Cheriet, M., 2013. Feature design for offline Arabic handwriting recognition: handcrafted vs automated? In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 290-294.
[3] Chollet, F., 2015. Keras. Available: https://github.com/fchollet/keras
[4] Demyanov, S. ConvNet. Available: https://github.com/sdemyanov/ConvNet
[5] Gers, F.A., Schraudolph, N.N., and Schmidhuber, J., 2003. Learning precise timing with LSTM recurrent networks. The Journal of Machine Learning Research 3, 115-143.
[6] Huang, X., Wang, S.-J., Zhao, G., and Pietikäinen, M., 2015. Facial micro-expression recognition using spatiotemporal local binary pattern with integral projection. In 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), 1-9.
[7] Huang, X., Zhao, G., Hong, X., Zheng, W., and Pietikäinen, M., 2016. Spontaneous facial micro-expression analysis using spatiotemporal completed local quantized patterns. Neurocomputing 175, Part A, 564-578.
[8] Khorrami, P., Paine, T., and Huang, T., 2015. Do deep neural networks learn facial action units when doing expression recognition? In Proceedings of the IEEE International Conference on Computer Vision Workshops, 19-27.
[9] Le Ngo, A.C., Liong, S.-T., See, J., and Phan, R.C.-W., 2015. Are subtle expressions too sparse to recognize? In Digital Signal Processing (DSP), 2015 IEEE International Conference on. IEEE, 1246-1250.
[10] Mignon, A. and Jurie, F., 2012. PCCA: A new approach for distance learning from sparse pairwise constraints. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2666-2672.
[11] Nair, V. and Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807-814.
[12] Oh, Y.-H., Le Ngo, A.C., See, J., Liong, S.-T., Phan, R.C.-W., and Ling, H.-C., 2015. Monogenic Riesz wavelet representation for micro-expression recognition. In Digital Signal Processing (DSP), 2015 IEEE International Conference on. IEEE, 1237-1241.
[13] Park, S.Y., Lee, S.H., and Ro, Y.M., 2015. Subtle facial expression recognition using adaptive magnification of discriminative facial motion. In Proceedings of the 23rd Annual ACM Conference on Multimedia. ACM, 911-914.
[14] Wan, L., Zeiler, M., Zhang, S., Cun, Y.L., and Fergus, R., 2013. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1058-1066.
[15] Wang, Y., See, J., Phan, R.C.-W., and Oh, Y.-H., 2014. LBP with six intersection points: Reducing redundant information in LBP-TOP for micro-expression recognition. In Computer Vision - ACCV 2014. Springer, 525-537.
[16] Wang, Y., See, J., Phan, R.C.-W., and Oh, Y.-H., 2015. Efficient spatio-temporal local binary patterns for spontaneous facial micro-expression recognition. PLoS One 10, 5, e0124674.
[17] Yan, W.-J., Li, X., Wang, S.-J., Zhao, G., Liu, Y.-J., Chen, Y.-H., and Fu, X., 2014. CASME II: An improved spontaneous micro-expression database and the baseline evaluation. PLoS One 9, 1, e86041.
[18] Yan, W.-J., Wu, Q., Liang, J., Chen, Y.-H., and Fu, X., 2013. How fast are the leaked facial expressions: The duration of micro-expressions. Journal of Nonverbal Behavior 37, 4, 217-230.
[19] Zhao, G. and Pietikäinen, M., 2007. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 6, 915-928.
