Audio and Face Video Emotion Recognition in the Wild using Deep Neural Networks and Small Datasets

Wan Ding1, Mingyu Xu2, Dongyan Huang3, Weisi Lin4, Minghui Dong3, Xinguo Yu1, Haizhou Li3,5

1 Central China Normal University, China
2 University of British Columbia, Canada
3 A*STAR, Singapore
4 Nanyang Technological University, Singapore
5 ECE Department, National University of Singapore, Singapore

[email protected], [email protected], {huang, mhdong, hli}@i2r.a-star.edu.sg, [email protected], [email protected]

ABSTRACT

This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2016 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear and disgust) plus neutral. Compared with earlier years' movie-based datasets, this year's test dataset introduced reality-TV videos containing more spontaneous emotion. Our proposed solution is a score-level fusion of facial expression recognition and audio emotion recognition subsystems. For facial emotion recognition, starting from a network pre-trained on ImageNet training data, a deep Convolutional Neural Network is fine-tuned on FER-2013 training data for feature extraction. Several classifiers, i.e., kernel SVM, logistic regression and partial least squares, are studied for comparison, and an optimal fusion of classifiers learned from different kernels is carried out at the score level to improve system performance. For audio emotion recognition, a deep Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) is trained directly on the challenge dataset. Experimental results show that both subsystems, individually and as a whole, achieve state-of-the-art performance. The overall accuracy of the proposed approach on the challenge test dataset is 53.9%, which is better than the challenge baseline of 40.47%.

CCS Concepts
Computing methodologies~Neural networks

Keywords
Audio-visual emotion recognition; Convolutional Neural Network; Long Short-Term Memory; Transfer learning

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICMI '16, November 12-16, 2016, Tokyo, Japan
© 2016 ACM. ISBN 978-1-4503-4556-9/16/11...$15.00
DOI: http://dx.doi.org/10.1145/2993148.2997637

1. INTRODUCTION

As one of the key technologies for affective human-computer interaction and human behavior analysis, audio-video emotion recognition has been an active research topic for decades. Early research primarily focused on posed emotion recognition, and only in recent years has attention shifted towards emotion recognition in natural (wild) conditions. Along with advances in the emotion recognition community, challenges such as the Facial Expression Recognition and Analysis Challenge (FERA) [1], the Audio/Visual Emotion Challenge (AVEC) [2] and the Emotion Recognition in the Wild Challenge (EmotiW) [3] have become standard benchmarks for studying and testing approaches to emotion recognition in the wild.

For audio-video emotion recognition, facial and vocal expressions are two streams of information that account for 93% of emotion expression cues [4]. Based on how temporal emotion cues in video are modeled, facial emotion recognition methods fall into three main categories. The first category models video data with low-level spatio-temporal facial features such as Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) and Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP) [5-7]; the idea is to treat video data as three-dimensional pixel volumes and apply image feature descriptors along both the spatial and temporal dimensions. The second category treats video as a set of images and uses image-based methods for emotion recognition; image set based methods view video frames as representations of the same object captured under different conditions (pose, illumination, etc.). The third category applies sequence models such as Recurrent Neural Networks (RNNs) to capture the temporal cues among video frames. Compared to the spatio-temporal feature based methods, the image set based methods and RNN based methods are more robust to temporal variations of facial emotion expression. Given only small training video datasets, image set based methods can achieve more promising results than RNN based methods [8-9, 37], provided that effective image features are available.
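As a rough illustration of the first category above (not code from any of the cited works), the following simplified sketch computes LBP-TOP-style features with scikit-image: LBP histograms on the central XY, XT and YT planes of a grayscale video volume are concatenated into one descriptor. A full LBP-TOP implementation would pool histograms over many slices and local blocks rather than single central planes.

# Simplified LBP-TOP-style sketch (illustrative only, not from the cited works).
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top_features(volume, n_points=8, radius=1):
    """volume: (T, H, W) grayscale video volume."""
    n_bins = n_points + 2              # 'uniform' LBP yields P + 2 codes
    t, h, w = volume.shape
    planes = [
        volume[t // 2, :, :],          # XY plane: one spatial frame
        volume[:, h // 2, :],          # XT plane: a fixed row over time
        volume[:, :, w // 2],          # YT plane: a fixed column over time
    ]
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane, n_points, radius, method="uniform")
        hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
        hists.append(hist)
    return np.concatenate(hists)       # 3 * (P + 2) dimensional descriptor

features = lbp_top_features(np.random.rand(32, 64, 64))   # 32 toy frames of 64x64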

One way of extracting features for image set based facial emotion recognition is to use handcrafted methods. Liu et al. [8] combined traditional handcrafted image features such as Dense-SIFT [9] and Histograms of Oriented Gradients (HOG) [10] with different image set modeling methods [11-14]. Their results showed that different handcrafted image features have complementary effects for facial emotion recognition.
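For reference, a minimal sketch of this handcrafted frame-feature route (illustrative only; the parameter values are assumptions, not those of [8]): HOG descriptors are computed per frame and stacked into an image set matrix on which image set modeling methods can operate.

# Illustrative sketch: per-frame HOG features stacked into an image set
# matrix (num_frames x feat_dim); parameters are arbitrary choices.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def video_to_image_set(frames, size=(64, 64)):
    """frames: iterable of grayscale face images (2-D numpy arrays)."""
    feats = []
    for frame in frames:
        img = resize(frame, size)
        feats.append(hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)))
    return np.stack(feats)             # one row of HOG features per frame

image_set = video_to_image_set(np.random.rand(30, 100, 100))   # 30 toy frames
print(image_set.shape)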


[Figure 1 block diagram: face video → DCNN image feature extraction → image set modeling → video classification → prediction scores; movie-clip audio signal → frame feature extraction → LSTM-RNN → classification → prediction scores; the two prediction-score streams are fused to give the final results.]

Figure 1: Overview of the proposed emotion recognition system. The face video is provided by the challenge organizers. For audio emotion recognition, we assume that non-vocal signals in movies can also contribute to emotion recognition; thus we do not separate vocal and non-vocal signal segments.

Yao et al. [15] handcrafted a novel image feature for emotion recognition based on the differences among local facial patches. They first aligned local patches using face frontalization techniques [16], then computed LBP descriptors on the local patches, and finally used feature selection to identify the most discriminative patches. The differences of the LBP values among these patches are taken as the features for facial image emotion recognition. Their approach achieved good results on both the image and video emotion recognition sub-challenges of EmotiW2015.

Besides handcrafted image features, another way to extract image features is to use Deep Convolutional Neural Networks (DCNNs). Here the term "deep" means the network has more than three convolutional layers. The learned convolutional kernels of a CNN are Gabor-like feature descriptors [17], and the output of the convolutional layers can be regarded as image feature vectors. Training an effective deep CNN usually requires a large task-oriented dataset; however, facial emotion recognition datasets are usually small. To overcome this, Liu et al. [8] used the Celebrity Faces in the Wild (CFW) [18] dataset (~170k images), originally collected for face identification, to train a deep CNN; the extracted features performed better than handcrafted features such as Dense-SIFT and HOG. Ng et al. [19] applied a fine-tuning strategy to a pre-trained deep CNN using the Facial Expression Recognition 2013 (FER-2013) dataset [20], a small (~30k images) facial emotion recognition dataset; the fine-tuned deep CNN achieved good results on the EmotiW2015 static facial expression recognition sub-challenge. Kim et al. [37] trained multiple deep CNNs on small datasets and applied decision-level fusion to achieve >60% accuracy for image based facial emotion recognition; however, effective feature extraction for the multi-DCNN fusion scheme still needs further study.
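As a concrete, hedged illustration of this fine-tuning-for-feature-extraction strategy, the sketch below follows the general recipe of [19] rather than any exact configuration from this paper: an ImageNet-pretrained backbone (VGG-16 is used here only as an example) is given a new 7-way emotion output layer, fine-tuned on FER-2013, and then used as a fixed extractor of penultimate-layer features. FER-2013 images are 48x48 grayscale, so in practice they would be resized and replicated to three channels before being fed to the network.

# Hedged sketch of fine-tuning an ImageNet-pretrained CNN on FER-2013 and
# using it as a frame-level feature extractor (backbone choice, layer sizes
# and hyper-parameters are illustrative assumptions).
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOTIONS = 7  # angry, disgust, fear, happy, sad, surprise, neutral

# Newer torchvision prefers models.vgg16(weights="IMAGENET1K_V1").
model = models.vgg16(pretrained=True)
model.classifier[6] = nn.Linear(4096, NUM_EMOTIONS)   # new emotion head

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    """One fine-tuning step on a FER-2013 mini-batch.
    images: (N, 3, 224, 224) tensors; FER-2013's 48x48 grayscale images are
    assumed to be resized and replicated across three channels beforehand."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def extract_features(images):
    """Use the fine-tuned network as a fixed feature extractor: return the
    4096-d activations of the penultimate fully connected layer."""
    model.eval()
    x = model.avgpool(model.features(images))
    x = torch.flatten(x, 1)
    for layer in list(model.classifier)[:-1]:   # stop before the emotion layer
        x = layer(x)
    return x                                     # (N, 4096) frame features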

For audio emotion recognition, studies show that audio-visual decision-level fusion usually achieves better results than a single modality [8-9, 21-22]. Recently, Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) [26] have become popular for speech emotion recognition and other acoustic modeling tasks [2, 22-23, 27-29]. Compared with traditional models such as the Hidden Markov Model (HMM) [23] or standard Recurrent Neural Networks, LSTM-RNNs can capture cues over longer intervals (e.g., >100 time steps) without suffering from the vanishing gradient problem [25].

In this paper we combine these methodologies and propose an audio-video emotion recognition approach that fuses a facial emotion recognition system and an audio emotion recognition system. For face video emotion recognition we applied image set modeling methods, and for audio emotion recognition we used an LSTM. Our contributions are two-fold. First, we applied the CNN fine-tuning-on-small-dataset strategy [19] and found that the extracted CNN features are also effective for video facial emotion recognition. Second, we developed an LSTM model for audio emotion recognition that achieves 7% higher accuracy than the audio baseline method using only the EmotiW2016 training data (773 instances). Details of the proposed method are given in the following section.
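To make the audio side concrete before the method details, here is a minimal sketch of an LSTM-RNN sequence classifier of the kind referred to above; the layer sizes, feature dimension and readout are illustrative assumptions, not the exact architecture used in our system.

# Hypothetical sketch of an LSTM-RNN audio emotion classifier: frame-level
# acoustic features are fed to stacked LSTM layers and the final hidden state
# is classified into one of seven emotion classes.
import torch
import torch.nn as nn

class AudioEmotionLSTM(nn.Module):
    def __init__(self, feat_dim=32, hidden_dim=128, num_layers=2, num_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, time_steps, feat_dim) sequence of frame-level features
        _, (h_n, _) = self.lstm(x)
        return self.out(h_n[-1])       # classify from the top layer's final state

model = AudioEmotionLSTM()
logits = model(torch.randn(4, 300, 32))   # e.g., 4 clips of 300 frames each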

2. PROPOSED METHOD

2.1 Face Video Emotion Recognition

The proposed face video emotion recognition method is composed of three steps. Image features are first extracted using a fine-tuned DCNN. The next step is to extract video features based on image set modeling methods. The last step is classification. Since the image set based video features usually lie on non-Euclidean manifolds [13], kernels are also exploited to map them to Euclidean space for final classification. In our approach, for steps two and three, we directly applied the source code published by [8] for video feature extraction and classification.
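As an illustrative sketch of steps two and three (an assumption-laden stand-in, not the released code of [8]): each video's frame features are summarised by a covariance matrix, the symmetric positive definite matrix is mapped to Euclidean space via the matrix logarithm, and a kernel SVM produces per-class scores that can later be fused.

# Sketch of image set modeling + kernel classification (illustrative only):
# covariance image set model, log-Euclidean mapping, kernel SVM scores.
import numpy as np
from scipy.linalg import logm
from sklearn.svm import SVC

def video_descriptor(frame_feats, eps=1e-3):
    """frame_feats: (num_frames, feat_dim) DCNN features of one face video.
    Returns the vectorised log-covariance, a Euclidean representation of the
    image set."""
    cov = np.cov(frame_feats, rowvar=False) + eps * np.eye(frame_feats.shape[1])
    return np.real(logm(cov)).ravel()

# Toy data: 70 training videos (10 per emotion class), 64-d frame features.
rng = np.random.default_rng(0)
X_train = np.stack([video_descriptor(rng.normal(size=(30, 64))) for _ in range(70)])
y_train = np.repeat(np.arange(7), 10)

clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
scores = clf.predict_proba(X_train)   # per-class scores, later fused with audio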

2.1.1 Deep CNN Image Features


Inspired by the organization of neurons in the visual cortex of animals, CNNs process spatial information through local connectivity and incorporate design features such as weight sharing for translational invariance and parameter reduction.


The CNN feature extraction process applies the learned convolutional kernels of the deep CNN as feature descriptors. Suppose the input image is I_{W,H,C}, where W denotes the width, H the height and C the number of channels (note that the input image does not have to be in RGB space). For one local region L_{w,h,C} in I, where w