SEEING THROUGH THE EXPRESSION: BRIDGING THE GAP BETWEEN EXPRESSION AND EMOTION RECOGNITION

Lun-Kai Hsu 1,2, Wen-Sheng Tseng 1,2, Li-Wei Kang 3, and Yu-Chiang Frank Wang 2

1 Dept. of Electrical Engineering, National Taiwan University, Taipei, Taiwan
2 Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
3 Dept. of Computer Science & Information Engineering, National Yunlin University of Science & Technology, Yunlin, Taiwan

ABSTRACT

In this paper, we propose a novel approach for visualizing and recognizing different emotion categories using facial expression images. Extending the unsupervised nonlinear dimension reduction technique of locally linear embedding (LLE), we propose a supervised LLE (sLLE) algorithm which utilizes the emotion labels of facial expression images. While existing works typically aim at training on such labeled data for emotion recognition, our approach allows one to derive subspaces for visualizing facial expression images within and across different emotion categories, so that emotion recognition can be properly performed. In our work, we relate the resulting two-dimensional subspace to the valence-arousal emotion space, in which our method is observed to automatically identify and discriminate emotions of different degrees. Experimental results on two facial emotion datasets verify the effectiveness of our algorithm. With reduced numbers of feature dimensions (2D or beyond), our approach is shown to achieve promising emotion recognition performance.

Index Terms— Expression recognition, emotion recognition, subspace learning

Fig. 1. The valence-arousal (V-A) space for emotion analysis. (The figure plots a 2D space with valence on the horizontal axis, from negative to positive, and arousal on the vertical axis, from low to high; emotions such as excited, happy, pleased, angry, nervous, sad, bored, sleepy, calm, and relaxed are placed in the corresponding quadrants.)

1. INTRODUCTION

In addition to language, facial expression is another form of social communication in our daily lives. According to social psychologists, the meaningful messages conveyed in a conversation can be dominated by facial expressions rather than by the spoken words [1]. Based on the visual appearances expressed by our faces, the associated emotional response or intention can be delivered in an implicit yet forceful way. While facial expression has a strong correlation with human emotion, how to interpret human facial expression in terms of emotion remains a challenge. Studies of expression recognition typically categorize facial expression images into six popular categories (also known as the universal facial expressions): anger, disgust, fear, happiness, sadness, and surprise. In [2], Ekman and Friesen performed extensive studies of human facial expressions, which provided evidence to support this categorization. Their studies showed

that, although the tasks of facial expression recognition and emotion recognition share the same processes and goals, differences between them still exist due to imposed social rules. Nevertheless, recognizing facial expression or emotion benefits a variety of multimedia applications such as human-computer interaction, consumer electronics, and intelligent tutoring systems.

Existing expression or emotion recognition methods can be generally divided into two categories: action-unit (AU) and appearance-based approaches. AUs are defined by the Facial Action Coding System (FACS) [3], which specifies the contraction or relaxation of one or more facial muscles as a feature unit (e.g., AU 17 denotes chin raising). Appearance-based methods instead describe the facial features associated with different expressions in terms of visual appearance. Different kinds of features have been investigated for expression/emotion recognition (e.g., Gabor wavelets [4], histograms of oriented gradients (HOG) [5], or local binary patterns (LBP) [6]). Once the features are extracted from face images, one can apply existing classifiers for training and prediction.

For emotion recognition, psychologists argue that emotions should be considered as regions falling in a 2D space with axes of valence and arousal (i.e., the V-A space) [7], as illustrated in Figure 1. Valence represents the positive or negative aspect of a human emotion, and arousal describes the degree of physiological reactivity. Although the use of the V-A space allows one to visualize different emotion

states, it is not clear how to properly determine the emotion labels of face images with expression variations, and how to automatically locate such images in the V-A space. In this paper, we present a novel supervised locally linear embedding (sLLE) algorithm for addressing the above problem. While supervised versions of LLE have been proposed [8], our sLLE performs subspace learning using the emotion labels of face images, and thus our approach can be directly applied to visualize facial expression images in terms of the inherent emotion information. As a result, the derived subspace (2D or beyond) allows us to automatically identify face images of different emotions and of different degrees, so that emotion recognition can be performed accordingly.

2. RELATED WORK

2.1. Facial Expression Analysis and Recognition

The Facial Action Coding System (FACS) [3] is among the most popular studies focusing on representing facial activities. In particular, it defines action units (AUs) describing the contraction or relaxation of one or more facial muscles as feature units, and thus one can determine subtle changes in facial appearance caused by the associated contractions. In contrast to FACS, which requires the definition of AUs, appearance-based approaches utilize different types of features for representing facial expressions. Among visual features such as Gabor, Haar-like, and local binary patterns (LBP), LBP has been shown to be effective in expression analysis and recognition works such as [6, 9]. In this paper, we also apply LBP to describe facial expression images for expression/emotion analysis and recognition.

For facial expression recognition, existing methods either aim at developing feature selection techniques [10, 5, 9] or at designing novel classification algorithms [11] for improved performance. Our proposed method can be considered as a feature selection (dimension reduction) technique for analyzing facial expression images for emotion recognition. While existing methods like [9] advance similar ideas to identify representative regions of face images, some require annotated training images or user interaction to address this task (e.g., [10, 5]). Different from most prior works on expression recognition, our proposed algorithm utilizes the emotion labels of facial expression images and performs supervised subspace learning. As detailed later in Section 3, the derived subspace can be considered as a typical or improved valence-arousal emotion space for visualizing and addressing the task of emotion recognition.

2.2. V-A Space and Emotion Recognition

Psychological emotion models can typically be divided into two categories: discrete emotion states and dimensional continuous emotion spaces [12]. The 2D valence-arousal (V-A) space is considered to be more general than the use of discrete

states in solving such tasks [10], and it has been widely applied in many emotion analysis and recognition works. Using a 2D space with one axis representing valence and the other arousal, emotions are treated as numerical values (instead of discrete labels such as happiness and sadness) over the two emotion dimensions. For example, happiness and sadness should be far away from each other in this space, while sadness and anger might be closer to each other. It is clear that the use of the V-A space provides a simple yet powerful way of representing and visualizing different emotion categories. It allows us to perform direct comparisons and recognition of different emotions along these two informative dimensions [13].

Existing facial expression or emotion recognition works require the collection of annotated image data (i.e., images with emotion labels given in advance) for training. However, most of these works do not have an automatic way to properly project or visualize such data in the V-A space. For example, some approaches (e.g., [10, 13]) require the user to manually annotate and locate the projected data in the 2D V-A space for training or further processing. Since intra-class variations exist for images in each emotion category, manually performing such training or processing procedures might not be preferable (e.g., instances with the same emotion but of different degrees may still need to be placed apart from each other). In this paper, we propose a novel supervised subspace learning algorithm, which performs feature selection and dimension reduction for improved emotion recognition. A major advantage of our sLLE is its capability of visualizing and recognizing facial expression images in terms of the inherent emotion information.

3. OUR PROPOSED METHOD

When analyzing facial expression images for emotion recognition, one typically extracts visual features from face images for training and testing. When designing classifiers using training data, the emotion labels of such images are generally annotated by users without discriminating between different degrees of each emotion, or without dealing with possible ambiguity between different emotion categories. In other words, the classifiers are expected to handle both intra- and inter-class variations simply based on labeled training data. We consider facial expression images with emotion labels as high-dimensional data, and each image will be visualized and located in a subspace for emotion analysis and recognition. Inspired by LLE, we propose a novel supervised locally linear embedding (sLLE) algorithm which preserves the neighborhood embeddings using both the data structure and the label information. Different from LLE, our sLLE utilizes the label information of the input data, and we aim at deriving a low-dimensional space which exploits and preserves both the labels and the structure of such data (as shown in Figure 2). As a result, the subspace of sLLE can be applied to visualize and recognize facial expression data in terms of the inherent emotion information.

Fig. 2. Comparison of LLE and our proposed method. (a) Data instance x_i and its neighbors in the original high-dimensional space, (b) derivation of reconstruction weights for the neighbors of x_i using LLE, (c) visualization of projected data in a lower-dimensional space with LLE, (d) derivation of reconstruction weights for the neighbors of x_i while taking label information into consideration, and (e) visualization of projected data in a lower-dimensional space using our method. Note that different colors denote distinct categories.

3.1. Algorithm of sLLE

We now detail our proposed sLLE algorithm. Suppose that x_i and x_j are the d-dimensional feature vectors of facial expression images i and j. While we calculate the Euclidean distance between all pairs of instances, the resulting distance is scaled by a factor α depending on the corresponding pair of emotion labels. In other words, the distance between x_i and x_j in the original d-dimensional space is calculated as:

distance(x_i, x_j) = \alpha \| x_i - x_j \|_2.   (1)

In (1), we have α ∈ (0, 1) if x_i and x_j share the same emotion label, and α = 1 if they belong to different emotion categories. We note that the parameter α balances the supervision (i.e., the use of label information) for LLE, which preserves the local structure of the data while introducing additional separation between instances of different classes. In our work, we observe that α = 0.5 achieves satisfactory performance. Compared to the prior supervised version of LLE [8], which requires calculating the maximum distance between within-class instances and the use of a weighting factor for penalizing the resulting distance, our formulation is simpler and easier to implement. Later in our experiments, we will verify the effectiveness of our proposed algorithm.

Similar to LLE, we select the K nearest neighbors of each instance x_i using (1) for reconstruction purposes. To determine the optimal weights for this reconstruction procedure, we need to solve the following minimization problem:

\min_W \varepsilon(W) = \sum_{i=1}^{N} \| x_i - \sum_{j=1}^{K} w_{ij} x_{i_j} \|^2, \quad \text{s.t. } \sum_{j=1}^{K} w_{ij} = 1.   (2)

In (2), N is the total number of instances, and x_{i_j} (j = 1, ..., K) denotes the K nearest neighbors of x_i. Each entry w_{ij} in W ∈ R^{N×K} denotes the weight of x_{i_j} for reconstructing x_i. As a result, each row of W represents the associated weights for reconstructing x_i from its neighbors x_{i_j}. For normalization purposes, we require that \sum_{j=1}^{K} w_{ij} = 1. Now, for the i-th row w_i = [w_{i1}, ..., w_{iK}] of W, we need to solve the optimization problem below:

\min_{w_i} \| x_i - \sum_{j=1}^{K} w_{ij} x_{i_j} \|^2 = \| x_i \sum_{j=1}^{K} w_{ij} - \sum_{j=1}^{K} w_{ij} x_{i_j} \|^2 = \| \sum_{j=1}^{K} w_{ij} (x_i - x_{i_j}) \|^2 = w_i^T C w_i,   (3)

where C_{jk} = (x_i - x_{i_j})^T (x_i - x_{i_k}) and C ∈ R^{K×K}. Applying the technique of Lagrange multipliers, we formulate the following Lagrange function to be minimized:

L(w_i, \lambda) = w_i^T C w_i - \lambda (w_i^T \mathbf{1} - 1).   (4)

Taking the derivative of (4) with respect to w_i and setting it to zero, the optimal solution of w_i can be derived by solving:

C w_i = \lambda \mathbf{1}, \quad \text{and thus} \quad w_i = \lambda C^{-1} \mathbf{1}.   (5)

To satisfy \sum_{j=1}^{K} w_{ij} = 1, we normalize the resulting solution so that its entries sum to one (i.e., we rescale w_i by 1/\sum_j w_{ij}, which absorbs the multiplier \lambda). Once the solutions for all rows w_i are produced, the learning part of our sLLE algorithm is complete.
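The following NumPy sketch illustrates the weight-learning step above (Eqs. (1)-(5)). It is a minimal illustration under our own assumptions: the function name, the neighborhood size K, the brute-force distance computation, and the small ridge term added to C (a standard LLE safeguard when C is nearly singular) are not details given in the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def slle_weights(X, labels, K=10, alpha=0.5, reg=1e-3):
    """Weight-learning step of sLLE (Eqs. (1)-(5)), as a rough sketch.

    X      : (N, d) array of features (e.g., concatenated LBP histograms).
    labels : length-N emotion labels, used only to scale within-class distances.
    Returns (neighbors, W): (N, K) neighbor indices and (N, K) reconstruction weights.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N = X.shape[0]

    # Eq. (1): Euclidean distances, scaled by alpha for same-label pairs.
    D = cdist(X, X)
    D = np.where(labels[:, None] == labels[None, :], alpha * D, D)
    np.fill_diagonal(D, np.inf)                    # exclude self-matches

    neighbors = np.argsort(D, axis=1)[:, :K]
    W = np.zeros((N, K))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]                 # rows correspond to x_{i_j} - x_i
        C = Z @ Z.T                                # local Gram matrix C_jk of Eq. (3)
        # Small ridge term keeps C invertible when K > d or neighbors are
        # collinear (a standard LLE safeguard, not stated in the paper).
        C += reg * np.trace(C) * np.eye(K) / K
        w = np.linalg.solve(C, np.ones(K))         # C w = 1 (Eq. (5), up to lambda)
        W[i] = w / w.sum()                         # enforce sum_j w_ij = 1
    return neighbors, W
```

Each row i of the returned W holds the K reconstruction weights of x_i, matching W ∈ R^{N×K} above.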

3.2. Computing the Embedding Coordinates

Once W is derived to exploit both the labels and the structure of the input high-dimensional data, we now discuss how to compute the embedding coordinates in a low-dimensional space for dimension reduction and visualization. We define \tilde{W} ∈ R^{N×N} as an N-by-N matrix, in which N denotes the total number of instances. For the i-th row of \tilde{W}, each entry indicates the weight of the corresponding data point for reconstructing the instance x_i. In other words, we have \tilde{w}_{ij} = w_{ij} if x_j is among the K nearest neighbors of x_i, and \tilde{w}_{ij} = 0 otherwise. In order to convert each x_i into y_i in a lower l-dimensional feature space (as shown in Figure 2(e)), we need to solve the following problem:

\min_{y_i} \Phi(y_i) = \| y_i - \sum_{j=1}^{N} \tilde{w}_{ij} y_j \|^2, \quad \text{s.t. } \sum_i y_i = 0 \ \text{and} \ \frac{1}{N} \sum_i y_i y_i^T = I.   (6)

We can rewrite the above formulation as follows:

\min_{y_i} \Phi(y_i) = \| y_i - \sum_{j=1}^{N} \tilde{w}_{ij} y_j \|^2 = \sum_{i,j} m_{ij} \, y_i^T y_j = \mathrm{trace}(Y M Y^T),   (7)

where M = (I - \tilde{W})^T (I - \tilde{W}) and Y = [y_1, ..., y_N] is composed of the column vectors y_i (i = 1, ..., N). Note that the trace function returns the sum of the eigenvalues of the input matrix. By applying singular value decomposition (SVD), we calculate the eigenvectors of M associated with the smallest l eigenvalues, i.e., U = [u_1, u_2, ..., u_l] (see [14] for derivation details). Finally, we take Y = U^T as the projected data in the lower l-dimensional space.

Compared to standard LLE, our sLLE algorithm takes advantage of the label information when constructing the embedding subspace, and thus sLLE is more likely to cluster instances with the same label as neighbors. As depicted in Figure 2 and later verified by our experiments, the derived lower-dimensional subspace can be utilized for dimension reduction, visualization, and recognition purposes.
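A companion sketch for the embedding step (Eqs. (6)-(7)) is given below. Since M is symmetric positive semi-definite, its SVD coincides with its eigendecomposition; following standard LLE practice [14], the sketch discards the bottom (constant) eigenvector and rescales the remaining ones to meet the variance constraint in (6). The function name and these numerical details are our illustrative choices, not specifics from the paper.

```python
import numpy as np

def slle_embed(neighbors, W, l=2):
    """Embedding step of sLLE (Eqs. (6)-(7)), as a rough sketch.

    neighbors, W : output of slle_weights().
    Returns Y    : (N, l) coordinates, one embedded point per row
                   (i.e., the transpose of the paper's Y = U^T).
    """
    N, K = W.shape
    W_tilde = np.zeros((N, N))                     # \tilde{W}: w_ij for neighbors, 0 otherwise
    W_tilde[np.repeat(np.arange(N), K), neighbors.ravel()] = W.ravel()

    I = np.eye(N)
    M = (I - W_tilde).T @ (I - W_tilde)
    # M is symmetric PSD, so its eigendecomposition and SVD coincide.
    eigvals, eigvecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    # Drop the smallest (constant) eigenvector as in standard LLE [14]; the
    # sqrt(N) rescaling satisfies the constraint (1/N) sum_i y_i y_i^T = I.
    return eigvecs[:, 1:l + 1] * np.sqrt(N)
```

Combined with the earlier sketch, slle_embed(*slle_weights(X, labels), l=2) would produce two-dimensional coordinates of the kind visualized in Figures 3 and 6.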

4. EXPERIMENTS

4.1. JAFFE Database

We first conduct experiments on the JAFFE (Japanese Female Facial Expression) database [15]. This database contains seven types of facial expressions, with a total of 213 images posed by 10 Japanese females. The seven emotion categories are happiness, sadness, fear, anger, surprise, disgust, and neutral. For each facial expression image, we crop the central region of size 120 × 120 pixels and divide it into 64 patches of equal size, and a 59-dimensional LBP feature vector [6] is extracted from each patch to describe the textural information. Finally, we concatenate all 64 LBP feature vectors into a single 3776-dimensional feature vector.
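As a rough illustration of the feature pipeline just described, the sketch below uses scikit-image's uniform LBP: the 'nri_uniform' variant with 8 neighbors yields exactly 59 pattern labels, and an 8 × 8 grid of 15 × 15 patches over the 120 × 120 crop gives 64 × 59 = 3776 dimensions. The LBP radius, neighborhood size, and per-patch normalization are assumptions on our part; the paper specifies only the 59-dimensional histograms.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_feature(face_gray):
    """Sketch of the 3776-D descriptor: 64 patches x 59-bin uniform LBP histograms.

    face_gray : 2D grayscale face image; the central 120x120 region is used.
    P=8, R=1 are assumed parameters (not specified in the paper).
    """
    h, w = face_gray.shape
    top, left = (h - 120) // 2, (w - 120) // 2
    crop = face_gray[top:top + 120, left:left + 120]

    # 'nri_uniform' LBP with 8 neighbors yields 59 distinct pattern labels (0..58).
    codes = local_binary_pattern(crop, P=8, R=1, method='nri_uniform')

    feats = []
    for by in range(8):                            # 8 x 8 grid of 15 x 15 patches
        for bx in range(8):
            patch = codes[by * 15:(by + 1) * 15, bx * 15:(bx + 1) * 15]
            hist, _ = np.histogram(patch, bins=59, range=(0, 59))
            feats.append(hist / max(hist.sum(), 1))   # per-patch normalization (our choice)
    return np.concatenate(feats)                   # shape: (3776,)
```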

Fig. 3. Visualization of our dimension reduction results for the JAFFE database.

4.1.1. Visualization

In Figure 3, we plot the dimension reduction results of all images of the JAFFE database in the derived two-dimensional subspace. It can be seen that this subspace can be regarded as the V-A space [7] for describing emotion states, since the emotion labels are properly assigned to the associated quadrants. For example, the face images with the label happy are located in the first quadrant, which indicates emotion with high arousal and positive valence. On the other hand, those with the label disgust are assigned to the second quadrant, with high arousal but very negative valence. Although the emotion disgust is typically not defined in the V-A space (see Figure 1), our proposed method is still able to locate the corresponding images in the resulting subspace. It is worth noting that, unlike the standard V-A space, which requires users to manually locate the instances of interest in it, our approach is able to automatically determine the locations and distributions of facial expression images with different emotion labels. We show a pair of face images for each emotion in Figure 3, one close to the origin of the subspace and the other far away from it. Taking surprise as an example, the upper image (farther away from the subspace origin) indeed looks more surprised than the lower one, which is closer to the origin. This further verifies the use of our proposed method for connecting facial expression and emotion, and suggests that the derived two-dimensional space is equivalent to the V-A space.

4.1.2. Emotion recognition

To apply our proposed method to facial emotion recognition, we compare recognition results over different dimensions with those produced by Eigenface [16], Fisherface [16], kNN, SVM, standard LLE, and the recent approach of multi-task sparse learning (MTSL) [9]. For kNN and SVM, we take the extracted LBP features directly and do not perform any dimension reduction. For the other methods, performance at different dimensions is reported for completeness of comparison. For each emotion category, we randomly select 4/5 of the images for training and the remaining images for testing. When applying our sLLE for recognition, we treat the test input as unlabeled data (and thus α = 1 in (1)). For both LLE and our sLLE, once the training and test data are projected into the derived subspace, standard SVMs are applied for recognition. All SVMs in this paper are linear for fair comparison.
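The recognition protocol above can be pieced together as follows. This sketch reuses the hypothetical slle_weights and slle_embed functions from Section 3, embeds test points through a simple out-of-sample reconstruction step (with α = 1, i.e., no label scaling, since test labels are unknown), and trains a linear SVM on the projected coordinates. The out-of-sample mapping is our own illustrative choice; the paper does not detail how unseen images are projected.

```python
import numpy as np
from sklearn.svm import LinearSVC

def embed_test(X_train, Y_train, X_test, K=10, reg=1e-3):
    """Out-of-sample sketch: reconstruct each test point from its K nearest
    training neighbors (alpha = 1, no label scaling) and carry the same
    weights over to the embedded space."""
    Y_test = np.zeros((X_test.shape[0], Y_train.shape[1]))
    for i, x in enumerate(X_test):
        nb = np.argsort(np.linalg.norm(X_train - x, axis=1))[:K]
        Z = X_train[nb] - x
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(K) / K     # same safeguard as in training
        w = np.linalg.solve(C, np.ones(K))
        w /= w.sum()
        Y_test[i] = w @ Y_train[nb]                # map through the training embedding
    return Y_test

# Hypothetical usage (X_*: 3776-D LBP features, y_*: emotion labels):
#   nbrs, W  = slle_weights(X_train, y_train, K=10, alpha=0.5)
#   Y_train  = slle_embed(nbrs, W, l=2)
#   Y_test   = embed_test(X_train, Y_train, X_test)
#   clf      = LinearSVC().fit(Y_train, y_train)
#   accuracy = clf.score(Y_test, y_test)
```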

Table 1. Comparisons of emotion recognition results for JAFFE. (Acc: recognition accuracy; Dim: feature dimensionality.)

Method       Acc      Dim
Eigenface    25.0%    2
Eigenface    56.3%    10
Fisherface   31.25%   5
kNN          53.1%    3776
SVM          71.9%    3776
MTSL         40.0%    177
MTSL         46.0%    531
LLE          56.25%   10
Ours         62.5%    2
Ours         84.4%    10

Fig. 4. Example recognition errors for JAFFE. (a) Image of sad recognized as fear, (b) a training image of fear, (c) image of angry recognized as sad, and (d) a training image of sad.

Fig. 5. Example images of the CK+ database.

Table 1 compares the recognition results, in which we see that our method outperformed the others in terms of both recognition rate and feature dimensionality. We also show example recognition errors of our approach in Figure 4. From this figure, we see that the test images incorrectly recognized as wrong emotion categories actually look like the training images with those emotion labels. This supports our argument that directly using existing facial expression data with predefined emotion labels might not be preferable for emotion recognition, mainly due to the inability to deal with intra-class variations and the possible ambiguity between different emotions.

4.2. The Extended Cohn-Kanade Database

We next consider the extended Cohn-Kanade database (CK+) [17], which consists of facial expression videos of 100 subjects, with about 20 sequences for each expression category. We disregard the videos without emotion labels and select one peak frame from each sequence for our experiments. Figure 5 shows example images of the CK+ database.

Fig. 6. Visualization of our dimension reduction results on (a) seven and (b) five (without fear and contempt) emotion categories for CK+.

4.2.1. Visualization

The seven emotion categories of the CK+ database are anger, happiness, disgust, sadness, surprise, fear, and contempt. Since the emotions of fear and contempt are not originally defined in the traditional V-A space, we show the dimension reduction results both with and without these two categories in Figure 6. Comparing the results in Figure 6, we see that the dimension reduction results for the facial expression images remain consistent whether or not the emotions of fear and contempt are considered. Similar to our results for JAFFE, we observe that images of happiness are located in the first quadrant (i.e., positive valence and high arousal), while images of fear are assigned to the second quadrant, with moderately negative valence and high arousal. The above results again support the use of our approach for visualizing facial expression data in terms of their emotion information.

Table 2. Facial expression recognition results for the CK+ dataset. (Acc: recognition accuracy; Dim: feature dimensionality.)

Method       Acc      Dim
Eigenface    31.3%    2
Eigenface    59.4%    10
Fisherface   59.15%   5
kNN          18.8%    3776
SVM          71.8%    3776
MTSL         34.4%    118
MTSL         63.3%    472
LLE          53.1%    10
Ours         51.6%    2
Ours         81.4%    10

Fig. 7. Example recognition errors for CK+. (a) Image of disgust recognized as angry, (b) a training image of angry, (c) image of angry recognized as disgust, and (d) a training image of disgust.

4.2.2. Emotion recognition

We compare emotion recognition results on the CK+ database over different dimensions with those produced by Eigenface, Fisherface, kNN, SVM, LLE, and MTSL. For each emotion category, we randomly select 2/3 of the images for training and the remaining images for testing. As before, we treat the test data as unlabeled and thus set α = 1 in (1) when performing recognition. Table 2 compares the recognition results of the different approaches. From this table, it can be seen that our method again outperformed the others in terms of both recognition rates and feature dimensions. From the example recognition errors of our approach shown in Figure 7, we also observe that several test images which were incorrectly recognized as wrong emotion categories actually look like the training images with those emotion labels. In other words, these recognition errors were expected, and they further confirm the use of our sLLE for automatically visualizing different facial expression data in terms of their emotion labels. With the ability to handle both intra- and inter-class variations, our experimental results verify the use of our method for bridging the gap between facial expression analysis and emotion recognition.

5. CONCLUSION

We proposed a novel supervised locally linear embedding (sLLE) algorithm for visualizing and recognizing different emotion categories using facial expression images. By exploiting the emotion labels of face images, our sLLE is able to derive a subspace which preserves both the emotion labels and the local data structure of facial expression images. While a 2D version of our subspace has been shown to be equivalent to the valence-arousal space for emotion analysis, our experimental results also confirmed the effectiveness of our proposed method for solving emotion recognition problems. With reduced numbers of feature dimensions, our approach has been shown to achieve improved or comparable recognition performance relative to baseline and state-of-the-art emotion recognition methods.

Acknowledgement

This work is supported in part by the National Science Council of Taiwan via NSC 100-2221-E-001-018-MY2 and NSC 100-2218-E-224-017-MY3.

6. REFERENCES

[1] S. Ioannou et al., "Emotion recognition through facial expression analysis based on a neurofuzzy network," Neural Networks, 2005.
[2] P. Ekman and W. V. Friesen, Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues, Prentice-Hall, Englewood Cliffs, NJ, 1975.
[3] P. Ekman and W. V. Friesen, "Facial action coding system: Investigator's guide," Consulting Psychologists Press, 1978.
[4] Z. Zhang, "Feature-based facial expression recognition: Sensitivity analysis and experiments with a multi-layer perceptron," Int'l Journal of Pattern Recognition and Artificial Intelligence, 1999.
[5] R. A. Khan et al., "Human vision inspired framework for facial expressions recognition," IEEE ICIP, 2012.
[6] C. Shan, S. Gong, and P. W. McOwan, "Robust facial expression recognition using local binary patterns," IEEE ICIP, 2005.
[7] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, 1980.
[8] D. de Ridder et al., "Supervised locally linear embedding," ICANN, 2003.
[9] Z. Lin, Q. Liu, Y. Peng, L. Bo, J. Huang, and D. N. Metaxas, "Learning active facial patches for expression analysis," IEEE CVPR, 2012.
[10] K. Sun, J. Yu, Y. Huang, and X. Hu, "An improved valence-arousal emotion space for video affective content representation and recognition," IEEE ICME, 2009.
[11] S. W. Chew, S. Lucey, P. Lucey, S. Sridharan, and J. F. Cohn, "Improved facial expression recognition via uni-hyperplane classification," IEEE CVPR, 2012.
[12] R. Adolphs, "Recognizing emotion from facial expressions: Psychological and neurological mechanisms," Behavioral and Cognitive Neuroscience Reviews, 2002.
[13] Y.-H. Yang and H. H. Chen, "Machine recognition of music emotion: A review," ACM Transactions on Intelligent Systems and Technology (TIST), 2012.
[14] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, 2000.
[15] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, "The Japanese female facial expressions," 1998.
[16] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. fisherfaces: Recognition using class specific linear projection," IEEE PAMI, 1997.
[17] P. Lucey et al., "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," IEEE CVPR, 2010.