Automatic Temporal Segment Detection and Affect Recognition From Face and Body Display

Hatice Gunes, Member, IEEE, and Massimo Piccardi, Senior Member, IEEE
Abstract—Psychologists have long explored mechanisms with which humans recognize other humans' affective states from modalities, such as voice and face display. This exploration has led to the identification of the main mechanisms, including the important role played in the recognition process by the modalities' dynamics. Constrained by the human physiology, the temporal evolution of a modality appears to be well approximated by a sequence of temporal segments called onset, apex, and offset. Stemming from these findings, computer scientists, over the past 15 years, have proposed various methodologies to automate the recognition process. We note, however, two main limitations to date. The first is that much of the past research has focused on affect recognition from single modalities. The second is that even the few multimodal systems have not paid sufficient attention to the modalities' dynamics: The automatic determination of their temporal segments, their synchronization to the purpose of modality fusion, and their role in affect recognition are yet to be adequately explored. To address this issue, this paper focuses on affective face and body display, proposes a method to automatically detect their temporal segments or phases, explores whether the detection of the temporal phases can effectively support recognition of affective states, and recognizes affective states based on phase synchronization/alignment. The experimental results obtained show the following: 1) affective face and body displays are simultaneous but not strictly synchronous; 2) explicit detection of the temporal phases can improve the accuracy of affect recognition; 3) recognition from fused face and body modalities performs better than that from the face or the body modality alone; and 4) synchronized feature-level fusion achieves better performance than decision-level fusion.

Index Terms—Affect recognition, affective face and body display, phase synchronization, selective fusion, temporal segment detection.
I. INTRODUCTION
AFFECTIVE computing aims to equip computing devices with the means to interpret, understand, and respond to human emotions, moods, and, possibly, intentions without the user's conscious or intentional input of information, similar to the way humans rely on their senses to assess each other's affective state. Building such systems could make the user experience more efficient and amiable, customize experiences, and optimize computer-learning applications. Over the past 15 years, computer scientists have explored various methodologies to automate the process of recognition
Manuscript received April 13, 2007; revised December 30, 2007. This paper was recommended by Guest Editor M. Pantic. The authors are with the University of Technology Sydney, Broadway, NSW 2007, Australia. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCB.2008.927269
of affect and emotions.1 One major present limitation of affective computing is that much of the past research has focused on emotion recognition from one single sensorial source or modality, i.e., the face display [54]. However, as natural human–human interaction is multimodal, single sensory observations are often ambiguous, uncertain, and incomplete. While it is true that the face is the main display of a human's affective state, other sources can improve the recognition accuracy. However, relatively few works have focused on implementing emotion recognition systems using affective multimodal data [54].

The idea of combining multiple modalities for sensing affective states, in turn, has triggered another research question: what modalities to use and how to combine them. The initial interest was in fusing visual and audio data. The results were promising: Using multiple modalities improved the overall recognition accuracy, helping systems function in a more efficient and reliable way. Although a fundamental study by Ambady and Rosenthal suggested that the most significant channels for judging behavioral cues of humans appear to be the visual channels of facial expressions and body gestures [3], emotion recognition via body movements and gestures has only recently started attracting the attention of the computer science and human–computer interaction (HCI) communities [12], [34]. Following new findings in psychology, some researchers advocate that a reliable automatic affect recognition system should attempt to combine facial expressions and body gestures. Accordingly, recent approaches have been proposed for such sensorial sources (e.g., [5], [38], and [43]), with our system (hereinafter, the FABO system) being one of them [32].

In addition, with all these new research directions, a number of new challenges have arisen. Studies show that temporal dynamics plays an important role in interpreting emotional displays [62]. It is believed that information about the time course of a facial action may have psychological meaning that is relevant to the intensity, genuineness, and other aspects of the expresser's state. However, in spite of their usefulness, the complex spatial properties and dynamics of face and body gestures also pose a great challenge to affect recognition. Decoupling the spatial extent from the temporal dynamics significantly reduces the dimensionality of the problem, compared to simultaneously dealing with them.

1 Masters makes the following distinctions between affect, feeling, and emotion in [45]: Affect is an innately structured, noncognitive evaluative sensation that may or may not register in consciousness. Feeling is defined as affect made conscious, possessing an evaluative capacity that is not only physiologically based but also often psychologically oriented. Emotion is psychosocially constructed, dramatized feeling. In this paper, affect and emotion are used interchangeably, and externalizing the emotion is referred to as expression.
Other independent works confirm the effectiveness of this approach (e.g., [25]). More importantly, detection (and decoupling) of the temporal phases and/or dynamics can effectively support automated recognition of affective states.

The interest in the temporal dynamics of affective behavior is recent (e.g., [53] and [71]). Very few of the existing monomodal facial expression/action unit (AU) detectors can handle the temporal dynamics of facial movement. There are also only a limited number of studies on the temporal segmentation of natural body gestures (e.g., [76]). These are reviewed in Section II. In summary, none of the few multimodal affect systems combining face and body gestures for emotion recognition has attempted to model the temporal dynamics of the combined face and body affective behavior and their relationship, or tackled the phase synchronization/alignment issue for data fusion.

To address these issues, in this paper, we focus on bimodal affective face and body expressions, the automatic detection of their temporal segments or phases, and the recognition of affective states. In other words, this paper describes the design and creation of the FABO system. Please note that, in this paper (and in the FABO system), the word modality is used to distinctively represent the face and body expression semantics rather than sensors of different natures. In a stricter sense, face and body expressions can both be seen as components of the visual modality.

An earlier version of this paper may be found in [32]. There are three major deviations from the previous work that merit being highlighted: 1) the previous work could not detect the temporal segments of face and body gestures; 2) it used manually selected static frames of the neutral and apex temporal segments of face and body images for affect recognition; and 3) during bimodal fusion, face and body display was assumed to be synchronous. The work proposed in this paper addresses all these limitations. The temporal segmentation of affective face and body display is achieved in a truly automatic way, and a phase-synchronization scheme is introduced to deal with simultaneous yet asynchronous bimodal data. Experiments are also extended to ten subjects, 12 affective states, and 539 videos.

Overall, the work introduced in this paper offers the following six main contributions to the affect sensing/recognition research field:
1) use of multiple visual cues and/or novel affect modalities, i.e., face and body expressions;
2) use of the first publicly available database to date to combine affective face and body displays in a bimodal manner;
3) analysis of nonbasic affective states such as anxiety, boredom, uncertainty, puzzlement, and neutral/negative/positive surprise (see Fig. 4), in addition to the basic emotions of anger, disgust, fear, happiness, and sadness;
4) explicit detection of the temporal segments or phases of affective states/emotions (start/end of atomic emotions and subdivision into phases such as neutral, onset, apex, and offset) in order to decouple the temporal dynamics from the spatial extent and reduce the dimensionality of the problem, compared to simultaneously dealing with them;
5) exploration of the usefulness of the temporal segment/phase detection to the overall task of affect recognition with various experiments;
6) proposal of the fusion of information from the different modalities by phase synchronization and selective fusion, and demonstration of the superior performance of this approach by comparative experiments.

The FABO system presented in this paper uses "semispontaneous" affective bimodal data. A detailed discussion on this is provided in Section IV.

The rest of this paper is organized as follows. Section II describes the background of emotions/affective states and the temporal factors of face and body gestures, and summarizes the existing automatic approaches in the literature that attempt temporal segmentation of affective face and body display and multimodal affect recognition. Section III describes the details of the overall methodology chosen. Section IV presents the bimodal face and body gesture database created for the automatic analysis of human nonverbal affective behavior (i.e., the FABO database) and describes the techniques employed by the FABO system for face and body detection, feature extraction, and tracking. Section V introduces the temporal segmentation procedure of the FABO system by comparing direct or sequence-based detection versus phase- or frame-based detection. Section VI focuses on the monomodal and bimodal recognition of affective states from the face and body modalities. In particular, the bimodal recognition aims to analyze how bimodal face and body data can be fused at different levels by employing a synchronization scheme. Section VII presents the experimental results for both temporal segment detection and affect recognition, and Section VIII concludes the paper.

II. BACKGROUND AND RELATED WORK

This section provides the background on affective states and temporal segments of affective face and body gestures, and presents related works attempting to analyze affective facial and bodily movements either as separate monomodal systems or in a multimodal affective framework.

A. Background

The leading study of Ekman and Friesen [24] formed the basis of visual automatic facial expression recognition. Their studies suggested that anger, disgust, fear, happiness, sadness, and surprise are the six basic prototypical facial expressions universally recognized. However, many issues in the emotion research field still remain under discussion, and psychologists do not seem to have reached a consensus yet. While a significant number of researchers advocate the idea that there exists a small number of emotions that are basic, as they are hard-wired to our brain and are universally recognized (e.g., [24]), other researchers take a dimensional approach and view affective states not as independent of one another but as related to one another in a systematic manner. Russell proposed that each of the basic emotions is a bipolar entity as part of the same emotional continuum. The proposed poles are arousal (relaxed
versus aroused) and valence (pleasant versus unpleasant) [60]. The aforementioned dimensional approach has been found relevant for representing affective states in the HCI context and has been used by a number of researchers (e.g., [39]).

Human recognition of emotions from body movements and postures is still an unresolved area of research in psychology and nonverbal communication, and numerous works suggest various opinions in this area. Coulson presented experimental results on the attribution of six emotions (anger, disgust, fear, happiness, sadness, and surprise) by human observers to static body postures of computer-generated figures [19]. From his experiments, he concluded that human recognition of emotion from posture is comparable to recognition from the voice, and some postures are recognized as well as facial expressions. Burgoon et al. discuss the issue of emotion recognition from bodily cues and provide useful references in [11]. They claim that affective states are conveyed by a set of cues and focus on the identification of affective states such as positivity, anger, and tension in videos from body and kinesics cues. In general, body and hand gestures are much more varied than facial gestures: there is an unlimited vocabulary of body postures and gestures with combinations of movements of various body parts. Despite the effort of Laban and Ullmann in analyzing and annotating body movement [41], unlike facial expressions, communication of emotions by bodily movement and expressions is still a relatively unexplored and unresolved area in psychology, and further research is needed in order to obtain a better insight into how they contribute to the perception and recognition of the various affective states.

Ambady and Rosenthal reported that human judgment of behaviors jointly based on face and body proved to be 35% more accurate than that of behaviors based on the face alone [3]. In the light of such findings, automatic emotion/affect recognition does not aim to replace facial expressions by body expressions as input; instead, the aim is to exploit the body expressions for a better analysis and understanding of the overall affect conveyed. Furthermore, Van den Stock et al. investigated the influence of whole-body expressions of emotion on the recognition of facial and vocal expressions of emotion [73]. The recognition of facial expressions was strongly influenced by the bodily expression, and this effect was a function of the ambiguity of the facial expression. Overall, during multisensory perception, judgments for one modality seem to be influenced by a second modality, even when the latter modality provides no information about the judged property itself or only increases its ambiguity (i.e., cross-modal integration) [22], [30].

In [28], we provide a list of the face and body gestures and the correlation between the gestures and the emotion categories currently recognized by the FABO system. Fig. 4 also shows sample images of nonbasic facial expressions and their corresponding body gestures for neutral, negative surprise, positive surprise, boredom, uncertainty, anxiety, and puzzlement recognized by the FABO system.

The temporal factors of a facial movement are described by four phases: neutral, onset, apex, and offset [23]. The neutral phase is a plateau where there are no signs of muscular activation, and the face is relaxed. The onset of the action/movement is when the muscular contraction begins and increases in
intensity and the appearance of the face changes. The apex is a plateau where, usually, the intensity reaches a stable level and there are no more changes in facial appearance. The offset is the relaxation of the muscular action. A natural facial movement evolves over time in the following order: neutral → onset → apex → offset → neutral. Other combinations, such as multiple-apex facial actions, are also possible. Similarly, the temporal structure of a body gesture consists of (up to) five phases: preparation → (prestroke) hold → stroke → (poststroke) hold → retraction. The preparation moves the body part to the stroke's starting position, and the stroke is the most energetic part of the gesture. Holds are optional still phases, which can occur before and/or after the stroke. The retraction returns the body to a rest pose (e.g., arms hanging down, resting in the lap, or arms folded). Some gestures have a multiple stroke that includes small beat-like movements that follow the first stroke but seem to belong to the same gesture [76].

The most time-costly aspect of current manual annotation of facial or body movement is obtaining the onset–apex–offset time markers. This information is crucial for coordinating facial/body activity with simultaneous changes in physiology, voice, or speech [2].

B. Related Work

In this section, we first review existing methods that achieve monomodal affect recognition and/or temporal segmentation from face or body display. Second, we summarize existing systems that combine the face and body modalities in order to achieve multimodal affect recognition.

1) Monomodal Systems Analyzing Facial Expressions: As already mentioned in the introduction, most of the expression analyzers developed so far are monomodal and target human facial affect analysis by attempting to recognize a small set of prototypical emotional facial expressions, such as happiness and anger [55], [68]. There also exist a number of early efforts to detect nonbasic affective states such as attentiveness [36], fatigue [27], and pain [6] from face video. Most of the automatic facial expression analyzers have focused on the recognition of facial expressions from posed data. Only recently have works been reported on the automatic analysis of spontaneous facial expression data (e.g., [6], [16], [36], [70], [72], and [80]). Studies show that temporal dynamics plays an important role in interpreting emotional displays [62]. Accordingly, current research in automatic facial expression analysis has shifted its attention toward analyzing the spatio-temporal properties of facial features and modeling dynamic facial expressions or AUs (e.g., [65], [70], [79], and [81]) by implicitly incorporating the dynamics. Although the most common way of analyzing AUs has been that of independently and statically classifying each AU or certain AU combinations, it has recently been shown that exploiting the dynamics of AUs and the semantic relationships among them improves recognition [70].

A number of studies have detected the temporal segments of either facial expressions by using hidden Markov models (HMMs, e.g., [15] and [50]) or facial AUs by explicitly using other classification schemes, such as support vector machines
(SVMs) and AdaBoost (e.g., [53] and [71]). When detecting the temporal segments of an affective frame sequence, detection can be performed either by classifying each frame independently of the other frames or by considering the sequential nature of the frame sequence as in a time series. Correspondingly, in the following, we will refer to these two approaches as frame-based and sequence-based classifiers, respectively [15].

For the recognition of affective states, the HMM [59] and its variations [8], [52] are the most commonly used techniques. Such models can also be used for the detection of the temporal segments, provided that one can enforce some correspondence between the state values of the HMM and the temporal segments of the affective state. In the case of face display, the emissions (also known as observations or measurements) of the HMM are represented by a feature set computed from the facial features.

2) Monomodal Systems Analyzing Body Expressions: Compared to the facial expression literature, attempts at recognizing affective body movements are few, and efforts are mostly on the analysis of posed body expression data. Meservy et al. [46] focused on extracting body cues for detecting truthful (innocent) and deceptive (guilty) behaviors in the context of national security. They achieved a recognition accuracy of 71% for the two-class problem (i.e., guilty/innocent). Castellano et al. [26] presented an approach for the recognition of acted emotional states based on the analysis of body movement and gesture expressivity. They used nonpropositional movement qualities (e.g., amplitude, speed, and fluidity of movement) to infer emotions (90% for anger, 44% for joy, 62% for pleasure, and 48% for sadness). In [76], a method for the detection of the temporal phases in natural gesture was presented. For body movement, a finite-state machine was used to spot multiphase gestures against a rest state. In order to detect the gesture phases, candidate rest states were obtained and evaluated. Three variables were used to model the states: 1) distance from the rest image; 2) motion magnitude; and 3) duration. The overall recognition accuracy was not reported for this system. Other approaches have exploited the dynamics of the gestures without attempting to explicitly recognize their temporal phases or segments (e.g., [12], [26], and [46]). Overall, all of the aforementioned works focused on the body, without considering the facial expressions.

3) Multimodal Systems Analyzing Face and Body Expressions: The idea of combining face and body expressions for affect or emotion recognition is relatively new. Balomenos et al. [5] combined facial expressions and hand gestures for the recognition of six prototypical emotions. They fused the results from the two subsystems at the decision level using predefined weights. An accuracy of 85% was achieved for emotion recognition from facial features alone, and an overall recognition rate of 94.3% was achieved for emotion recognition from hand gestures. However, they do not report the recognition accuracy for the fused data, nor do they attempt to detect the temporal segments of face and body gestures. Kapoor and Picard focused on the problem of detecting the affective states of high interest, low interest, and refreshing in a child solving a puzzle [38]. They combined sensory information from the face video, the
posture sensor (a chair sensor), and the game being played in a probabilistic framework. The classification results obtained by Gaussian processes for the individual modalities showed that the posture channel seemed to classify best (82%), followed by features from the upper face (67%), the game (57%), and the lower face (53%). Fusion significantly outperformed classification using the individual modalities and resulted in 87% accuracy. However, Kapoor and Picard do not focus on gestures of the hands and other body parts, and do not detect the temporal segments. Karpouzis et al. [39] fused data from facial, bodily, and vocal cues using a recurrent network to detect emotions. They used data from four subjects and reported the following recognition accuracies for a four-class problem: 67% (visual), 73% (prosody), and 82% (with all modalities combined). The fusion was performed on a frame basis, meaning that the visual data values were repeated for every frame of the tune. However, [39] neither automatically detects the temporal segments of the affective face and body gestures nor provides a solution to the issue of synchronization of the multiple modalities. Hartmann et al. [33] defined a set of expressivity parameters for the generation of expressive gesturing for virtual agents. The studies conducted on the perception of expressivity showed that only a subset of the parameters and a subset of the expressions were well recognized by users. Therefore, further research is needed for the refinement of the proposed parameters (e.g., the interdependence of the expressivity parameters). Martin et al. presented a model of multimodal complex emotions involving gesture expressivity and blended facial expressions and proposed a representation scheme and a computational model for an agent [44]. However, their approach is at an exploratory stage and focuses on annotation and representation for synthesis purposes; it currently does not include feature analysis, recognition, and fusion of the multimodal data.

Compared to these previous works, in this paper, 1) we use a higher number of hand gestures and body postures; 2) in addition to the six basic emotion categories, we further analyze emotion categories such as positive and negative surprise and nonbasic affective states such as anxiety, boredom, uncertainty, and puzzlement; 3) we explicitly detect the temporal segments or phases of affective states/emotions and explore the usefulness of this step to the overall task of affect recognition; and 4) we propose an innovative synchronization and selective fusion approach that allows us to achieve higher accuracy in the recognition of affective states.

III. METHODOLOGY

This section describes the details of the overall methodology chosen for the FABO system and the motivations behind this choice. In a multimodal affect recognition system, the kind of feature processing and fusion strategy to choose depends on the nature of the modalities to be fused. There might exist an inherent asynchrony between such modalities that is of great importance to the aim of fusion and has to be considered in the methodology. For affect sensing or recognition, modality fusion aims to combine and integrate, if possible, all incoming
Fig. 1. Phase synchronization and selective fusion strategy employed in the FABO system.
monomodal events into a single representation of the affect expressed by the user. Temporal analysis of affective multimodal data relies on time proximity between features from the different modalities [18], [48]. Hence, depending on how closely coupled the modalities are in time, there are two typical levels of integration for affect data: intermediate level (also known as feature-level fusion or early fusion) and high level (also known as decision-level fusion or late fusion). Fusion at the feature level is assumed to be appropriate for closely coupled and synchronized modalities (e.g., speech and lip movements) [78]. Multimodal data fusion architectures, in particular those fusing at the feature level, assume a strict time synchrony between the modalities [18], [51]. Moreover, feature-level fusion tends not to generalize well if it involves modes that substantially differ in the time-scale characteristics of their features (e.g., speech and gesture input) [78]. Then, how is synchrony to be achieved and/or modeled when the modalities are temporally related but not strictly synchronous, as in the case of face and body? Fig. 1 provides an example of simultaneous
but nonsynchronous face and body displays over time from the FABO database. As can be seen in the figure, the face movement starts earlier than the body movement and has a longer onset stage and a longer apex stage (20 frames) compared to the body movement (17 frames). So, how can a system effectively synchronize the face and body modalities? The choice of the FABO system's methodology is driven by these questions and the need to answer them.

When input from two modalities is fused at the feature level, features from the different modalities should be made compatible, and a relationship between the different feature spaces should be explored, depending on how tightly coupled the modalities are. Various techniques have been exploited for such a purpose. For instance, dynamic time warping (DTW) has been used to find the optimal alignment between two time series if one time series may be nonlinearly "warped" by stretching or shrinking it along its time axis [61]. This warping between two time series can then be used to find corresponding regions between the two time series or to determine the similarity between them. Variations of the HMM have also been proposed for this task. The pair HMM was proposed to align two nonsynchronous training sequences, and an asynchronous version of the input/output HMM was proposed for audio-visual speech recognition [8]. Coupled HMMs and fused HMMs have been used for integrating tightly coupled time series, such as the audio and visual features of speech [52]. Bengio [8], for instance, presents an asynchronous HMM that can learn the joint probability of pairs of sequences of audio-visual speech data representing the same sequence of events (e.g., where sometimes the lips start to move before any sound is heard).

We argue that, when applied to multimodal affect data from face and body, synchronization can be obtained at the feature level through phase synchronization. Phases exist in the time series of the feature vectors of face and body due to their semantics and anatomical constraints. Moreover, due to the particular nature of our data, we know a priori that the phases are finite and evolve in a specific order: neutral–onset–apex–offset–neutral. Pikovsky shows that traditional techniques ignoring the phases of signals are less sensitive in the detection of the systems' interrelation [57]. Therefore, we focus on the phases or so-called temporal segments in order to interrelate the face and body modalities. For feature-level fusion, we detect the temporal segment of each frame, and we fuse frames from the two modalities if they belong to the same temporal segment.

The question that follows is "Are frames from the various temporal segments equally useful to support recognition of affective states?" We further argue that frames from the "apex" segment should be chosen, since the spatial extent of the face and body features is at its maximum and affective states can be more effectively discriminated. Moreover, during the apex and neutral phases, muscular actions are at a plateau, and the values of the face and body features can be modeled as being drawn from a constant underlying probability distribution: This makes it possible to apply robust methods based on multiple frames to increase the recognition accuracy of the affective state. Our methodology is based on the aforementioned assumptions, and we set out to prove our argument with experimental results.
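As a concrete illustration of the phase-synchronization idea described above, the following minimal Python sketch pairs apex frames from the two modalities and concatenates their feature vectors. It assumes that per-frame feature vectors and per-frame temporal-segment labels are already available; all function and variable names are illustrative and are not those of the FABO implementation.

```python
import numpy as np

def select_apex_frames(features, labels):
    """Return the feature vectors of the frames labeled "apex", in order."""
    return [f for f, lab in zip(features, labels) if lab == "apex"]

def phase_synchronized_fusion(face_feats, face_labels, body_feats, body_labels):
    """Pair apex frames from the two modalities in order of appearance and
    concatenate their feature vectors (feature-level fusion).

    Because the two displays are simultaneous but not synchronous, the apex
    runs usually differ in length; only the first min(n_face, n_body) pairs
    are formed here.
    """
    face_apex = select_apex_frames(face_feats, face_labels)
    body_apex = select_apex_frames(body_feats, body_labels)
    n_pairs = min(len(face_apex), len(body_apex))
    return [np.concatenate([face_apex[i], body_apex[i]]) for i in range(n_pairs)]
```

The fused vectors produced in this way would then be passed to a bimodal classifier; decision-level fusion would instead classify the two apex streams separately and combine the resulting decisions or posteriors.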
Fig. 2. Functional blocks of the proposed FABO system.
The proposed approach is depicted in Fig. 2 and can be described in three steps.
1) Each frame from the face and body modalities is first classified by temporal segment classifiers into a temporal segment.
2) Feature vectors from frames classified as apex for both the face and body modalities, i.e., f_F^a and f_B^a, are selected and used for classification.
3) • If feature-level fusion is selected, feature vectors f_F^a and f_B^a are paired in appearance order from their respective sequences and combined into one, i.e., f_FB^a. f_FB^a is fed into a classifier for bimodal affect recognition, providing the classified affective state C_FB. This step provides the so-called phase synchronization.
   • If decision-level fusion is selected, monomodal classification from both modalities is first performed, providing classes C_F and C_B as decisions or posteriors. A decision-level fusion criterion is then selected from a number of available criteria to provide the eventual bimodal affect recognition C_FB.

For classifying the individual frames, several different classifiers—both frame-based and sequence-based (see Section II)—have been employed, and the results were compared. As a sequence-based classifier, we have used the HMM [59]. As frame-based classifiers, we have used the several
different algorithms provided in the Weka tool, including SVM, AdaBoost, and C4.5 [77]. For the temporal segment detection in step 1, every frame in the emotional sequence is classified into one of the temporal segments. For affect recognition in steps 2 and 3, frame-based classifiers are applied only to the apex frames, whereas sequence-based classifiers are still applied to whole sequences. The details of these steps, the data used, and the experiments conducted are explained in the succeeding sections.

IV. DATA AND FEATURE SETS

This section provides the details of the bimodal face and body gesture database (i.e., the FABO database), together with the feature sets and extraction algorithms employed by the FABO system for the face and body modalities.

A. FABO Database

The most challenging problem faced in designing the system was the lack of a benchmark database consisting of affective bimodal face and body recordings. All existing databases introduced for affect recognition were explored and analyzed. (The details of this procedure were presented in [30].) However, there was no readily available database combining affective face and body information in a bimodal manner. Therefore, the
Fig. 3. Sample images from the FABO database separately recorded by the (left) face and (right) body cameras.
first step for the FABO system was to create a bimodal face and body gesture database for the automatic analysis of human nonverbal affective behavior (explained and presented in detail in [29]). The FABO database consists of recordings of subjects simultaneously performing face and body gestures, captured with two cameras: the face camera and the body camera. Fig. 3 shows sample images from the FABO database that were separately recorded by the face (left) and body (right) cameras. The recordings were obtained by using a scenario approach, where subjects were provided with situation vignettes or short scenarios describing an emotion-eliciting situation. They were instructed to imagine these situations and act as if they were in such a situation. More specifically, although the FABO database was created in a laboratory setting, the subjects were not instructed, on an emotion-by-emotion basis, on how to move their facial features or how exactly to display a specific facial expression.

A recent discussion in the affective computing field concerns the creation and use of posed versus spontaneous databases. Affect data may belong to one of the following categories: spontaneous (i.e., occurring in real-life settings, such as interviews or interactions between humans or between humans and machines), induced (i.e., occurring in a controlled setting and designed to create an affective activation or response, such as watching movies), or posed (i.e., produced by the subject upon request) [9]. The FABO system presented in this paper uses posed affective bimodal data. According to Banziger and Scherer [9], posed affect data offer the possibility of recording a high number of variable expressions with good quality and standard content for a number of individuals. With spontaneous data, it is only possible to record a few emotional reactions, and the comparability of the emotional reactions is reduced. Posed data also offer the possibility of using the analysis of affect data to test competing theories and yield important insights for future research [9]. In the light of this discussion on posed versus spontaneous data, the FABO database can be described as a "semispontaneous" affect database. (The readers are referred to [29] for details on data acquisition, gestures performed, directions to subjects, and others.)
Fig. 4 shows sample images of nonbasic facial expressions and their corresponding body gestures for neutral, negative surprise, positive surprise, boredom, uncertainty, anxiety, and puzzlement. For the annotation of the bimodal data (each face and body video separately) into affect or emotion categories, we developed a survey using labeling schemes for affective content (e.g., happiness) and signs (e.g., how contracted the body is) and asked six independent human observers to complete it. For the temporal segment annotation, one human coder (i.e., one of the authors) repeatedly viewed each face and body video in slowed and stopped motion to determine when (in which frame) the neutral–onset–apex–offset–neutral phases start and end [31]. Note that, more recently, other data sets containing posed, induced, or naturalistic data have also been introduced (e.g., [9] and [20]).

B. Face Feature Extraction

There exists an extensive literature on human face detection, feature extraction, and expression recognition. In the context of affect recognition, we only briefly summarize the trends in existing facial expression recognition approaches prior to describing the feature extraction employed in the FABO system. The existing facial expression/AU recognition approaches use appearance-based (e.g., texture) or geometric feature-based (e.g., points or shapes) methods and employ statistical or ensemble learning techniques for recognition [54]. Examples of the geometric feature-based methods are auxiliary particle filtering (e.g., [53] and [71]) and piecewise Bezier volume deformation tracking, which uses an explicit 3-D wireframe face model to track the geometric facial features defined on the model (e.g., [63] and [64]). Examples of the appearance-based methods are Gabor-wavelet-based methods (e.g., [7]). Active appearance models instead exploit both geometric- and appearance-based features for facial expression recognition (e.g., [69]). In our system, we decided to use a combination of appearance features (e.g., wrinkles) and geometric features (e.g., feature points) for face feature extraction.

For face feature extraction, only the videos obtained from the face camera are used. The steps employed can be summarized as follows: modeling of the face, frame-by-frame face detection, facial feature extraction (extraction of the eye, eyebrow, nose, mouth, and other facial regions), motion analysis, and comparison between the reference frame (neutral expression frame) and other consecutive frames. Please note that these steps are only briefly summarized in this paper due to space limitations but are described in detail in [28].

1) Face Model: The model defined in the FABO system is a frontal-view face model that consists of feature bounding rectangles called regions of interest (ROIs). The FABO system first automatically locates the eight facial features, i.e., eyes, eyebrows, lips, nose, chin, and forehead, in the neutral frame. Then, it computes the bounding rectangle for each of these features. The permanent feature rectangles are defined as
Fig. 4. Sample images of (a1–h1) nonbasic facial expressions and (a2–h2) their corresponding body gestures. (a) Neutral. (b) Negative surprise. (c) Positive surprise. (d) Boredom. (e) Uncertainty. (f–g) Anxiety. (h) Puzzlement.
follows: the forehead, upper and lower eyebrows, upper and lower eyes, nose, upper right lip, lower right lip, upper left lip, lower left lip, and chin regions. Five bounding rectangles are also defined for transient features located between permanent features, i.e., the region(s) between the eyes and the eyebrows, the corner of the right eye, the corner of the left eye, the right cheek, and the left cheek. The steps taken during the initialization of the face model are further described in the succeeding sections.

2) Face Detection: For face detection, the current state of the art is based on the robust and well-known method proposed by Viola and Jones [74] and improved by Lienhart and Maydt using a set of rotated Haar-like features [42]. The FABO system adopted the fast and robust stump-based 20 × 20 Gentle AdaBoost (GAB) frontal face detector to detect the face region and a similar approach for training a separate classifier for each facial feature (eyes, lips, etc.). This approach can handle in-plane rotation and tolerate variations in lighting. In the second stage of the face detection, the boundaries of the obtained face region are used to crop the initial color image from the face camera. Hence, only the face region remains as the object of interest. In the third stage, the cropped image is used for face region segmentation using color cues. The aim of the skin color segmentation is to find regions in the facial image that represent the nonskin features of a face (i.e., eyes and lips) and to remove any background pixels that were returned by the GAB-based detection as part of the face region.
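For illustration, the sketch below shows how such a detect–crop–segment pipeline could be realized in Python with OpenCV's stock Haar cascade detector. The cascade file, skin-color thresholds, and all names are assumptions made for the sketch; the FABO system uses its own trained GAB classifiers rather than this exact code.

```python
import cv2

# Illustrative cascade; the FABO system trains its own GAB-based detectors.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_alt.xml")

def detect_and_segment_face(bgr_frame):
    """Detect the face region, crop it, and suppress non-skin pixels."""
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]  # first detection; a real system would pick the best one
    face_roi = bgr_frame[y:y + h, x:x + w]
    # Simple HSV skin segmentation to remove background pixels returned
    # inside the detected rectangle (thresholds are illustrative assumptions).
    hsv = cv2.cvtColor(face_roi, cv2.COLOR_BGR2HSV)
    skin_mask = cv2.inRange(hsv, (0, 40, 60), (25, 180, 255))
    return cv2.bitwise_and(face_roi, face_roi, mask=skin_mask)
```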
3) Face Feature Extraction: Extracting human facial features, such as the mouth, eyes, and nose, requires that Haar classifier cascades first be trained. In the FABO system, GAB and Haar features are used to train the classifiers. To train the classifiers, two sets of images are needed. The negative set of images does not contain the object that is going to be extracted (in this case, a facial feature). For training the classifiers, more than 5000 negative examples, consisting of dog, cat, nature, hand, pedestrian, and car images, were used. The positive images contain one or more instances of the object. In order to produce the most robust facial feature extractor possible, the positive images need to be representative of the variance between different people, including race, gender, and age. Thus, the feature extractors were trained and tested on the Facial Recognition Technology (FERET) database, a popular publicly available image set [1]. The database we used consists of 14 051 8-bit grayscale images of more than 1000 people, with views ranging from frontal to left and right profiles, obtained under various illumination conditions, with neutral/alternative facial expressions, and taken at pan angles ranging from 0° to 45° from the frontal view.

Four separate classifiers were trained for the eyebrows, eyes, nose, and mouth. For the training, we used 1500 positive and 5942 negative samples for the eye and eyebrow classifiers, 1891 positive and 5018 negative samples for the nose classifier, and 2366 positive and 5942 negative samples for the mouth classifier. Once the classifiers were trained, they were
TABLE I EVALUATING THE PERFORMANCE OF THE TRAINED CLASSIFIERS
tested on a separate set of images from a different set of subjects in the FERET database. Table I presents the details of the evaluation of the trained classifiers. The eye, eyebrow, and nose extractors, in general, are able to extract the regions of interest even when the face is rotated, the eyes are closed, and there are variations in size, scale, and illumination in the images. Some of the falsely extracted regions are located outside the borders of the face region (e.g., ears), whereas some are located within the borders of the face region. However, it is possible to remove both types of false alarms during the actual feature extraction process by applying the extractors only on a particular search region. The search region for each feature is selected/restricted using anatomical constraints. For instance, the eyes reside in the upper part of the face region; therefore, the search space/region for the eyes is limited to the upper part of the face.

For the mouth classifier, the false positive rate obtained was very high (see Table I). This can be explained by the fact that the mouth is one of the most difficult facial attributes to analyze, as its shape is very versatile and the tongue and teeth can appear and disappear, disturbing its appearance. However, in our system, we improve this performance during the extraction process by using two additional criteria: 1) search region selection (i.e., using the anatomical constraint that the mouth resides in the part of the face region constrained by the eye and nose regions, etc.) and 2) a color-based constraint. When false mouth candidates are extracted, a simple algorithm is applied: The biggest extracted mouth candidate in the specified part of the face is chosen. To better localize the lips, color information is combined with the output of the GAB extractor. Applying the aforementioned steps and constraints boosts the accuracy of the mouth extractor by an order of magnitude.

Overall, the FABO system uses face feature extractors trained on whole images; however, during the extraction process, it separately applies these extractors on the search regions for each facial feature. The ROI rectangles for the extracted facial features are later used together with the rotated and scaled expressive frame to calculate the following: 1) the general change within the feature (e.g., how the height/area of the feature increased or decreased); 2) texture/motion; 3) optical flow; and 4) the edge density change in this region with respect to the neutral frame. Further details on these procedures are provided in [28]. A summary of the features extracted from the face modality is presented in Table II, and sample images are shown in Fig. 5.
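The sketch below illustrates, under stated assumptions, how two of these per-ROI measurements (mean optical-flow magnitude and edge-density change relative to the neutral frame) could be computed; the parameter values and function names are illustrative and cover only a small part of the feature set summarized in Table II.

```python
import cv2
import numpy as np

def roi_change_features(neutral_gray, current_gray, roi):
    """Illustrative per-ROI features relative to the neutral frame.
    Both inputs are assumed to be 8-bit grayscale frames already aligned
    (rotated/scaled) to the neutral frame; roi is (x, y, w, h)."""
    x, y, w, h = roi
    neutral_patch = neutral_gray[y:y + h, x:x + w]
    current_patch = current_gray[y:y + h, x:x + w]

    # Dense optical flow between the neutral and current patches (Farneback).
    flow = cv2.calcOpticalFlowFarneback(neutral_patch, current_patch, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flow_magnitude = float(np.mean(np.linalg.norm(flow, axis=2)))

    # Edge-density change, useful for transient features such as wrinkles.
    def edge_density(patch):
        edges = cv2.Canny(patch, 50, 150)
        return float(np.count_nonzero(edges)) / edges.size

    return {
        "flow_magnitude": flow_magnitude,
        "edge_density_change": edge_density(current_patch) - edge_density(neutral_patch),
    }
```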
TABLE II SUMMARY OF THE FEATURES EXTRACTED FROM THE (TOP) FACE AND (BOTTOM) BODY MODALITIES. THE NUMBER OF FEATURES EXTRACTED FOR THE CORRESPONDING REGION IS SHOWN IN PARENTHESES
Fig. 5. Example of the face feature extraction employed in the FABO system.
C. Body Feature Extraction and Tracking There exists an extensive literature for body feature extraction, tracking, and gesture recognition from video sequences. In the context of affect recognition, we only concisely summarize the trends in existing gesture recognition approaches before presenting the description of the body feature extraction and tracking employed in the FABO system. The existing approaches for hand or body gesture recognition and analysis of human motion, in general, can be classified into three major categories: 1) model-based (i.e., modeling the body parts or recovering 3-D configuration of articulated body parts); 2) appearance-based (i.e., based on 2-D information such as color/grayscale images or body silhouettes and edges); and 3) motion-based (i.e., directly using the motion information without any structural information about the physical body) [25]. In the aforementioned approaches, DTW and HMM are typically used to handle the temporal properties of the gesture(s). Color as a distinct feature has widely been used for representation and tracking of multiple objects in a scene. Several tracking methods have been used in the literature; among them are the Kalman filter, condensation tracking, mean shift tracking, and CamShift tracking [21]. Dreuw et al. [21], for instance, present a dynamic programming framework with the possibility of integrating multiple scoring functions (e.g., eigenfaces) or arbitrary objects, and the possibility of tracking multiple objects at the same time. Overall, most of the existing hand/body gesture recognizers work well in relatively constrained environments with relatively small changes in terms of illumination, background, and occlusions [54]. There are also more recent works using different (or a combination of) tracking schemes, depending on what they aim to track and recognize. A sample system is that of Valstar et al. [72], which uses a cylindrical head tracker to track the head motion, particle filtering with factorized likelihoods to track fiducial points on the face, and auxiliary particle filtering to track shoulder motion. Various techniques for extracting and tracking specific features such
Fig. 6. Example of the shoulder extraction procedure. (First row) Neutral frame and expressive frame. (Second row) Shoulder regions found and marked on the neutral frame, estimating the movement within the shoulder regions using optical flow.
as the shoulders have also been proposed in the literature (e.g., [49]). Our work focuses on communicative affective gestures generated with one or two hands, the head, the shoulders, or combinations of these. The feature extraction, analysis, and tracking procedures presented in this section are applied only to the videos obtained from the body camera. The main steps can be summarized as follows: The static background model of the observed space is created before detection starts, the head region is extracted from the images using cascaded classifiers, skin-colored regions are extracted from the images using skin-region segmentation and connected-component labeling, and tracking of each ROI is obtained with the CamShift technique [10]. We chose the CamShift tracker because it is one of the best single-cue trackers, and it is quite efficient and robust when the object color remains the same and there is no similar color in the background [13]. Compared to typical particle filters, its computational requirements are less intense [13].

The body model employed in the FABO system is a combination of silhouette-based and color-based body models used to determine the image location of the main body parts while the person is in a sitting posture. The height of the bounding box of the silhouette is taken as the height of the body model. Then, fixed vertical scales are used to determine the initial approximate location (bounding box) of the individual body parts. The height of the initial bounding box for the head is set to two fifths of the body silhouette height, and the height of the initial bounding box for the torso is set to three fifths of the body silhouette height. The widths of the bounding boxes of the head and torso are calculated by finding the median width (horizontal line width) of the foreground inside their initial bounding boxes. In addition to finding sizes and locations, the principal axis of the foreground pixels inside the initial bounding boxes is computed in order to estimate the pose of the body parts. The torso is located first, followed by the head, shoulders (see Fig. 6), and hands.

For each video frame, the raw image is converted to a color probability distribution via the color histogram model of the
skin region being tracked. CamShift calculates the centroid of the color probability distribution, recenters the window, and then calculates the area for the next window size. If the region cannot be tracked, the algorithm is reinitialized. In the FABO system, when two hands merge or when the hands touch the facial region, due to their skin colors being similar, they are segmented as a single foreground region by the CamShift algorithm. The merged region is tracked until it splits back into its constituent objects (face and hands, or hand and hand). When the merged region splits, the localization procedure is run again to obtain and reinitialize the current location of each region (see Fig. 7). An important point to note is that, when the hands move closer to the head region, the CamShift algorithm operates under a special condition called "tracking in the presence of distractions" [10]. As reported by Bradski [10], if the "distractor" does not intersect with the ROI, much of the CamShift's search window, distribution's centroid, length, and width are perturbed very little. However, the roll angle is more strongly affected since even a small intersection of a distractor in the CamShift's search window can change the orientation of the flesh pixels.
Fig. 7. Example of the CamShift tracking of the head and hands employed in the FABO system. Hands get very close and, at some stage, occlude each other. The skin region of the hands merges into one blob. Tracking proceeds without problems. As soon as the hands separate from each other, the tracker is reinitialized (frame 45), and they are tracked as separate regions again (frame 46). The roll variable for the head region is affected due to the arbitrary orientation of the hand as it gets closer to the search window of the head region.
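A minimal OpenCV-based sketch of this color-histogram back-projection and CamShift loop is given below for a single skin region. The histogram configuration, termination criteria, and all names are assumptions made for illustration; the merge/split reinitialization logic described above is omitted.

```python
import cv2

def track_skin_region(frames, init_window):
    """Minimal CamShift tracking of one skin-colored region.
    `frames` is an iterable of BGR images; `init_window` is (x, y, w, h)
    of the region in the first frame (e.g., a detected hand or head)."""
    frames = iter(frames)
    first = next(frames)
    x, y, w, h = init_window
    roi_hsv = cv2.cvtColor(first[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    # Hue histogram of the region being tracked (its color model).
    hist = cv2.calcHist([roi_hsv], [0], None, [32], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    window = init_window
    boxes = []
    for frame in frames:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # Back-project the color model to get a color probability image.
        prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        rotated_box, window = cv2.CamShift(prob, window, term_crit)
        boxes.append(rotated_box)  # center, size, and roll angle of the region
    return boxes
```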
Overall, the FABO system extracts the ROI rectangles for the extracted body features, which it later uses to calculate the following: 1) general change within the feature (e.g., how the centroid, rotation, length, width, and area of the feature increased or decreased); 2) texture/motion; and 3) optical flow in this region with respect to a neutral frame. Further details on these procedures can be found in [28]. A summary of the features extracted from the body modality is provided in Table II, and sample images are shown in Figs. 6 and 7. Once the features from the face and body modalities are extracted, they are used for the later stages of temporal segment detection and affect recognition, as described in the succeeding two sections.

V. TEMPORAL SEGMENT DETECTION
This section focuses on the automatic detection of the face and body temporal segments. Face and body videos from ten subjects in the FABO database were processed, and a total of 152 features for the face modality and 170 features for the body modality were extracted. These feature vectors were used for the detection of the face and body temporal segments with various classifiers—both frame-based and sequence-based—and the performances were compared. While sequence-based classifiers have the potential to achieve more accurate classification by exploiting the temporal aspect of the sequence, they are more challenging to train due to the greater number of parameters that they need to learn. Therefore, a performance comparison is provided in the following.

Detection of the temporal segments can be performed before, after, or simultaneously with affect recognition. Other works have chosen to first detect the expressions and apply temporal analysis afterward (e.g., [71]). As previously explained, the FABO system, instead, employs temporal segment detection as the first stage in its affect recognition approach. The goal is that of segmenting the "neutral" and "apex" frames from the rest of the sequence in order to base affect recognition only on them. Further reasons for this choice are given here.
1) Since features are at their maximum spatial extent during the apex phase, discrimination of affective states proves to be more effective and robust to noise and feature extraction errors.
2) Decoupling the spatial extent from the temporal dynamics significantly reduces the dimensionality of the problem compared to simultaneously dealing with them. Other independent works confirm the effectiveness of this approach (e.g., [25]).
3) We argue that the probability distributions of the feature vectors during the apex segment can be approximated as constant. The affective state during apex segments can thus be modeled by simpler, more robust static-state estimation approaches, instead of dynamic-state approaches such as the HMM [4], [14] (a sketch illustrating this point follows the list).
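To make the third point concrete, the short sketch below aggregates the apex-frame feature vectors of a sequence into a single robust summary vector that a static classifier could consume. It is an illustration of the constant-distribution argument under stated assumptions (illustrative names, per-frame features and segment labels already available), not the exact recognition procedure of the FABO system described in Section VI.

```python
import numpy as np

def aggregate_apex_features(frame_features, segment_labels):
    """Summarize an emotional sequence by one robust feature vector computed
    from its apex frames, which are treated as draws from an approximately
    constant distribution. Falls back to all frames if no apex was detected."""
    apex = [f for f, lab in zip(frame_features, segment_labels) if lab == "apex"]
    if not apex:
        apex = list(frame_features)
    # The per-dimension median is robust to occasional tracking or
    # feature-extraction errors in individual frames.
    return np.median(np.vstack(apex), axis=0)
```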
A. Sequence-Based Detection
For the sequence-based detection of the face and body temporal segments, we use an HMM, modeling the temporal segment as the hidden state to be estimated. An HMM is a statistical model λ(A, B, π) fully described by three sets of quantities: 1) the state transition probabilities A; 2) the emission (or observation) probabilities B; and 3) the probabilities of the initial states π. The states can only assume a discrete set of values, say N. Therefore, A and π can be represented by an N × N matrix and an N-dimensional vector, respectively. Observations, instead, are often drawn from continuous variables, so the emission probabilities need to be modeled by probability density functions. The most common approach is to model the emission probabilities with mixtures of Gaussians [56]; in such a case, B is fully described by the weights, means, and covariances of all the Gaussian components. Given one or more sequences of observations, the Baum–Welch algorithm can be used to learn a corresponding HMM with maximum likelihood. This algorithm is an expectation–maximization (EM) algorithm that learns A, B, and π simultaneously and is guaranteed to converge to a local optimum in the parameter space [59]. Other works in the affect recognition literature mainly use HMMs for classifying sequences into one of the emotion categories. Here, instead, we aim to use the HMM for decoding the temporal segments; therefore, our problem is different. For decoding, we use only a single HMM for any incoming sequence, which is consistent with our decision to perform temporal segment detection prior to affect recognition. Accordingly, training was performed with frame sequences from all the affective states. Moreover, since, in this application, the state can only evolve toward successive states, we used a left-to-right HMM with the same number of states as that of the possible temporal segments. As we trained the HMM without the use of the states' ground truth, the mapping of the five states of the HMM onto the ordered temporal segments of the affective sequence was not enforced during training; however, the learned HMM appears to embed such a mapping well.
As both the face and body feature vectors are high dimensional (152 and 170 features, respectively), we applied a dimensionality reduction technique, principal component analysis (PCA), to lower their dimensions prior to the HMM-based temporal segment detection. PCA aligns the data along the directions of greatest variance [75]. The first part of the PCA procedure is to compute the eigenspace: the eigenvectors of the covariance matrix are computed from an initial set of feature vectors, after first subtracting the mean vector. The output is a set of eigenvectors and their corresponding eigenvalues. Only the eigenvectors corresponding to the largest eigenvalues, which capture the greatest variance within the data set, are kept to define the eigenspace. We first experimented with how many eigenvectors to retain in order to capture 95% of the variance within the data set. For both the face and the body, 95% of the variance appeared to be captured by the single eigenvector corresponding to the largest eigenvalue. In order to retain as much variance as possible, we decided to reduce the dimensionality of the face and body feature vectors from 152 and 170, respectively, down to 4 by keeping the first four eigenvectors. Alternatively, feature dimension reduction could also be achieved simultaneously with the estimation of the HMM parameters (e.g., [67]). The HMM used for decoding the temporal segments, λT, can be described as follows.
• Number of states: N = 5 (S1: neutral, S2: onset, S3: apex, S4: offset, S5: neutral).
• Initial state probabilities: Π = [1 0 0 0 0], as all sequences start from the "neutral" segment.
• Initial state transition probability matrix (5 × 5, row-major order): A = [0.9 0.1 0 0 0; 0 0.8 0.2 0 0; 0 0 0.8 0.2 0; 0 0 0 0.9 0.1; 0 0 0 0 1]. Due to the nature of the data, this is a "left-to-right" HMM, in which the states are expected to only move forward and never back.
• Input data dimension (i.e., number of features): D = 4.
• Number of Gaussians per state: M = 2.
• Covariance type: spherical.
• Initial observation means: μ1 = μ2 = μ3 = μ4 = μ5 = [1 15 16 30].
• Initial observation variances: σ1 = σ2 = σ3 = σ4 = σ5 = [0.01 5 0.01 5].
• Weights of the Gaussian components: randomly generated from a uniform distribution.
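A minimal sketch of this stage, under stated assumptions: hmmlearn and scikit-learn stand in for the authors' implementation, random arrays replace the real 152-D face features, the sequence lengths are placeholders, and a single Gaussian per state is used instead of the two-Gaussian mixtures listed above.

```python
import numpy as np
from sklearn.decomposition import PCA
from hmmlearn import hmm

# Placeholder data: face-feature sequences (152-D per frame) with known lengths.
rng = np.random.default_rng(0)
lengths = [60, 75, 50]                                  # hypothetical sequence lengths
X_raw = rng.normal(size=(sum(lengths), 152))            # stand-in for extracted features

# PCA: keep the first four principal components, as in the FABO setting.
X = PCA(n_components=4).fit_transform(X_raw)

# Left-to-right HMM over the five temporal segments (neutral-onset-apex-offset-neutral).
model = hmm.GaussianHMM(n_components=5, covariance_type="spherical",
                        n_iter=50, init_params="mc", params="stmc")
model.startprob_ = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
model.transmat_ = np.array([[0.9, 0.1, 0.0, 0.0, 0.0],
                            [0.0, 0.8, 0.2, 0.0, 0.0],
                            [0.0, 0.0, 0.8, 0.2, 0.0],
                            [0.0, 0.0, 0.0, 0.9, 0.1],
                            [0.0, 0.0, 0.0, 0.0, 1.0]])

# Baum-Welch (EM) training; zero entries of the transition matrix stay zero,
# so the left-to-right structure is preserved.
model.fit(X, lengths)

# Viterbi decoding of one sequence into the five hidden states; the mapping of
# state indices to segment labels is learned without supervision, as in the text.
states = model.predict(X[:lengths[0]])
print(states[:10])
```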
B. Frame-Based Detection
Frame-based detection uses only the features of the current frame. For the experiments, Weka, a tool for automatic classification, was used [77]. Several different classifiers were employed, and their results were compared: C4.5; Random Forest; BayesNet; an SVM trained with sequential minimal optimization (SVM-SMO); the multilayer perceptron; and AdaBoost [77]. C4.5 is a generator of supervised symbolic classifiers based on the notion of entropy, since its output, a decision tree, consists of nodes created by minimizing an entropy cost function. A Random Forest classifier uses a number of decision trees in order to improve the classification rate; the forest combines the votes obtained from all of its trees and chooses the class with the majority of votes as the predicted class. BayesNet enables Bayesian network learning with various search algorithms and quality measures; various estimator algorithms, such as the SimpleEstimator and the BMAEstimator, can be used for finding the conditional probability tables of the network. SMO is a fast method that implements Platt's SMO algorithm for training a support vector classifier [58]; this implementation globally replaces all missing values and transforms nominal attributes into binary ones, and multiclass problems are solved using pairwise classification. The multilayer perceptron (MLP) can be seen as the simplest kind of feedforward neural network trained with backpropagation. An MLP consists of several layers of neurons, with each layer fully connected to the next one; the linear neurons are modified so that a slight nonlinearity is added after the linear summation. Boosting (e.g., AdaBoost) takes a selected classifier as the base classifier and a training set as input and
runs the base classifier multiple times by changing the distribution of the training set instances. The generated classifiers are then combined to create a final classifier that is used to classify the test set. Boosting methods usually work best with unstable base classifiers, such as decision trees, which suffer from high variance under small perturbations of the data. For the frame-based classifiers, we also considered the possibility of adopting dimensionality-reduced feature sets. As reported in [35], an acceptable rule of thumb for the size of the training set is to use at least ten times as many training samples per class as the number of features. Since the size of our training set satisfies this rule, we decided not to perform any dimensionality reduction for temporal segment detection in the case of the frame-based classifiers. The experiments conducted using the aforementioned classifiers are described in detail in Section VII.
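The experiments reported here were run in Weka; purely as an illustration of frame-based temporal segment detection, a rough scikit-learn analogue (with random arrays standing in for the per-frame features and segment labels, and with BayesNet omitted for lack of a direct counterpart) might look as follows.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder frame-level data: 152 features per frame, labels in
# {0: neutral, 1: onset, 2: apex, 3: offset}.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 152)), rng.integers(0, 4, 2000)
X_test, y_test = rng.normal(size=(500, 152)), rng.integers(0, 4, 500)

# Rough counterparts of the Weka classifiers used in the text
# (C4.5, Random Forest, SVM-SMO, MLP, AdaBoost).
classifiers = {
    "decision tree (C4.5-like)": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(n_estimators=10),
    "linear SVM (SMO-like)": SVC(kernel="linear"),
    "MLP": MLPClassifier(max_iter=300),
    "AdaBoost (stumps; the paper boosts C4.5 trees)": AdaBoostClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                      # train on frame features
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {acc:.3f}")                    # per-frame segment detection rate
```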
VI. AFFECT RECOGNITION
In this section, we address the recognition of affective states, a problem with an even larger dimensionality than that of temporal segment detection. We obtain affect recognition in two ways, sequence-based and frame-based classification, from the bimodal analysis of a video. We also provide details of monomodal affect recognition, merely to obtain a term of reference for the performance of the bimodal affect recognition.
A. Sequence-Based Classification
For affect recognition with the sequence-based classifier, we separately created the eigenspaces for the face and body by using 7000 frames for the face and 7000 for the body, labeled with the 12 affective states and the four temporal segments. We then projected both the training and testing sets onto the corresponding eigenspace and obtained the reduced feature-vector representation. For both the face and body feature vectors, we chose, again, to use the first four principal components, thus reducing their dimensionality from 152 and 170, respectively, down to 4. As a next step, we obtained the classification of the affective states by applying HMMs to the video sequences. Each affective state was modeled by a different emotion-specific HMM. Thus, during training, we obtained 12 HMM models λFi for the face and 12 HMM models λBi for the body modality. During testing, for a given face sequence, the aim is to find how well a particular model λFi matches the sequence of observations SF. To this aim, we compute p(SF | λFi) for each model λFi using the Forward procedure. The λFi with the highest likelihood then gives us the affective state i* for sequence SF out of the 12 possible affective states. This can be described as follows:

assign SF → i*,  i* = arg max_{i=1,...,12} p(SF | λFi).

A similar procedure is applied for the body modality by training 12 HMM models λBi:

assign SB → i*,  i* = arg max_{i=1,...,12} p(SB | λBi).

The initialization of the variables for the EM algorithm was carried out as for the temporal segment detection of Section V.
• Number of states: N = 5.
• Initial state probabilities: Π = [1 0 0 0 0].
• Initial state transition probability matrix: 5 × 5, initialized as in Section V.
• Input data dimension (i.e., number of features): D = 4.
• Number of Gaussians per state: M = 2.
• Covariance type: spherical.
• Weights of the Gaussian components: randomly generated from a uniform distribution.
Finally, using the sum criterion, decision-level fusion was applied to the HMM outputs to classify video sequences of affective face and body display.
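As an illustration only, a bank of per-class HMMs scored with the Forward log-likelihood might be sketched as follows; hmmlearn, single-Gaussian emissions, and synthetic 4-D sequences are assumptions, not the authors' implementation.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
N_CLASSES = 12                                   # the 12 affective states

def make_sequences(n_seq, length=40, dim=4):
    """Placeholder PCA-reduced sequences for one affective state."""
    return [rng.normal(size=(length, dim)) for _ in range(n_seq)]

# Train one HMM per affective state (the lambda_i models in the text).
models = []
for i in range(N_CLASSES):
    seqs = make_sequences(5)
    X = np.concatenate(seqs)
    lengths = [len(s) for s in seqs]
    m = hmm.GaussianHMM(n_components=5, covariance_type="spherical", n_iter=30)
    m.fit(X, lengths)
    models.append(m)

# Classify a test sequence: score() returns the Forward log-likelihood
# log p(S | lambda_i); the class with the highest likelihood is selected.
test_seq = rng.normal(size=(40, 4))
log_likes = [m.score(test_seq) for m in models]
print("predicted affective state:", int(np.argmax(log_likes)))
```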
B. Frame-Based Classification
As the number of classes for affect recognition is larger than that for temporal segment detection, dimensionality reduction seemed advisable also in the case of the frame-based classifiers. Feature selection was preferred to PCA in this case since the needed reduction was smaller, permitting us to retain a large number of the original features together with their physical interpretation [35]. The number of features was reduced by ranking them with a chi-squared filter; eventually, the first 50, 45, and 95 features were retained for the face feature vector, the body feature vector, and the fused face-and-body feature vector, respectively. The overall dimensionality of such reduced feature sets is lower than that of the reduced feature set for the sequence-based classifier (i.e., four times the number of frames in the frame sequence); therefore, the performance comparison presented in Section VII is fair to the sequence-based classifier. Following the dimensionality reduction step, frame-based classification was performed in two steps.
1) All of the frames from the face and body videos were first classified by the temporal segment classifiers in order to detect their temporal segment.
2) Only the feature sets from the apex frames obtained at step 1 were used by the affective state classifiers.
For feature-level fusion, as shown in Figs. 1 and 2, the feature vectors of the apex frames from the face and body modalities were fused one by one in a linear manner until one of the modalities ran out of frames previously selected as apex by the temporal segment classifiers (the body modality in Fig. 1). This procedure is also illustrated in Fig. 8 for inferring the affective state puzzlement from its corresponding facial and bodily expressions; the figure shows how apex frames from the affective face and body display are paired for fusion when the two modalities reach their apex phase at different times. We chose linear frame pairing, instead of exhaustive frame pairing, to limit the computational complexity. The concatenated feature vectors were first dimensionally reduced and then input into various classifiers for classification into the affective state categories. Results are presented in Section VII.
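A minimal sketch of the pairing and concatenation step under stated assumptions: scikit-learn's chi-squared filter (after rescaling to non-negative values) stands in for Weka's chi-squared ranking, and random arrays replace the real apex-frame features and labels.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Placeholder apex-frame features and affect labels for one video.
face_apex = rng.normal(size=(8, 152))     # face detected at apex in 8 frames here
body_apex = rng.normal(size=(5, 170))     # body detected at apex in 5 frames here
labels = rng.integers(0, 12, 8)           # per-frame affective-state labels

# Linear pairing: fuse apex frames one by one until the shorter modality runs out.
n = min(len(face_apex), len(body_apex))
fused = np.hstack([face_apex[:n], body_apex[:n]])      # (n, 322) concatenated vectors

# Chi-squared ranking of the fused features, keeping the top 95 as in the text;
# chi2 needs non-negative inputs, so the features are rescaled first (Weka's filter
# discretizes numeric attributes instead). The monomodal classifiers would keep
# 50 (face) and 45 (body) features analogously.
fused_pos = MinMaxScaler().fit_transform(fused)
fused_sel = SelectKBest(chi2, k=95).fit_transform(fused_pos, labels[:n])
print(fused_sel.shape)                                  # (5, 95)
```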
Fig. 8. Illustration of how the apex frames from the affective face and body display are paired for fusion when the two modalities reach their apex phase at different times, for the affective state puzzlement: (First column) The face display reaches its apex phase (frame 71) while the body gesture (frame 71) is still in its onset phase. (Columns 2–5) Pairing of the face (frames 93–96) and body (frames 99–102) apex frames while the actual body gesture has not yet reached its apex phase. (Last column) Pairing of the face (frame 100) and body (frame 106) apex frames when the actual body gesture has also reached its apex phase.
TABLE III DESCRIPTION OF THE THREE LATE-FUSION CRITERIA USED
TABLE IV CLASSIFICATION RESULTS OF THE FRAMES FROM THE FACE AND BODY VIDEOS INTO TEMPORAL SEGMENTS
In decision-level fusion, separate classifiers for the face and body first processed their respective data streams to produce two separate sets of outputs. The outputs were then combined to produce the final hypothesis. Designing optimal strategies for decision-level fusion is still an open research issue, depending also on the framework chosen for optimality. Various approaches have been proposed, including the sum rule, product rule, use of weights, maximum/minimum/median rules, majority vote, etc. [40]. The first three techniques were analyzed in the FABO system, i.e., the sum, product, and weighted-sum criteria. The approach of late integration of the individual classifier outputs can be described as follows. The temporally segmented apex feature vectors for the face and body are represented by fFa and fBa, respectively. Under a maximum a posteriori (MAP) approach, the most likely class out of M possible classes (C1, ..., Ck, ..., CM) is the one having the maximum posterior probability p(Ck | fFa, fBa). A feature-level fusion approach explicitly computes such a probability. In late integration, on the other hand, two separate classifiers provide posterior probabilities p(Ck | fFa) and p(Ck | fBa) for the face and body, respectively, and p(Ck | fFa, fBa) is approximated with one of the fusion methods described in Table III. We assumed equal priors for all classes. When analyzing bimodal data, is one of the modalities "primary," in the sense of being more discriminative on its own? While developing the FABO system, we initially assumed that the face modality was the primary modality (as presented in [32]). However, this intuitive assumption was proven incorrect by the experiments on monomodal affect recognition carried out for this paper. Therefore, based on the confidence measures obtained from the monomodal affect recognition results, the following weights were chosen for the weighted-sum criterion: σF = 0.3 for the face modality and σB = 0.7 for the body modality.
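As a sketch of the three late-fusion criteria of Table III under simple assumptions (the per-class posteriors from the two classifiers are given as arrays, and equal priors are assumed), the rules might be written as follows; the weights are those reported above.

```python
import numpy as np

def fuse_posteriors(p_face, p_body, method="weighted", w_face=0.3, w_body=0.7):
    """Combine per-class posteriors p(Ck|fFa) and p(Ck|fBa) from the two classifiers."""
    p_face, p_body = np.asarray(p_face), np.asarray(p_body)
    if method == "sum":
        fused = p_face + p_body
    elif method == "product":
        fused = p_face * p_body
    elif method == "weighted":
        fused = w_face * p_face + w_body * p_body
    else:
        raise ValueError(method)
    return int(np.argmax(fused))      # index of the most likely affective state

# Hypothetical posteriors over 12 affective states from the face and body classifiers.
p_face = np.full(12, 1 / 12.0)
p_body = np.zeros(12)
p_body[6], p_body[7] = 0.6, 0.4

for m in ("sum", "product", "weighted"):
    print(m, fuse_posteriors(p_face, p_body, m))
```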
VII. EXPERIMENTAL RESULTS
A. Temporal Segment Detection
For sequence-based temporal segment detection, 286 training video sequences were used to train a single HMM model. For testing, 253 sequences were fed into the model, and the state sequence of each of them was decoded. The Viterbi algorithm, which provides the best interpretation given the entire context of the observations, was used for decoding. The results were then compared with the ground truth of the testing data. An overall detection rate of 28.7% for the face and 37.2% for the body was achieved. Overall, although the performance of the HMM is facilitated by a number of constraints, the detection rate achieved proved to be very low.
For frame-based temporal segmentation, the available frames from the face and body video sequences were partitioned into two independent data sets. For the face, a training data set of 19 625 frames with 152 features and a test data set of 16 759 frames were used. Similarly, for the body, a training data set of 19 584 frames with 170 features and a test data set of 16 783 frames were used. Table IV presents the results obtained for the face and body videos with various classifiers. The best detection results for the face videos were obtained with SVM-SMO: 9598 frames out of 16 759 were correctly classified, for a detection rate of 57.27%. For the body modality, the best detection results were obtained with C4.5: 13 538 frames out of 16 783 were correctly classified, for a detection rate of 80.66%. Table V presents the full confusion matrices for the detection of the temporal segments from the face and body feature vectors. Overall, body movements are more distinguishable for the detection of the temporal segments than facial movements.
TABLE V CONFUSION MATRICES FOR CLASSIFICATION RESULTS OF THE FRAMES FROM (TOP) FACE AND (BOTTOM) BODY VIDEOS INTO TEMPORAL SEGMENTS
Although the distinction of the temporal segments is still possible with high confidence for the body modality, the task becomes more challenging for the face modality. This is explained by the fact that facial movements are subtle compared to body movements; in the transition stages, temporal segments are easily confused with one another, and even high-resolution input might not provide absolutely correct detection results. Overall, our experimental results show that frame-based classification outperforms sequence-based classification in the task of temporal segment detection. This can be explained by the fact that sequence-based classifiers are harder to train due to their complexity and the number of parameters that they need to learn [15]; therefore, they require more training samples than frame-based classifiers. Furthermore, the hidden Markov model used in our experiments does not benefit from supervision during training. For these reasons, for the detection of temporal segments for both the face and the body modalities, frame-based classifiers outperformed the sequence-based classifier. The confusion matrices for frame-based classification show that the neutral and apex segments enjoy higher detection rates than the onset and offset segments. This can easily be explained by the observation that muscular contractions during the apex and neutral phases are constant; this, in turn, leads to feature vectors whose values aggregate in compact clusters. Instead, both the onset and offset segments are transient phases characterized by very dispersed sets of feature vectors, intermediate in value between those of the neutral and apex frames. As a further observation, the neutral phases are basically identical across different emotions and cannot help with their discrimination; apex phases, on the other hand, are maximally discriminative. These observations led to the fundamental design choices of FABO: 1) decoupling temporal and spatial complexities by identifying the frames from the apex phase prior to affect recognition and 2) basing affect recognition solely on the apex frames' feature vectors.
B. Monomodal Affect Recognition
The experiments for sequence-based affect recognition were carried out with a range of initial values for the variances (0.01–5) and using twofold cross validation. We used 286 training and 253 test videos for the face and body modalities separately. Even with different initial variance values, the overall recognition results did not noticeably change.
TABLE VI CONFUSION MATRICES FOR THE CLASSIFICATION RESULTS OF THE (TOP) FACE AND (BOTTOM) BODY VIDEOS INTO 12 AFFECTIVE STATES USING HMMS
Table VI provides the confusion matrices of the classification results of the face (top) and body (bottom) videos into 12 affective states using HMMs. As can be observed from the table, detection rates of only 11% and 12.6% were achieved for the face and body videos, respectively, which is only slightly better than chance classification. For the experiment on frame-based recognition, the available frames from the face and body video sequences were partitioned into two data sets. For the face, a training data set of 19 625 frames with 152 features and a test data set of 16 759 frames were used. Similarly, for the body, a training data set of 19 584 frames with 170 features and a test data set of 16 783 frames were used. To avoid polluting the affective state classifiers, we decided to use manually selected apex frames for the training phase only. For the face modality, 6346 manually selected apex frames with 50 features were used. Similarly, for the body modality, 5587 manually selected apex frames with 45 features were used. For testing, we resorted to automatically detected apex frames so as to test the whole system in operation; the accuracy of affect recognition is therefore affected by that of temporal segment detection, as in a real application. For testing the face and body modalities, 5456 and 4633 automatically classified apex frames were used, respectively. The results obtained by training and testing various classifiers are presented in Table VII. For the face modality, the best recognition results were obtained using AdaBoost with C4.5. Out of a total of 5456 apex frames, 1922 were correctly classified, and a classification rate of 35.22% was achieved.
TABLE VII MONOMODAL CLASSIFICATION RESULTS OF THE FRAMES FROM (LEFT) FACE AND (RIGHT) BODY VIDEOS INTO AFFECTIVE STATE CATEGORIES
TABLE VIII DETAILED ACCURACY BY CLASS FOR THE CLASSIFICATION OF THE FRAMES FROM THE (LEFT) FACE (C4.5/ADABOOST ) AND (RIGHT) BODY (RANDOM FOREST) VIDEOS INTO 12 AFFECTIVE STATES
For the body modality, the best recognition results were obtained using a Random Forest of ten trees, each constructed considering six random features. Out of a total of 4312 apex frames, 3315 were correctly classified, and a classification rate of 76.87% was achieved. Detailed accuracy by class for the classification of the face and body videos into affective state categories using the frame-based recognition method is provided in Table VIII. The difference in the correct classification rates achieved for the face and body modalities in monomodal affect recognition can be read as follows: body gestures/postures provide better information than the face modality. This finding broadly confirms what has been reported by previous researchers who have attempted to compare monomodal and multimodal/multicue affect recognition from face and body cues (e.g., [5], [37], [38], and [72]). For instance, Balomenos et al. [5] reported higher accuracy when recognizing six affective states from body gestures than from facial expressions. Kapoor and Picard [38] reported that the posture channel classifies the affective states best (81.97%), followed by the features from the upper face (66.81%). For discerning posed from spontaneous smiles, the results obtained by Valstar et al. [72] seem to indicate that the head is the most reliable source, closely followed by the face. The results obtained from the experiments conducted in this paper confirm such findings. Overall, for automatic affect sensing or recognition, certain modalities or cues seem to be more reliable than others, depending on the context and/or the problem at hand. However, extensive studies should be conducted to confirm which modality or cue is more reliable in which cases.
TABLE IX BIMODAL CLASSIFICATION RESULTS OF THE COMBINED FACE AND BODY FEATURE VECTORS INTO AFFECTIVE STATES
TABLE X DETAILED ACCURACY BY CLASS FOR THE CLASSIFICATION OF COMBINED FACE AND BODY FEATURE VECTORS INTO AFFECTIVE STATES
If the results obtained from the sequence-based classification in Table VI are compared to those obtained from the frame-based classification in Table VII, it is possible to conclude that the accuracy increases significantly for both the face and body modalities with the latter approach. This confirms the validity of the fundamental design choices of the FABO system.
C. Bimodal Affect Recognition
For sequence-based classification, decision-level fusion with the sum criterion was applied to the HMM results for 253 video sequences of affective face and body display. Out of 253 sequences, 44 were correctly classified, resulting in a 17.3% detection rate. Although fusion helps improve the results, the detection rate using HMMs remains generally low. For frame-based classification of affective states, temporal segments were first detected, and apex frames were selected, as described in Section V. Only for the training phases of the affective state classifiers did we use manually segmented apex frames, to improve training. For feature-level fusion, 5079 concatenated feature vectors from the face and body were used; for decision-level fusion, the same training sets as for monomodal recognition were used. For testing the feature-level fusion, 3522 concatenated feature vectors from the face and body, built from automatically selected apex frames, were used. The results are presented in Table IX. The best recognition results were obtained using AdaBoost with a Random Forest of ten trees, each constructed considering seven random features. Out of a total of 3522 apex frames, 2911 were correctly classified, and a classification rate of 82.65% was achieved. Detailed accuracy by class is provided in Table X. For decision-level fusion, the best recognition results obtained during monomodal affect recognition were used (see Table VII).
TABLE XI RESULTS OF THE DECISION-LEVEL FUSION
Fig. 9. Sample images from challenging facial data analyzed by the FABO system. (First row) Face video during which the mouth region is occluded by the hand. (Second row) Face video during which there is substantial head movement.
For the face modality, the best recognition results were obtained using AdaBoost with C4.5, and for the body modality, using a Random Forest of ten trees. Affect recognition was thus achieved by using all the apex frames obtained from the previous step. The monomodal recognition results of the corresponding face and body videos were combined using the sum, product, and weighted-sum criteria. The results are presented in Table XI. According to the experimental results, with a 78% classification rate, the weighted-sum criterion with σF = 0.3 for the face modality and σB = 0.7 for the body modality provides better recognition results than the unweighted sum or product criteria. A major finding of the experiments conducted is that, for affect recognition, the face modality alone is not sufficient (i.e., affect recognition from the face modality alone is not high). There may be many explanations for this, such as latent variables. Although the FABO database was created in laboratory settings, the FABO data differ from other data used in facial expression recognition (e.g., the Cohn–Kanade and MMI databases; see [29] for details) in the following four ways: 1) within-class variation; 2) rigid head motion; 3) occluded face/body features; and 4) multiple and asynchronous displays. As strict instructions were not given during the recordings of the FABO database, the differences arise not only across subjects but also across sessions for the same emotional/affective display. Videos within a class vary along several dimensions, such as the underlying configuration and dynamics of the head, face, and body display. Moreover, as the subjects were simultaneously but asynchronously displaying both face and body expressions, and there were no strict restrictions on the head or body movements of the actors, a substantial degree of head motion and significant amounts of in-plane and in-depth rotation were present in most videos (see Fig. 9). In general, the classification rate obtained with feature-level fusion is similar to that of the body modality alone; however, certain interclass ambiguities have been resolved, and high interclass error rates have been reduced. In order to illustrate how the fused feature vector helps with resolving the ambiguity
caused by the face or body modalities, the classification results obtained for video 009 from subject 1 (face and body) are provided in Table XII. For the given video, the actual label of both the face and body modalities is "negative surprise." As can be seen from the table, although there is confusion in the monomodal recognition results (using either the face or the body modality), the ambiguity is resolved to some extent by decision-level fusion and fully by feature-level fusion. As performance comparison is a significant issue, the next question is whether it is possible to meaningfully compare the FABO system with other existing systems or approaches. First, in order to make the evaluation comparable, we extend the affective state classification from a frame basis to a video basis, since most results in the literature are reported on the latter. To this aim, we used the results obtained from feature-level fusion and applied a simple majority criterion to the classifications of the individual frames in each video. This provided us with 85% classification accuracy, with 215 out of 253 testing videos correctly classified. Second, we present a comparison of the results from our approach and relevant experiments on multimodal affect recognition in Table XIII. Considering that the work presented by Kapoor and Picard [38] is the most similar to the FABO system, a comparison between these two systems is provided next. Under a similar experimental methodology (using 50% of the data for training and the remaining 50% for testing), the FABO system obtains average recognition rates of 82.6% (frame basis) and 85% (video basis) for the recognition of 12 affective states (anger, anxiety, boredom, disgust, fear, happiness, negative surprise, positive surprise, neutral surprise, uncertainty, puzzlement, and sadness) and their temporal segments (neutral–onset–apex–offset–neutral) from bimodal face and body display; the system of Kapoor and Picard [38], on the other hand, reports 86.5% accuracy when detecting three affective states (high interest, low interest, and refreshing) from the face video, a posture sensor, and the game being played. While the accuracies are comparable, the application scenario for those sensors seems to be narrower than that of face and body video data. The FABO database used in this paper was collected by the authors, and the experiments were conducted using this database.
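A minimal sketch of the frame-to-video majority criterion mentioned above, with hypothetical per-frame predictions:

```python
from collections import Counter

def video_label(frame_predictions):
    """Majority vote over the per-frame affective-state predictions of one video."""
    return Counter(frame_predictions).most_common(1)[0][0]

# Hypothetical per-frame labels for one test video.
print(video_label(["anger", "anger", "anxiety", "anger", "puzzlement"]))   # -> "anger"
```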
TABLE XII SAMPLE VIDEO AND HOW AFFECT RECOGNITION IS ACHIEVED BY AUTOMATICALLY SELECTING THE APEX FRAMES AND USING THEM FOR AFFECT RECOGNITION. THE ACTUAL LABEL OF BOTH FACE AND BODY VIDEOS IS “NEGATIVE SURPRISE.” RESULTS FROM MONOMODAL (USING EITHER THE FACE OR THE BODY MODALITY ALONE) AND BIMODAL CLASSIFICATIONS ARE PROVIDED AND COMPARED
TABLE XIII COMPARISON OF RESULTS FROM OUR APPROACH AND RELATED WORKS ON MULTIMODAL AFFECT RECOGNITION
It might be argued that results on another benchmark database should be included in order to quantitatively compare our results with other approaches. We would like to note that the FABO database was created for such a purpose, and other works have already started using it in their research (e.g., [66]). Shan et al. [66] were the first to report affect recognition results on the FABO database. They exploit spatiotemporal features based on space-time interest point detection to represent body gestures in videos, and they fuse facial expressions and body gestures at the feature level by using canonical correlation analysis, a statistical tool suited to relating two sets of signals. For their experiments, they selected 262 videos of seven affective states (anger, anxiety, boredom, disgust, happiness, puzzlement, and surprise) from 23 subjects in the FABO database and obtained 88.5% accuracy. In order to compare the two systems under similar criteria, we tested how the FABO system performs on the seven affective state categories used by Shan et al. By applying the majority criterion, we obtained 89.8% accuracy. Based on these comparisons, we can conclude that, using the same database, the FABO system obtains a slightly higher recognition
accuracy, exhibits a higher level of expertise (it recognizes 12 affective states), can explicitly detect the temporal segments and achieve synchronization, and performs affect recognition on both a frame basis and a video basis.
VIII. SUMMARY AND CONCLUSION
The outline of our approach starts from the separate analyses of facial expressions and body gestures and the detection of the temporal segment of each individual frame. In order to achieve this, individual classifiers are separately trained on face and body features. Using only the detected neutral and apex frames, monomodal affect recognition is first obtained for the sake of comparison. Second, the affective face and body modalities are synchronized based on the apex phase and fused for classification at the feature level ("early" fusion), in which the data from both modalities are combined before classification, and at the decision level ("late" fusion), in which the probabilities from the single modalities are combined.
Overall, from the experiments conducted in this paper, we draw four main conclusions.
1) For an automated system based on vision, it is easier to model and recognize affective states from global body and head-region movements and their relationship to each other than from the atomic movements of facial features.
2) Modeling, detecting, and using the temporal segments/phases prove useful to the overall task of affect recognition.
3) Affective state classification using the phase-synchronized modalities achieves a better detection rate in general, outperforming the classification using the face or the body modality alone as well as that of sequential classifiers such as HMMs.
4) Comparing the experimental results, feature-level fusion seems to achieve a better detection rate than decision-level fusion. Among the decision-level fusion methods, the weighted-sum criterion provides the best combination of the two modalities.
As previously stated, for automatic affect sensing or recognition, certain modalities or cues seem to be more reliable than others, depending on the context and/or the problem at hand. However, extensive studies should be conducted to confirm which modality or cue is more reliable in which cases. From the findings obtained, this paper concludes that feature-level fusion seems to perform better when analyzing affective bimodal face-and-body data. However, this can be achieved only if it is possible to detect the temporal segments of the affective face and body display, synchronize them accordingly, and fuse the apex frames only (selective fusion). Consequently, multimodal affect recognizers have to integrate the temporal dimension. The FABO system is able to process bimodal data and their temporal segments and to fuse them at either the feature level or the decision level.
The various modules of the FABO system introduced in this paper have a number of limitations. The FABO system cannot deal with missing data or multiple expressions in a single video sequence (e.g., happiness followed by anger), as it is currently assumed that the display starts with a neutral state and follows the neutral–onset–apex–offset–neutral temporal pattern. Moreover, within its feature extraction, it cannot handle distractions such as glasses, facial hair, and left/right face profile views. It is assumed that the background does not change, except for the movement of the human body and face. The FABO system expects the scene to be significantly less dynamic than the user and only one user to be in the space. The FABO system also currently assumes that, initially, the person is in frontal view, the complete upper body is not occluded, and the two hands are visible and not occluding each other. The CamShift algorithm shows shortcomings when it comes to tracking multiple objects in a scene; however, tracking of multiple targets could be improved in a number of ways. Dreuw et al. [21], for instance, have shown that tracking using dynamic programming with eigenfaces and skin-color probability scoring functions performs better than tracking using either of these methods alone. They have also shown that their proposed approach is superior under noisy circumstances to the CamShift tracking
algorithm. Consequently, future investigations toward more robust detection and tracking of the upper body, face (and facial features), and hands, possibly by using dynamic programming for tracking, which also allows partial tracebacks (e.g., of one or two frames), are both necessary and feasible.
In natural HCI settings, gestures are continuous. Gesture recognition therefore requires spotting of the gesture, i.e., determining the start and end points of a meaningful gesture pattern in a continuous stream (time segmentation) [47]. The FABO system currently does not perform gesture spotting due to the particular nature of the data at hand (i.e., the phases are finite: neutral, onset, apex, and offset). However, the FABO system could be extended to analyze a gesture continuum, determine the start and end points of a meaningful gesture pattern, and subsequently apply recognition. Currently, the FABO system explicitly detects the temporal segments or phases of affective states/emotions in order to decouple temporal dynamics from spatial extent and reduce the dimensionality of the problem compared to simultaneously dealing with them. A further goal of the FABO system is to investigate the importance of the order and duration of the temporal phases for interpreting emotional displays. As confirmed by many researchers in the field, directed affective face and body actions differ in appearance and timing from spontaneously occurring behavior [17]. The methods and experiments presented in this paper could be further extended with data obtained in natural and realistic settings. Our future research will focus on these issues.
ACKNOWLEDGMENT The authors would like to thank A. Gunes for her help with the FABO recordings and the annotation procedure and the anonymous reviewers for their critical feedback, valuable comments, and suggestions.
R EFERENCES [1] The Facial Recognition Technology (FERET) Database. [Online]. Available: http://www.itl.nist.gov/ iad/ humanid/ feret/ feret_master.html [2] J. Allman, J. T. Cacioppo, R. J. Davidson, P. Ekman, W. V. Friesen, C. E. Izard, and M. Phillips, NSF Report-Facial Expression Understanding, 2003. Report. [Online]. Available: http://face-and-emotion.com data face nsfrept basic_science.html [3] N. Ambady and R. Rosenthal, “Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis,” Psychol. Bull., vol. 11, no. 2, pp. 256–274, 1992. [4] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, Feb. 2002. [5] T. Balomenos, A. Raouzaiou, S. Ioannou, A. Drosopoulos, and K. Karpouzis, “Emotion analysis in man–machine interaction systems,” in Proc. Workshop Multimodal Interaction Related Mach. Learn. Algorithms, 2004, pp. 318–328. [6] M. S. Bartlett, G. Littlewort, M. G. Frank, C. Lainscsek, I. Fasel, and J. Movellan, “Fully automatic facial action recognition in spontaneous behavior,” in Proc. Int. Conf. Comput. Vis. Pattern Recognit., 2006, pp. 223–230. [7] M. S. Bartlett, G. C. Littlewort, M. G. Frank, C. Lainscsek, I. Fasel, and J. R. Movellan, “Automatic recognition of facial actions in spontaneous expressions,” J. Multimedia, vol. 1, no. 6, pp. 22–35, Sep. 2006. [8] S. Bengio, “Multimodal speech processing using asynchronous hidden Markov models,” Inf. Fusion, vol. 5, no. 2, pp. 81–89, Jun. 2004.
[9] T. Bänziger and K. R. Scherer, “Using actor portrayals to systematically study multimodal emotion expression: The GEMEP corpus,” in Proc. 2nd Int. Conf. Affective Comput. Intell. Interaction, 2007, pp. 476–487. [10] G. R. Bradski, “Computer vision face tracking for use in a perceptual user interface,” Intel Technol. J., 2nd Quarter 1998. [11] J. K. Burgoon, M. L. Jensen, T. O. Meservy, J. Kruse, and J. F. Nunamaker, “Augmenting human identification of emotional states in video,” in Proc. Int. Conf. Intell. Data Anal., 2005. [Online]. Available: https://analysis.mitre.org/proceedings/Final_Papers_Files/ 344_Camera_Ready_Paper.pdf [12] A. Camurri, I. Lager, and G. Volpe, “Recognizing emotion from dance movement: Comparison of spectator recognition and automated techniques,” Int. J. Hum.-Comput. Stud., vol. 59, no. 1/2, pp. 213–225, Jul. 2003. [13] Y. Chen, Y. Rui, and T. S. Huang, “Multicue HMM-UKF for real-time contour tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 9, pp. 1525–1529, Sep. 2006. [14] G. S. Christensen and S. A. Soliman, “A new technique for linear static state estimation based on weighted least absolute value approximations,” J. Optim. Theory Appl., vol. 61, no. 1, pp. 123–136, Apr. 1989. [15] I. Cohen et al., “Facial expression recognition from video sequences: Temporal and static modeling,” Comput. Vis. Image Underst., vol. 91, no. 1, pp. 160–187, Jul. 2003. [16] J. F. Cohn, J. F. Reed, Z. Ambadar, J. Xiao, and T. Moriyama, “Automatic analysis and recognition of brow actions in spontaneous facial behavior,” in Proc. IEEE Int. Conf. Syst., Man Cybern., 2004, vol. 1, pp. 610–616. [17] J. F. Cohn, L. I. Reed, T. Moriyama, X. Jing, K. Schmidt, and Z. Ambadar, “Multimodal coordination of facial action, head rotation, and eye motion during spontaneous smiles,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recog., 2004, pp. 129–135. [18] A. Corradini, M. Mehta, N. O. Bernsen, and J.-C. Martin, “Multimodal input fusion in human computer interaction on the example of the on-going nice project,” in Proc. NATO-ASI Conf. Data Fus. Situation Monitoring, Incident Detection, Alert Response Manage., 2003, pp. 223–234. [19] M. Coulson, “Attributing emotion to static body postures: Recognition accuracy, confusions, and viewpoint dependence,” Nonverbal Behav., vol. 28, no. 2, pp. 117–139, Jun. 2004. [20] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, L. Lowry, M. McRorie, L. Jean-Claude Martin, J.-C. Devillers, A. Abrilian, S. Batliner, A. Noam, and K. Karpouzis, “The humaine database: Addressing the needs of the affective computing community,” in Proc. 2nd Int. Conf. Affective Comput. Intell. Interaction, 2007, pp. 488–500. [21] P. Dreuw, T. Deselaers, D. Rybach, D. Keysers, and H. Ney, “Tracking using dynamic programming for appearance-based sign language recognition,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recog., 2006, pp. 293–298. [22] J. Driver and C. Spence, “Multisensory perception: Beyond modularity and convergence,” Curr. Biol., vol. 10, no. 20, pp. 731–735, Oct. 2000. [23] P. Ekman, “About brows: Emotional and conversational signals,” in Human Ethology: Claims and Limits of a New Discipline: Contributions to the Colloquium, M. V. Cranach, K. Foppa, W. Lepenies, and D. Ploog, Eds. New York: Cambridge Univ. Press, 1979, pp. 169–248. [24] P. Ekman and W. V. Friesen, Unmasking the Face: A Guide to Recognizing Emotions From Facial Clues. Englewood Cliffs, NJ: Prentice-Hall, 1975. [25] A. Elgammal, V. Shet, Y. Yacoob, and L. S. 
Davis, “Learning dynamics for exemplar-based gesture recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2003, pp. 571–578. [26] S. D. Villalba, G. Castellano, and A. Camurri, “Recognising human emotions from body movement and gesture dynamics,” in Proc. Int. Conf. Affective Comput. Intell. Interaction, 2007, pp. 71–82. [27] H. Gu and Q. Ji, “An automated face reader for fatigue detection,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recog., 2004, pp. 111–116. [28] H. Gunes, “Vision-based multimodal analysis of affective face and upper-body behaviour,” Ph.D. dissertation, Univ. Technol. Sydney (UTS), Sydney, Australia, 2007. [29] H. Gunes and M. Piccardi, “A bimodal face and body gesture data base for automatic analysis of human nonverbal affective behavior,” in Proc. Int. Conf. Pattern Recog., 2006, vol. 1, pp. 1148–1153. [30] H. Gunes and M. Piccardi, “Creating and annotating affect data bases from face and body display: A contemporary survey,” in Proc. IEEE Int. Conf. Syst., Man Cybern., 2006, pp. 2426–2433. [31] H. Gunes and M. Piccardi, “Observer annotation of affective display and evaluation of expressivity: Face vs. face-and-body,” in Proc. HCSNet Workshop Use Vis. Hum.-Comput. Interaction, 2006, pp. 35–42.
[32] H. Gunes and M. Piccardi, “Bi-modal emotion recognition from expressive face and body gestures,” J. Netw. Comput. Appl., vol. 30, no. 4, pp. 1334–1345, Nov. 2007. [33] B. Hartmann, M. Mancini, S. Buisine, and C. Pelachaud, “Design and evaluation of expressive gesture synthesis for embodied conversational agents,” in Proc. 3rd Int. Joint Conf. Auton. Agents Multi-Agent Syst., 2005, pp. 1095–1096. [34] E. Hudlicka, “To feel or not to feel: The role of affect in human–computer interaction,” Int. J. Hum.-Comput. Stud., vol. 59, no. 1/2, pp. 1–32, Jul. 2003. [35] R. Duin, A. Jain, and J. Mao, “Statistical pattern recognition: A review,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 4–37, Jan. 2000. [36] R. El Kaliouby and P. Robinson, “Real-time inference of complex mental states from facial expressions and head gestures,” in Proc. Int. Conf. Comput. Vis. Pattern Recog., 2004, vol. 3, p. 154. [37] A. Kapoor and R. W. Picard, “Probabilistic combination of multiple modalities to detect interest,” in Proc. Int. Conf. Pattern Recog., 2004, vol. 3, pp. 969–972. [38] A. Kapoor and R. W. Picard, “Multimodal affect recognition in learning environments,” in Proc. ACM Int. Conf. Multimedia, 2005, pp. 677–682. [39] K. Karpouzis, G. Caridakis, L. Kessous, N. Amir, A. Raouzaiou, L. Malatesta, and S. Kollias, “Modeling naturalistic affective states via facial vocal and bodily expressions recognition,” in Artificial Intelligence for Human Computing, vol. 4451. New York: Springer-Verlag, 2007, pp. 11–92. [40] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998. [41] R. Laban and L. Ullmann, The Mastery of Movement, 4th ed. London, U.K.: Princeton Book Company, 1988. [42] R. Lienhart and J. Maydt, “An extended set of Haar-like features for rapid object detection,” in Proc. IEEE Int. Conf. Image Process., 2002, vol. 1, pp. 900–903. [43] C. L. Lisetti and F. Nasoz, “Maui: A multimodal affective user interface,” in Proc. ACM Int. Conf. Multimedia, 2002, pp. 161–170. [44] J.-C. Martin, R. Niewiadomski, L. Devillers, S. Buisine, and C. Pelachaud, “Multimodal complex emotions: Gesture expressivity and blended facial expressions,” J. Humanoid Robot., vol. 3, pp. 831–843, 2006. [45] R. Masters, “Compassionate wrath: Transpersonal approaches to anger,” J. Transpers. Psychol., vol. 32, no. 1, pp. 31–51, 2000. [46] T. O. Meservy, M. L. Jensen, J. Kruse, J. K. Burgoon, J. F. Nunamaker, D. P. Twitchell, G. Tsechpenakis, and D. N. Metaxas, “Deception detection through automatic, unobtrusive analysis of nonverbal behavior,” IEEE Intell. Syst., vol. 20, no. 5, pp. 36–43, Sep./Oct. 2005. [47] S. Mitra and T. Acharya, “Gesture recognition: A survey,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 37, no. 3, pp. 311–324, May 2007. [48] L. Nigay and J. Coutaz, “A generic platform for addressing the multimodal challenge,” in Proc. Conf. Human Factors Comput. Syst. (CHI), 1995, pp. 98–105. [49] H. Ning, T. X. Han, Y. Hu, Z. Zhang, Y. Fu, and T. S. Huang, “A realtime shrug detector,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recog., 2006, pp. 505–510. [50] T. Otsuka and J. Ohya, “Spotting segments displaying facial expression from image sequences using HMM,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recog., 1998, pp. 442–447. [51] M. Paleari and C. L. Lisetti, “Toward multimodal fusion of affective cues,” in Proc. 1st ACM Int. Workshop Human-Centered Multimedia, 2006, pp. 99–108. [52] H. Pan, T. S. 
Levinson, S. E. Huang, and Z.-P. Liang, “A fused hidden Markov model with application to bimodal speech processing,” IEEE Trans. Signal Process., vol. 52, no. 3, pp. 573–581, Mar. 2004. [53] M. Pantic and I. Patras, “Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 2, pp. 433–449, Apr. 2006. [54] M. Pantic, A. Pentland, A. Nijholt, and T. Huang, “Machine understanding of human behavior,” in Proc. Int. Joint Conf. Artif. Intell., Workshop AI Hum. Comput., 2007, pp. 13–24. [55] M. Pantic and L. J. M. Rothkrantz, “Toward an affect-sensitive multimodal human–computer interaction,” Proc. IEEE, vol. 91, no. 9, pp. 1370–1390, Sep. 2003. [56] M. Piccardi and O. Perez, “Hidden Markov models with kernel density estimation of emission probabilities and their use in activity recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2007, pp. 1–8.
[57] A. Pikovsky, M. Rosenblum, and J. Kurths, “Phase synchronization in regular and chaotic systems: A tutorial,” Int. J. Bifurc. Chaos, vol. 10, no. 10, pp. 2291–2305, 2000. [58] J. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods-Support Vector Learning, B. Scholkopf, C. Burges, and A. Smola, Eds. Cambridge, MA: MIT Press, 1977. [59] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989. [60] J. A. Russell, “A circumplex model of affect,” J. Pers. Soc. Psychol., vol. 39, no. 6, pp. 1161–1178, 1980. [61] S. Salvador and P. Chan, “FastDTW: Toward accurate dynamic time warping in linear time and space,” in Proc. KDD Workshop Mining Temporal Sequential Data, Seattle, WA, 2004, pp. 70–80. [62] K. L. Schmidt and J. F. Cohn, “Human facial expressions as adaptations: Evolutionary questions in facial expression research,” Yearb. Phys. Anthropol., vol. 44, pp. 3–24, 2001. [63] N. Sebe, I. Cohen, T. Gevers, and T. S. Huang, “Emotion recognition based on joint visual and audio cues,” in Proc. Int. Conf. Pattern Recog., 2006, vol. 1, pp. 1136–1139. [64] N. Sebe, M. S. Lew, I. Cohen, Y. Sun, T. Gevers, and T. S. Huang, “Authentic facial expression analysis,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recog., 2004, pp. 517–522. [65] C. Shan, S. Gong, and P. W. McOwan, “Dynamic facial expression recognition using a Bayesian temporal manifold model,” in Proc. Conf. Brit. Mach. Vis. Assoc., 2006, pp. 297–306. [66] C. Shan, S. Gong, and P. W. McOwan, “Beyond facial expressions: Learning human emotion from body gestures,” in Proc. Brit. Mach. Vis. Conf., 2007. [Online]. Available: www.dcs.warwick.ac.uk/bmvc2007/ proceedings/CD-ROM/papers/paper-276.pdf [67] D. X. Sun, “Feature dimension reduction using reduced-rank maximum likelihood estimation for hidden Markov model,” in Proc. Int. Conf. Spoken Language Process., 1996, pp. 244–247. [68] Y. L. Tian, T. Kanade, and J. F. Cohn, Handbook of Face Recognition. New York: Springer-Verlag, 2005. [69] Y. L. Tian, T. Kanade, and J. F. Cohn, “Facial expression analysis,” in The Handbook of Face Recognition, S. Z. Li and A. K. Jain, Eds. New York: Springer-Verlag, 2005. [70] Y. Tong, W. Liao, and Q. Ji, “Facial action unit recognition by exploiting their dynamic and semantic relationships,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 10, pp. 1683–1699, Oct. 2007. [71] M. Valstar and M. Pantic, “Fully automatic facial action unit detection and temporal analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2006, pp. 149–154. [72] M. F. Valstar, H. Gunes, and M. Pantic, “How to distinguish posed from spontaneous smiles using geometric features,” in Proc. ACM Int. Conf. Multimodal Interfaces, 2007, pp. 38–45. [73] J. Van den Stock, R. Righart, and B. De Gelder, “Body expressions influence recognition of emotions in the face and voice,” Emotion, vol. 7, no. 3, pp. 487–494, Aug. 2007. [74] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2001, vol. 1, pp. 511–518. [75] M. E. Wall, A. Rechtsteiner, and L. M. Rocha, “Singular value decomposition and principal component analysis,” in A Practical Approach to Microarray Data Analysis, D. P. Berrar, W. Dubitzky, and M. Granzow, Eds. Norwell, MA: Kluwer, 2003, pp. 91–109.
[76] A. D. Wilson, A. F. Bobick, and J. Cassell, “Temporal classification of natural gesture and application to video coding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 1997, pp. 948–954. [77] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools With Java Implementations. San Francisco, CA: Morgan Kaufmann, 2000. [78] L. Wu, S. L. Oviatt, and P. R. Cohen, “Multimodal integration—A statistical view,” IEEE Trans. Multimedia, vol. 1, no. 4, pp. 334–341, 1999. [79] M. Yeasin, B. Bullot, and R. Sharma, “From facial expression to level of interests: A spatio-temporal approach,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2004, pp. 922–927. [80] Z. Zeng, Y. Fu, G. I. Roisman, Z. Wen, Y. Hu, and T. S. Huang, “One-class classification for spontaneous facial expression analysis,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recog., 2006, pp. 281–286. [81] Y. Zhang and Q. Ji, “Active and dynamic information fusion for facial expression understanding from image sequences,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 5, pp. 699–714, May 2005.
Hatice Gunes (S’99–M’07) received the Ph.D. degree in computing sciences from the University of Technology Sydney (UTS), Broadway, Australia, in 2007. She is currently a Research Associate with UTS, where she works on an Australian Research Council–funded Linkage Project for UTS and iOmniscient Pty Ltd. Her research interests include video analysis and pattern recognition, with applications to affective computing and video surveillance. Dr. Gunes is a member of the Association for Computing Machinery.
Massimo Piccardi (A’92–M’99–SM’05) received the M.Eng. degree in electronics engineering and the Ph.D. degree in computer engineering/computer science from the University of Bologna, Bologna, Italy, in 1991 and 1995, respectively. He is currently with the University of Technology Sydney, Broadway, Australia, where he is a Professor with the Faculty of Information Technology and the Director of the Advanced Video Surveillance Program. He is the author or coauthor of more than 100 scientific papers in international journals and conference proceedings. He is also the recipient of several competitive research grants mainly in the area of video surveillance. His research interests include computer vision, pattern recognition, and video analysis, with main applications to video surveillance, human-computer interaction, and multimedia. Dr. Piccardi is a Senior Member of the IEEE Computer Society and a Member of the International Association for Pattern Recognition.