Natural Emotion Elicitation for Emotion Modeling in Child-Robot Interactions

Weiyi Wang¹, Georgios Athanasopoulos¹, Selma Yilmazyildiz¹, Georgios Patsis¹, Valentin Enescu¹, Hichem Sahli¹,², Werner Verhelst¹, Antoine Hiolle³, Matthew Lewis³, Lola Cañamero³

¹ Dept. of Electronics and Informatics (ETRO) - AVSP, Vrije Universiteit Brussel (VUB), Belgium
² Interuniversity Microelectronics Centre (IMEC), Belgium
³ Embodied Emotion, Cognition and (Inter-)Action Lab, School of Computer Science & STRI, University of Hertfordshire, UK
{wwang, gathanas, syilmazy, gpatsis, venescu, hichem.sahli, wverhels}@etro.vub.ac.be, [email protected], [email protected], [email protected]

Abstract

Obtaining spontaneous emotional expressions is the very first and vital step in affective computing studies, for both psychologists and computer scientists. However, it is quite challenging to record them in real life, especially when certain modalities are required (e.g., a 3D representation of the body). Traditional elicitation and capturing protocols either introduce awareness of the recording, which may impair the naturalness of the behaviors, or cause too much information loss. In this paper, we present natural emotion elicitation and recording experiments set in child-robot interaction scenarios. Several state-of-the-art technologies were employed to acquire the multi-modal expressive data that will be further used for emotion modeling and recognition studies. The obtained recordings exhibit the expected emotional expressions.

Index Terms: child-robot interaction, natural emotion elicitation, multi-modal recording

1. Introduction

Affective computing [1] has become a popular interdisciplinary topic over the last two decades. Developing emotion recognition algorithms requires collecting reliable data. This task is very challenging due to the practical limitations of emotion elicitation methods, capturing devices, recording setups, etc., especially under naturalistic settings.

In general, traditional emotion elicitation approaches employ visual and/or auditory stimuli to induce certain expressions. The most widely used method is based on films. Gross et al. selected two films for each of eight emotional states (amusement, anger, contentment, disgust, sadness, surprise, the neutral state and fear) [2], and evaluated their efficacy by assessing discriminability, discreteness and similarity. Many other movie sets were proposed under a similar concept, but the main drawback lies in their static and passive nature: in a non-interactive environment, participants hardly express themselves externally, especially via the body. Dyadic interaction tasks have also attracted much research by introducing communication between participants [3]. The emotional expressions are indeed enriched, mainly through facial cues; however, it is relatively difficult to design the conversation scope so as to successfully trigger the target emotions. Other stimuli such as music [4], gaming interfaces [5] and memory recall [6] have also been used, but only under specific contexts and constraints.

The rapid development of humanoid robots with interaction capabilities brings a great opportunity to elicit emotions, especially from children. Unlike interacting with people, it appears more acceptable for children to play with robots of reduced autonomy. This allows us to constrain the recording environment, which makes data processing and analysis easier. Sanghvi et al. [7] used an iCat robot in a chess-playing scenario to elicit and evaluate affective postures and body motion of children. Batliner et al. [8] designed a scenario in which children instructed the AIBO robot to fulfil specific tasks; the verbal and vocal information was recorded for emotion analysis.

In this work, we are interested in the speech, facial and body emotional channels, and present experiments that aim at eliciting and capturing the spontaneous emotional expressions of children in child-robot interaction scenarios. Multiple modalities (accurate 3D body representation, 2D body postures and gestures, facial expressions and emotional speech) were recorded without the children being aware of it. To the best of our knowledge, this is the first published work on children's natural emotion elicitation and recording covering these modalities.

The rest of the paper is organized as follows: Section 2 introduces the overview and structure of our experiments, as well as the interfaces and modules we used. Section 3 describes the device setup and experiment protocol. We then give some results and illustrations of the experiments in Section 4 and discuss the challenges and prospects in Section 5.

2. Emotion Elicitation and Interaction Scenario

The whole experiment was set in a child-robot interaction scenario. The humanoid robot NAO [9], equipped with various control and behavioural modules, was used. We designed two sets of experiments running in parallel at the same location (but in different rooms), referred to as Setup-A and Setup-B, respectively. With Setup-A, the main purpose was to elicit natural emotional expressions and record them in different channels. Setup-B aimed at evaluating the perception of, and response to, the emotional gibberish produced by the robot. The children were invited to play with the robot during the week. In order to guarantee that the children were not aware of the recording, and could thus behave naturally, they were told that the capturing devices were used to help the robot better sense the world.

2.1. Scenarios

• Setup-A: Each child played several rounds of the 'Snakes & Ladders' game against the NAO robot, with different game settings and emotional profiles (detailed in Section 2.2.1). The Wizard-of-Oz (WoZ) technique was used to control the robot. Moreover, the gestures and movements of the robot were designed according to a specific emotionally expressive profile reacting to the events occurring during the game. In this setup we recorded the body skeleton in 3D, the face video stream in HD, the frontal body video stream in HD, audio streams from both the child and the robot, and a video stream of the whole scene. A dual-Kinect setup was used to capture the body skeleton.

• Setup-B: Each child watched selected animation movie clips together with the robot. Each movie clip was rich in one of the basic emotions, namely anger, disgust, fear, happiness, sadness or surprise. At the end of each movie clip, the robot uttered its emotion to the child in affective gibberish speech (see Section 2.2.4 for details) and the child had to rate the recognized emotion on the valence and arousal dimensions using the SAM measurement tool [10]. There were two different robot profiles: having the same (in-line case) or a contradictory (confusion case) emotion with respect to the main character in the movie clip. In this setup, we recorded the frontal body video stream (in a sitting position) and the audio stream.

2.2. Modules

2.2.1. Snakes & Ladders Game

The game was developed as a networked module of the robot. The game engine communicates wirelessly with the robot, so that the embedded robot controller reacts to the real-time events occurring during the game. The Snakes and Ladders game itself follows the traditional rules: two or more players take turns rolling one or two dice, and their assigned tokens progress on a board according to the throw. The first one to reach the end of the board wins. Moreover, if a player lands on the bottom of a ladder, its token climbs up the ladder, helping the player advance further. If a token lands on the head of a snake, the token slides down the snake and ends up on the tile where the tail of the snake is located.

The board of the game was displayed on a big TV monitor (see Section 3 for the device arrangement), and the robot looked at the screen and the child alternately during the game in a natural way. Rolling the dice was performed in two different ways. For the first half of the experiments, the child was asked to use a remote to throw the dice. The robot was holding a remote as well, pretending to have to do the same as the child to play the game. In order to eliminate the artefacts of the 'dice throwing' movements for automatic bodily behavior analysis, in the remaining sessions we told the children that we had improved the game so that it was now capable of recognizing the voice command 'Blip-Blip' to throw the dice. Apart from the verbal communication via the WoZ (described in Section 2.2.2), the robot occasionally produced bodily expressions following specific game events such as landing on a snake or a ladder, or winning or losing the game.
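As a rough, purely illustrative sketch of the game logic described above (not the actual ALIZ-E game module), the Python snippet below resolves snakes, ladders and the winning condition for each throw and pushes the resulting game event to a placeholder robot-notification function. The board layout, event names and notification call are hypothetical, and the real games additionally used predefined dice throws, as described further below.

```python
import random

# Hypothetical board layout: tile -> destination tile
SNAKES = {98: 12, 64: 27, 45: 7}
LADDERS = {3: 38, 21: 60, 51: 93}
BOARD_END = 100

def roll_dice(n_dice=1):
    return sum(random.randint(1, 6) for _ in range(n_dice))

def step(position, throw):
    """Advance a token and return (new_position, event) for the robot controller."""
    position += throw
    if position >= BOARD_END:
        return BOARD_END, "win"
    if position in SNAKES:
        return SNAKES[position], "snake"
    if position in LADDERS:
        return LADDERS[position], "ladder"
    return position, "move"

def notify_robot(player, event):
    # Placeholder for the wireless game-engine -> robot-controller message
    print(f"[game -> robot] player={player} event={event}")

# Minimal game loop for two players ("child" and "robot")
positions = {"child": 0, "robot": 0}
while all(p < BOARD_END for p in positions.values()):
    for player in positions:
        pos, event = step(positions[player], roll_dice())
        positions[player] = pos
        notify_robot(player, event)
        if event == "win":
            break
```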

The robot would, randomly, either nod or take a happy body posture following positive events, and would shake its head or clap its thigh after negative events. A procedural noise [11] (adapted from Perlin noise) was applied to all the upper body joints so that the robot would not stay still, giving it a life-like appearance. All these elements, together with the WoZ communication (see Section 2.2.2), ensured that the child would think the robot was playing the game rather than being controlled.

To better serve the emotion elicitation purpose, we manually scripted the unfolding of four games (the child and the robot would each win two of them, and all the dice throws were predefined). The game steps were designed to be dramatic and therefore to produce a clear reaction from the child, either positive or negative. For instance, the child would be leading the game at the beginning and then land on a long snake right before the end point. Alternatively, in another predefined game, the child's token would be trailing for most of the game, then would "luckily" hit ladders and obtain good dice rolls to overtake the robot and win the game.

Moreover, the robot could display two different affective profiles while playing the game: a competitive one, where the robot would display self-centered emotions, and a supportive one, focusing on the child's performance. The profiles were displayed alternately during the sessions. The competitive profile made the robot react strongly to positive and negative events occurring to itself, making the robot appear more involved in the game and eager to win it; its behaviours and gestures appeared more aroused and energetic. Following the literature on empathy and sympathy and their importance in peer bonding and fostering trust [12], the robot using the supportive profile displayed and expressed behaviours suggesting a focused interest in the outcomes for the child. The robot would display disappointment or sadness when the child hit a snake or lost, while displaying positive expressions when the child was leading or winning the game. Additionally, the expressions of the wizard were consistent with the specific profile of the robot used at the time.

2.2.2. Voice Modification for WoZ

In order to efficiently elicit natural emotions through the interaction scenario described above, we hypothesize that a believable interaction should be maintained and hence the robot's verbal communication should be as natural as possible. With this in mind, we opted not to base the responses of the robot on a pre-defined dialogue scheme. Instead, in our approach, the robot's responses evolve on an ad-hoc basis following the development of the child-robot interaction. Therefore, using a TTS (text-to-speech) system for producing the robot's verbal (as well as non-verbal) responses is not sufficient, due to two main limitations:

• the quality of synthesized speech (e.g., lack of natural emotional display and reliable non-verbal capabilities);

• the need to type in the ad-hoc responses, thus introducing unnatural latency in the verbal output.

To overcome these limitations, we have chosen an approach which allows the streaming of the WoZ operator's speech to the robot in real time, with the voice being modified so that it resembles the robot's voice. We have successfully applied this approach in previous experiments [13]. The voice modification is achieved by spectral shifting of the original speech signal and is implemented using the robust time-scaling WSOLA algorithm [14]. It is worth noting that this approach does not affect the speech rate, while the vocal emotions of the WoZ operator are preserved.
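Reference [14] describes the WSOLA algorithm behind this modification; the snippet below is only a minimal sketch of the general idea, assuming a simplified WSOLA-like time-stretch followed by resampling so that the spectrum shifts up while the duration (and hence the speech rate) is preserved. The frame length, search range, shift factor and the unnormalized correlation criterion are illustrative choices, not the parameters of the system used with the robot.

```python
import numpy as np
from scipy.signal import resample

def wsola_stretch(x, alpha, frame_len=1024, search=256):
    """Simplified WSOLA-like time-scale modification (alpha > 1 lengthens the signal)."""
    x = np.asarray(x, dtype=float)
    hop_out = frame_len // 2                      # synthesis hop (50% overlap, Hann window)
    hop_in = max(1, int(round(hop_out / alpha)))  # analysis hop
    win = np.hanning(frame_len)
    n_frames = (len(x) - frame_len - 2 * search - hop_out) // hop_in
    if n_frames <= 0:
        return x.copy()
    y = np.zeros(n_frames * hop_out + frame_len)
    norm = np.zeros_like(y)
    target = x[:frame_len]  # desired natural continuation of the output written so far
    for i in range(n_frames):
        center = i * hop_in + search
        # within +/- `search` samples, pick the input segment most similar
        # (unnormalized cross-correlation) to the natural continuation
        region = x[center - search:center + search + frame_len]
        offset = int(np.argmax(np.correlate(region, target, mode="valid"))) - search
        start = center + offset
        seg = x[start:start + frame_len]
        pos = i * hop_out
        y[pos:pos + frame_len] += win * seg
        norm[pos:pos + frame_len] += win
        # the frame that naturally follows the chosen segment becomes the next target
        target = x[start + hop_out:start + hop_out + frame_len]
    norm[norm < 1e-8] = 1.0
    return y / norm

def robot_voice(x, factor=1.4):
    """Shift the spectrum up by `factor` while keeping the original duration."""
    stretched = wsola_stretch(x, alpha=factor)  # longer signal, same pitch
    return resample(stretched, len(x))          # compress back: original length, spectrum scaled up
```

In a streaming WoZ setting the same operation would be applied to short, overlapping buffers of the microphone signal rather than to a whole recording.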

Figure 1: The WoZ GUI interface.


Figure 2: Illustration of Setup-A.

2.2.3. Emotional Gestures and Controlling GUI

Before the start of the interaction in Setup-A described above, a short familiarization phase took place so that the children would feel comfortable interacting with the robot. During this phase, the robot introduced itself and some light talk was initiated with each child (e.g., asking the child's name, age, etc.). In order to familiarize the children with the robot's movements and emotional expressions, animated behaviors were implemented. These behaviors included simple gestures (e.g., standing up, waving hello, nodding, etc.), emotional postures (similar to the ones implemented in [15]), as well as control moves (i.e., walking, rotating, etc.) that allowed the WoZ to correct the robot's position during the interaction. The desired action was selected by the WoZ through a simple graphical interface (see Figure 1). A similar interface was also used to control the robot and to utter the gibberish speech in Setup-B.

2.2.4. Gibberish Generation

Gibberish speech consists of vocalizations of meaningless strings of speech sounds, and thus has no semantic meaning. Its advantage in affective computing lies in the fact that no understandable text has to be pronounced and therefore only affect is conveyed. Additionally, it was shown in [16] that such speech without semantic information contributed significantly to the emotion expression of the robotic agent. Therefore, we aimed at further evaluating the expressiveness of gibberish speech in a real child-robot interaction scenario and at observing the affect perception and responses of the children in Setup-B.

Gibberish speech was generated in two main steps [17]. The first step was creating a gibberish-like text. This is realized by replacing the vowel nuclei and consonant clusters of dialogue text scripts in accordance with their natural probability distribution in a language. In the second step, these gibberish text scripts were vocalized by recording a voice actor portraying emotions. It is then possible to synthesize more variations of the utterances with a concatenative technique, which corresponds to swapping the units of an utterance with other units from the database of the related emotion. For the experiment in Setup-B, two samples of about 7-10 seconds were generated for each emotion. Some additional sentences were produced in a neutral tone to be used in greeting the child and familiarizing him/her with the gibberish speech. The samples were played through the robot by the WoZ at predefined moments during the interaction scenario in Setup-B.
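The following is a minimal sketch of the first step only (creating gibberish-like text), assuming tiny, hypothetical frequency tables for consonant clusters and vowel nuclei. The unit inventories and the naive vowel/consonant split below are illustrative and are not the procedure or data of EMOGIB [17]; the second, concatenative step based on the recorded actor units is not shown.

```python
import random
import re

# Hypothetical unit inventories with relative frequencies for the target language
CONSONANT_CLUSTERS = {"b": 0.2, "k": 0.2, "t": 0.15, "bl": 0.15, "pr": 0.15, "sh": 0.15}
VOWEL_NUCLEI = {"a": 0.3, "i": 0.25, "o": 0.2, "u": 0.15, "ee": 0.1}

def sample(dist):
    units, weights = zip(*dist.items())
    return random.choices(units, weights=weights, k=1)[0]

def gibberize_word(word):
    """Replace consonant clusters and vowel nuclei, keeping the word's CV skeleton."""
    out = []
    for chunk in re.findall(r"[aeiou]+|[^aeiou\W]+|\W+", word, flags=re.IGNORECASE):
        if chunk[0].lower() in "aeiou":
            out.append(sample(VOWEL_NUCLEI))
        elif chunk[0].isalpha():
            out.append(sample(CONSONANT_CLUSTERS))
        else:
            out.append(chunk)  # keep punctuation and spacing untouched
    return "".join(out)

def gibberize(text):
    return " ".join(gibberize_word(w) for w in text.split())

print(gibberize("Oh no, I landed on a snake again!"))  # output differs on every run
```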

3. Experiment Setup

During the interaction, the child was alone with the robot. Due to the requirement of frequent verbal communication between the wizard and the child in Setup-A, we arranged another room, from where the wizard could also monitor the interaction and control the game procedure. In Setup-B, as no verbal communication was needed from the wizard, the monitoring and robot control were performed from the open space outside the room.

• Setup-A: The arrangement is illustrated in Figure 2. The game was displayed on the TV screen and the robot was placed on a table next to the TV so that it was at a similar height to the children. HD Camera 1, hidden behind a black curtain, was used to capture the facial expressions. Two Microsoft Kinect sensors were placed at the frontal left and right corners, at an angle of 90° to each other. The dual-Kinect setup ensured an accurate capture of body movements in 3D (a conceptual sketch of fusing two such viewpoints is given at the end of this section). The recorded data was post-processed with the iPi Mocap Studio software [18] to extract the 3D skeleton representations. HD Cameras 2 and 3 captured the frontal full body of the child and the whole scene, respectively. All devices were connected to the WoZ room, where the wizard could control the game, monitor the interaction and communicate with the child through the robot. Note that most of the devices were visible to the children; in order to avoid awareness of the recording, the wizard told the children that these devices were used to help the robot better sense the environment. In total, 12 children participated in the experiment in Setup-A (aged between 7 and 9, 5 females and 7 males), and each child had three sessions during the five days of recordings. In each session, the child played the Snakes & Ladders game twice, with different affective profiles of the robot and different game steps (the child would win one of them). The wizard was experienced in communicating with young children, and was sufficiently trained in our previous experiments to be familiar with the purpose of our research, the robot-controlling GUI and the game procedure.

• Setup-B: The child and the robot watched the movie clips sitting next to each other on different chairs, and the clips were displayed on a TV screen (Figure 3). The height of the robot was adjusted to the child. The camera was placed at a location allowing the capture of frontal body streams of both the child and the robot.

Figure 4: Examples of the recordings in Setup-A. The four rows show the frontal face, the frontal full body, the reconstructed 3D skeleton and the interaction scene, respectively.

Figure 5: Examples of the recordings in Setup-B.

Figure 3: Illustration of Setup-B.

The camera and the microphone were connected to a computer outside the room, where a wizard monitored and controlled the robot as well as the overall experimental procedure. In total, 10 children participated in the experiment in Setup-B (aged between 7 and 9, 4 females and 6 males). Two series of sessions were held: one for the in-line case and one for the confusion case. All children attended at least one of the sessions. Prior to the first session, each child went through a short training on how to rate along the valence and arousal dimensions using SAM. The training consisted of verbal instruction, a joint exercise with the trainer, and an individual exercise in which the child had to rate affective pictures using SAM.
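In our experiments, the 3D skeletons of Setup-A were reconstructed offline with the iPi Mocap Studio software [18]. Purely as a conceptual illustration of why two calibrated viewpoints help against occlusions, the sketch below fuses per-joint estimates from two sensors, averaging where both track a joint and falling back to the unoccluded view otherwise; the calibration transform, array shapes and validity flags are assumptions for illustration, not the software's actual processing.

```python
import numpy as np

def to_reference_frame(joints, R, t):
    """Map 3xN joint positions from one sensor's frame into the reference frame,
    given a hypothetical extrinsic calibration (rotation R, translation t)."""
    return R @ joints + t[:, None]

def fuse_skeletons(joints_a, joints_b, valid_a, valid_b, R_b, t_b):
    """Fuse per-joint 3D estimates from two sensors (sensor A is the reference).

    joints_*: 3xN float arrays in each sensor's own coordinate frame.
    valid_*:  length-N boolean arrays, True where that sensor tracks the joint.
    """
    joints_b = to_reference_frame(joints_b, R_b, t_b)
    fused = np.full_like(joints_a, np.nan, dtype=float)
    both = valid_a & valid_b
    only_a = valid_a & ~valid_b
    only_b = valid_b & ~valid_a
    fused[:, both] = 0.5 * (joints_a[:, both] + joints_b[:, both])  # average when both views track the joint
    fused[:, only_a] = joints_a[:, only_a]                          # fall back to the unoccluded view
    fused[:, only_b] = joints_b[:, only_b]
    return fused  # NaN where the joint is lost in both views
```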

4. Experimental Results

For Setup-A, we recorded 36 sessions in total, each lasting 20 to 30 minutes. According to the feedback we received from the children, they enjoyed playing with the NAO robot. From visual inspection of the recordings, we confirmed that our initial target was met: a wide range of emotions was elicited spontaneously and well captured during the interactions. In Setup-B, we recorded 18 sessions of about 15 to 30 minutes each. The children could mostly perceive the affective information from the gibberish speech generated by the robot, and frequently showed corresponding reactions, mainly via facial expressions. All the children reported that they would like to watch movie clips with the robot again in the future, which indicates that they enjoyed their participation.

We are currently post-processing and annotating the recorded data. From the dual-Kinect captures we are extracting the skeleton. Figures 4 and 5 give some examples of the recorded data in Setup-A and Setup-B, showing rich expressions from both the face and the full body. The movement of the 3D skeletons is very smooth and stable. Most importantly, the postures and gestures were well captured even when some body parts were occluded.

For the data annotation we generated synchronized three-view videos, as illustrated in Figure 6, and used GTrace [19] for the arousal and valence [20] annotation. Figure 7 illustrates an example of the valence values given by four raters. As can be seen, the four ratings have similar trends and patterns, which indicates a high agreement on the perceived valence of the child, based on the combination of all recorded channels.
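The agreement above was assessed by visual inspection. Purely as an illustration (not the measure used in this work), one simple way to quantify agreement between such continuous traces is to resample each rater's output onto a common time grid and compute the mean pairwise Pearson correlation, as sketched below with synthetic traces from four hypothetical raters.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_correlation(traces, n_points=500):
    """Mean pairwise Pearson correlation between continuous annotation traces.

    traces: list of (times, values) pairs, one per rater, possibly sampled
    at different time instants by the annotation tool.
    """
    t0 = max(t[0] for t, _ in traces)
    t1 = min(t[-1] for t, _ in traces)
    grid = np.linspace(t0, t1, n_points)
    resampled = [np.interp(grid, t, v) for t, v in traces]
    corrs = [np.corrcoef(a, b)[0, 1] for a, b in combinations(resampled, 2)]
    return float(np.mean(corrs))

# Synthetic example: four raters following a shared valence trend plus noise
rng = np.random.default_rng(0)
t = np.linspace(0, 60, 600)          # one minute of annotation
base = np.sin(0.1 * t)               # shared perceived valence trend
traces = [(t, base + 0.1 * rng.standard_normal(t.size)) for _ in range(4)]
print(f"mean pairwise r = {mean_pairwise_correlation(traces):.2f}")
```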


Figure 6: Three-view video used for emotion annotation.

Figure 7: Annotation of valence (indicating positive and negative affects) by four raters.


5. Discussion

Emotion elicitation and emotional data acquisition is a crucial step for emotion modeling. Previously published work in this area mainly focused either on acted expressions or on natural but very limited signals. We therefore proposed a child-robot interaction scenario, in which the child behaves spontaneously during the interaction and game competition with a humanoid robot. With such a scenario, audio and visual information can be recorded without the child being aware of it. In particular, the dual-Kinect setup and the 3D skeleton reconstruction ensure a precise body representation that may convey important affective cues, as indicated by psychological studies [21, 22]. Our initial experiments have demonstrated the effectiveness of the emotion elicitation scenario in terms of rich, naturalistic affective expressions. The next steps of our research will focus on finalizing the valence/arousal annotation, and on evaluating and improving our different emotion recognition modules [23, 24, 25, 26, 27].

6. Acknowledgement

This work was carried out under the EU-FP7 ALIZ-E Project (No. 248116) and was also supported by CSC grant 2011688012. We would like to thank the members of the Beeldenstorm and KIK project in Brussels, Belgium, who provided us with the opportunity to implement our data acquisition scenario. We also thank the participating children and their parents.

7. References

[1] R. Picard, "Affective Computing," MIT, Tech. Rep. 321, 1995.
[2] J. J. Gross and R. W. Levenson, "Emotion Elicitation Using Films," Cognition and Emotion, vol. 9, no. 1, pp. 87–108, 1995.
[3] N. A. Roberts, J. L. Tsai, and J. A. Coan, "Emotion Elicitation Using Dyadic Interaction Tasks," in Handbook of Emotion Elicitation and Assessment, 2007, pp. 106–123.
[4] K. R. Scherer and M. R. Zentner, "Emotional Effects of Music: Production Rules," in Music and Emotion: Theory and Research, 2001.
[5] N. Savva and N. Bianchi-Berthouze, "Automatic Recognition of Affective Body Movement in a Video Game Scenario," in International Conference on Intelligent Technologies for Interactive Entertainment (INTETAIN), 2012, pp. 149–158.
[6] G. Chanel, J. J. Kierkels, M. Soleymani, and T. Pun, "Short-Term Emotion Assessment in a Recall Paradigm," International Journal of Human-Computer Studies, vol. 67, no. 8, pp. 607–627, 2009.
[7] J. Sanghvi, G. Castellano, A. Paiva, and P. W. McOwan, "Automatic Analysis of Affective Postures and Body Motion to Detect Engagement with a Game Companion," in Proc. of the ACM/IEEE International Conference on Human-Robot Interaction, 2011.
[8] A. Batliner, C. Hacker, S. Steidl, and E. Nöth, "'You Stupid Tin Box' - Children Interacting with the AIBO Robot: A Cross-Linguistic Emotional Speech Corpus," in Proc. of LREC, 2004, pp. 171–174.
[9] "Aldebaran Robotics." [Online]. Available: http://www.aldebaran.com
[10] P. Lang and M. Bradley, "Measuring Emotions: The Self-Assessment Manikin and the Semantic Differential," Journal of Behaviour Therapy & Experimental Psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
[11] A. Beck, A. Hiolle, and L. Cañamero, "Using Perlin Noise to Generate Emotional Expressions in a Robot," in Proceedings of CogSci 2013 (The 2013 Annual Meeting of the Cognitive Science Society), 2013.
[12] S. Preston and F. de Waal, "Empathy: Its Ultimate and Proximate Bases," Behavioral and Brain Sciences, vol. 25, pp. 1–72, 2002.
[13] S. Yilmazyildiz, G. Athanasopoulos, G. Patsis, W. Wang, M. Oveneke, L. Latacz, W. Verhelst, H. Sahli, D. Henderickx, B. Vanderborght, E. Soetens, and D. Lefeber, "Voice Modification for Wizard-of-Oz Experiments in Robot-Child Interaction," in Workshop on Affective Social Speech Signals, 2013.
[14] W. Verhelst and M. Roelands, "An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech," in Proc. of ICASSP'93, vol. 2, 1993, pp. 554–557.
[15] A. Beck, B. Stevens, K. Bard, and L. Cañamero, "Emotional Body Language Displayed by Artificial Agents," ACM Transactions on Interactive Intelligent Systems, Special Issue on Affective Interaction in Natural Environments, vol. 2, no. 1, pp. 1–29, 2012.
[16] S. Yilmazyildiz, D. Henderickx, B. Vanderborght, W. Verhelst, E. Soetens, and D. Lefeber, "Multi-Modal Emotion Expression for Affective Human-Robot Interaction," in Workshop on Affective Social Speech Signals (WASSS 2013), 2013.
[17] S. Yilmazyildiz, D. Henderickx, B. Vanderborght, W. Verhelst, E. Soetens, and D. Lefeber, "EMOGIB: Emotional Gibberish Speech Database for Affective Human-Robot Interaction," in Affective Computing and Intelligent Interaction, 2011, pp. 163–172.
[18] "iPi Mocap Studio." [Online]. Available: http://ipisoft.com/
[19] R. Cowie and M. Sawey, "GTrace - General Trace Program from Queen's, Belfast," pp. 1–23, 2011.
[20] J. A. Russell, "A Circumplex Model of Affect," Journal of Personality & Social Psychology, vol. 39, pp. 1161–1178, 1980.
[21] H. G. Wallbott, "Bodily Expression of Emotion," European Journal of Social Psychology, vol. 28, no. 6, pp. 879–896, 1998.
[22] B. de Gelder, "Why Bodies? Twelve Reasons for Including Bodily Expressions in Affective Neuroscience," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, pp. 3475–3484, 2009.
[23] W. Wang, V. Enescu, and H. Sahli, "Towards Real-Time Continuous Emotion Recognition from Body Movements," in Human Behavior Understanding, Lecture Notes in Computer Science (LNCS), 2013, pp. 235–245.
[24] L. Li, Y. Zhao, D. Jiang, Y. Zhang, F. Wang, I. Gonzalez, V. Enescu, and H. Sahli, "Hybrid Deep Neural Network–Hidden Markov Model (DNN-HMM) Based Speech Emotion Recognition," in Humaine Association Conference on Affective Computing and Intelligent Interaction, 2013, pp. 312–317.
[25] D. Jiang, Y. Cui, X. Zhang, P. Fan, I. Gonzalez, and H. Sahli, "Audio Visual Emotion Recognition Based on Triple-Stream Dynamic Bayesian Network Models," in Affective Computing and Intelligent Interaction, Lecture Notes in Computer Science, 2011, pp. 609–618.
[26] I. Gonzalez, H. Sahli, V. Enescu, and W. Verhelst, "Context-Independent Facial Action Unit Recognition Using Shape and Gabor Phase Information," in Affective Computing and Intelligent Interaction, Lecture Notes in Computer Science, 2011, pp. 548–557.
[27] F. Wang, W. Verhelst, and H. Sahli, "Relevance Vector Machine Based Speech Emotion Recognition," in Affective Computing and Intelligent Interaction, Lecture Notes in Computer Science, 2011, pp. 111–120.