Affective Intelligence: A Novel User Interface Paradigm

Barnabas Takacs

Digital Elite Inc., Los Angeles, USA
[email protected]
www.digitalElite.net

Abstract. This paper describes an advanced human-computer interface that combines real-time, reactive, and high-fidelity virtual humans with artificial vision and communicative intelligence to create a closed-loop interaction model and achieve an affective interface. The system, called the Virtual Human Interface (VHI), utilizes a photo-real facial and body model as a virtual agent to convey information beyond speech and actions. Specifically, the VHI uses a dictionary of non-verbal signals, including body language, hand gestures, and subtle emotional displays, to support verbal content in a reactive manner. Furthermore, its built-in facial tracking and artificial vision system allows the virtual human to maintain eye contact, follow the motion of the user, and even recognize when somebody joins him or her in front of the terminal and act accordingly. Additional sensors allow the virtual agent to react to touch, voice, and other modalities of interaction. The system has been tested in a real-world scenario in which a virtual child reacted to visitors in an exhibition space.

1 Introduction

To mimic the quality of everyday human communication, future computer interfaces must combine the benefits of high visual fidelity with conversational intelligence and, above all, the ability to modulate the emotions of their users. During the past several decades, researchers have conducted countless studies on agents and human animation in order to create an HCI that works through the natural means of interaction, such as words, gestures, glances, and body language, instead of traditional computer devices such as the keyboard and mouse. To address this problem we have been developing a novel system that uses photo-realistic, high-fidelity human representations and a natural model of human dialog. Photo-realistic virtual humans, owing to their similarity to their real-life counterparts, make a powerful affective interface and will likely be the primary means of communication between computers and humans in the future. Our solution builds upon many years of interdisciplinary research to create a closed-loop model of interaction in which the user's internal state (emotion, level of attention, etc.) is constantly monitored and directly influenced by the animated character with the purpose of creating an emotional bond. This emotional bond then acts as a catalyst that helps turn information into knowledge. In other words, our advanced user interface draws on emotions to help its users (most frequently students) in the learning process by intelligently tailoring its workload and constantly adapting its presentation strategies. Hence the name affective intelligence.

2 State-of-the-Art

Most research in the field of conversational agents and animated humans has thus far employed low-resolution (i.e., small polygon count) virtual characters that were relatively simple to animate in real time [1]. Many of these animation systems initially addressed purposes other than the needs of human-computer interaction. However, the requirements of facial animation, and especially speech synthesis, demand a different underlying architecture that can effectively model how real faces move and change [2-4]. As a result, research began to focus on creating autonomous agents that could exhibit rich personalities and interact in virtual worlds inhabited by other characters [5,6]. To provide the illusion of a life-like character, researchers have developed detailed emotional and personality models that can control the animation channels as a function of the virtual human's personality, mood, and emotions [7,8]. However, real-time interaction with these virtual characters posed an extra set of technical challenges in terms of the speed, computational power, and visual quality required to make the user believe that he or she is interacting with a living creature. To achieve this goal, researchers eventually replaced pre-crafted animated actions with intelligent behavior modules that could control speech, locomotion, gaze, blinks, gestures (including various postures), and interaction with the environment. Our VHI system builds on this research by employing photo-realistic virtual humans, providing users with information, learning services, and entertainment in a personalized and adaptive manner [9-11].

3 The Virtual Human Interface (VHI)

The Virtual Human Interface (VHI) system we have developed represents a shift in the traditional paradigm of HCI and conversational agents, as it places the user in a closed-loop dialog situation where affect plays a key role during the interactive process. Specifically, while "traditional" HCI interfaces are designed to provide help on how to do things, the photo-real digital humans and synthetic 3D elements in the VHI system serve to engage the user in a process of emotional responses, which in turn opens a channel to "engrave" knowledge and modify their behavior. From a practical point of view, the VHI implements this functionality with the help of high-fidelity virtual humans "who" can talk, act, emote, and express a wide range of facial expressions as well as body gestures [11]. The underlying affective mechanism of photo-real digital humans relies on the tremendous power of facial information processing in the brain and therefore works fundamentally differently from stylized humanoids or cartoon characters, whose mechanisms are based on self-projection. As such, the VHI provides a novel means to unlock and further utilize the learning capability residing in each and every one of us. By conveying subtle facial signals, as demonstrated in Figure 1, positive reinforcement allows us to access and activate both our declarative memory (i.e., our memory of information and events) and our procedural memory (i.e., the knowledge of how to ride a bike) while engaging in the process of emotional responses in an entertaining fashion.

Fig. 1. Subtle changes in the emotional display of photo-real virtual humans during interaction with the user.

The implementation of the VHI system involved many different modules for animation, perception, compositing, and creating synthetic environments in which to place our virtual characters, as shown in Figure 2. Specifically, our intelligent digital human "lives" within the confines of a high-throughput interactive virtual world. To create such an environment one needs to go beyond the traditional methodology of real-time rendering and support a variety of input and output devices that facilitate the interaction process and create a fully immersive experience. Examples include a head-mounted display system (HMD), multiple 3- and 6-DOF trackers, a low-cost hand glove, a special-purpose eye tracker, and standard devices such as a joystick. In addition to exploiting high-end graphics cards to create the highest possible visual realism, the VHI system also features live video input, real-time image processing, face and object recognition with tracking capabilities, chroma-key filters, an augmented reality interface, and even panoramic 360° backgrounds for full immersion. Finally, beyond the visual experience, the system also supports quadraphonic 3D sound sources with simulated effects of speed and motion as the user moves around in the virtual space. Further details on the architecture and its major elements can be found in [15].
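To make the module structure above more concrete, the following is a minimal, hypothetical sketch of how a per-frame update loop for such a system might be organized. The class and method names are illustrative placeholders rather than the actual VHI API; real device drivers, vision code, and renderers would replace the stubs.

```python
# Minimal sketch of a per-frame update loop for a VHI-style system.
# All names below are illustrative placeholders, not the actual VHI API.

import time


class SensorHub:
    """Polls the input devices mentioned above (webcam, trackers, glove, etc.)."""

    def poll(self) -> dict:
        # A real implementation would read the camera frame, 3/6-DOF tracker
        # poses, glove joint angles, eye-tracker gaze and joystick state.
        return {"video_frame": None, "tracker_poses": {}, "gaze": None}


class PerceptionModule:
    """Face and object detection/tracking on the live video stream."""

    def process(self, video_frame) -> dict:
        # Placeholder: would return detected faces, identities and positions.
        return {"faces": [], "objects": []}


class VirtualHuman:
    """High-fidelity character driven by speech, gaze, gesture and emotion."""

    def update(self, percepts: dict, dt: float) -> None:
        # Placeholder: would update gaze targets, facial expression blends,
        # body animation and speech playback from the current percepts.
        pass


def run_vhi_loop(seconds: float = 1.0, fps: float = 30.0) -> None:
    sensors, perception, agent = SensorHub(), PerceptionModule(), VirtualHuman()
    dt = 1.0 / fps
    t_end = time.time() + seconds
    while time.time() < t_end:
        raw = sensors.poll()                                 # 1. acquire sensor data
        percepts = perception.process(raw["video_frame"])    # 2. vision processing
        agent.update(percepts, dt)                           # 3. drive the character
        # 4. rendering, compositing and 3D audio would happen here
        time.sleep(dt)


if __name__ == "__main__":
    run_vhi_loop(seconds=1.0)
```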

4 Perception and Affective Intelligence

The ability to perceive its environment, both the synthetic scene and the outside world of the user, is a key element when attempting to build a truly intelligent virtual agent. We define the VHI's input modalities, corresponding to vision, touch, and hearing, in the context of the communication process and the functionality of the animated character itself. In particular, the three implemented senses connect to information channels that have a direct effect on the animated character's gaze, facial expressions, locomotion, and body gestures. The foundation for implementing this functionality and executing these actions hinges partly on the processing of external visual, auditory, and touch-related signals. The VHI connects its users to the synthetic human's 3D world by means of markers. Markers are invisible, dimensionless representations of 3D positions attached to any object in the scene. They carry unique hierarchical labels, and we can refer to them by their names when describing a high-level task. We could attach multiple markers to a camera, a table, the floor, or the virtual monitor, a special-purpose object in the virtual environment that is analogous to the computer screen. We map the live video stream directly onto this monitor and assign and display the results of the visual processing here (the locations and identities of the people in front of the terminal) using temporally changing markers. One of the major advantages of the marker-based representation is that it defines and executes all tasks that the virtual human needs to carry out at a high, symbolic level. This information can then be processed by our cognitive engine based on the SOAR [8] architecture or controlled locally. An example of the latter is directing the gaze of the character by issuing a "look at me" command, where "me" is the name of the marker attached to the currently active camera. To carry out these commands, the VHI includes advanced target animation and inverse kinematics functions that take into consideration the current constraints on the virtual human's body.
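As an illustration of the marker concept, the sketch below shows one possible way a marker registry and a symbolic "look at <marker>" command could be realized. The names and the simple gaze-vector computation are assumptions made for the example; the actual VHI resolves such commands through its target animation and inverse-kinematics subsystems, which are not reproduced here.

```python
# Sketch of a marker-based symbolic layer: named 3D anchors attached to scene
# objects, plus a "look at <marker>" command resolved to a gaze direction.
# Names and math are illustrative; the real system applies full body IK.

from dataclasses import dataclass
import math


@dataclass
class Marker:
    name: str        # hierarchical label, e.g. "camera.front.me"
    position: tuple  # (x, y, z) in world coordinates


class MarkerRegistry:
    def __init__(self):
        self._markers = {}

    def attach(self, name: str, position: tuple) -> None:
        self._markers[name] = Marker(name, position)

    def get(self, name: str) -> Marker:
        return self._markers[name]


def look_at(head_position: tuple, registry: MarkerRegistry, marker_name: str) -> tuple:
    """Return a unit gaze direction from the character's head to a named marker."""
    target = registry.get(marker_name).position
    d = [t - h for t, h in zip(target, head_position)]
    norm = math.sqrt(sum(c * c for c in d)) or 1.0
    return tuple(c / norm for c in d)


# Usage: the active camera publishes a marker called "me"; the symbolic
# "look at me" command is then resolved to a concrete gaze direction.
registry = MarkerRegistry()
registry.attach("me", (0.0, 1.6, 2.0))            # camera in front of the user
gaze_dir = look_at((0.0, 1.7, 0.0), registry, "me")
print("gaze direction:", gaze_dir)
```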

[Figure 2 schematic; its labels include: Portable Real-time VHI Computer, High Fidelity Panoramic 360° Virtual Studio, Physics & Particle Simulation, Low-cost Input Devices.]
Fig. 2. Key elements of the Virtual Human Interface used to implement our model of Affective Intelligence.

5 Closed Loop Model of Communication

The closed-loop model of human-computer interaction, as implemented in the VHI affective intelligence system, combines the power of high-fidelity facial animation with the advanced computer vision and perception techniques briefly described in the preceding section to create a truly bi-directional user interface, in which the actions and reactions of the person in front of the computer monitor directly affect the behavior of the interactive content delivery system. In particular, we model the interaction process between the user and the system as part of a closed-loop dialog taking place between the participants. This dialog exploits the most important characteristics of face-to-face communication, including the use of non-verbal gestures and meta-communication signals to control the flow of information. As such, the Closed Loop Dialog model draws on the characteristics of human face-to-face communication, specifically gaze behavior, which follows well-researched patterns and rules of interaction. As an example, this mechanism, known as turn-taking, may appear during the course of a computerized lecture where the roles of the listener and the speaker shift periodically [12,14].

The principal idea behind this closed-loop system is to treat the problem of recognizing a user's internal state as an active optimization problem, in contrast to a passive observer method. In this context the goal of the interface is to maximize the user's ability to absorb the information being presented. This goal is achieved by constantly measuring levels of interest, attention, and perhaps fatigue. In this way the interface framework can consider how the user's most important resource, namely his or her attention, is allocated over time. To maximize the gain in knowledge we must minimize the cost of interaction by reducing the overall demand on attention. In other words, the system must be able to "read" the many telltale signs the user is projecting during the course of interaction. These signs, which all of us read during the course of our daily interactions with others, can be readily derived from visual cues made available to the training system using a simple web camera.

The dialog model presented here assumes that while being instructed or informed, the user plays the role of the listener, and when active input is required, he or she plays the role of a talker. From an HCI point of view we are interested in gathering input to feed symbolic data to the main cognitive engine, i.e., the Artificial Intelligence (AI) module that drives the application. With the help of this symbolic, quantified representation of the outside world, a machine-understandable layer is created that supplies actions and strategies suitable for action planning. As an example, paying attention means that the guiding gestures (e.g., pointing to an object) presented by the animated digital human are followed by an appropriate shift of gaze and attention to that region of the screen. Thus, the user's performance can be gauged by forming expectations and comparing the measured reactions with the anticipated responses. To measure the user's responses we endowed our virtual human with perception, most importantly the ability to see. Specialized vision modules are responsible for detecting and recognizing one or multiple people in front of the display and analyzing their facial location, expressions, point of gaze, and other telltale signs of their attentive and emotional states.
Thus, the built-in face recognition and facial information processing module plays a critical role in understanding and appropriately reacting to the user's needs. Maintaining eye contact, together with the ability to turn away from or look at the user during the course of interaction, believably mimics the everyday communication process taking place between real people and thus subliminally signals to the user his or her turn. As a result, our virtual human's ability to deliver non-verbal communication signals that support the information content by means of subtle facial gestures is of critical importance in implementing the closed-loop model. This goal was achieved by creating high-fidelity digital copies of living people. These virtual face models can talk, act, and deliver over 1000 different facial expressions seamlessly integrated into the communication process. Since people rarely express emotions in front of their computer screens (except, perhaps, when reading email), the virtual human interface system attempts to keep track of the user's internal state using psychological models of emotion and learning. We then internally model and adapt to these user states and provide a mechanism for the virtual agent to express its own feelings with the purpose of modulating the user's mood. Finally, body gestures add a further layer to this process. Specifically, the overall orientation of the body, hand gestures and beats, transient motions, and pointing gestures may be used to indicate action, request input, or direct the user's attention to a particularly important piece of information. As a result, the virtual human interface solution is capable of "driving" the user's attention and expecting certain reactions in response. When those expectations fail to be realized, it may well be an indication that the user has lost interest or could not follow the instructions the digital human was presenting. In such situations, the system is capable of backtracking and adjusting its presentation strategy accordingly.
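The following fragment sketches the expectation-versus-observation logic described above: after the virtual human points at a region of the screen, the system expects the user's gaze to follow, and repeated mismatches trigger a change in presentation strategy. The region names, window length, and backtracking policy are illustrative assumptions, not the VHI's actual implementation.

```python
# Hedged sketch of the closed-loop expectation check: guiding gestures create
# an expected gaze target; repeated failures to meet it suggest lost attention.

from collections import deque


class ClosedLoopMonitor:
    def __init__(self, miss_limit: int = 3):
        self.expected_region = None
        self.recent_hits = deque(maxlen=miss_limit)

    def expect_attention(self, region: str) -> None:
        """Called when the virtual human points at or refers to a screen region."""
        self.expected_region = region

    def observe_gaze(self, gazed_region: str) -> str:
        """Compare measured gaze with the expectation and pick a strategy."""
        if self.expected_region is None:
            return "continue"
        self.recent_hits.append(gazed_region == self.expected_region)
        if len(self.recent_hits) == self.recent_hits.maxlen and not any(self.recent_hits):
            # The user never looked where expected: assume lost attention.
            self.recent_hits.clear()
            return "backtrack_and_simplify"
        return "continue"


monitor = ClosedLoopMonitor()
monitor.expect_attention("diagram_panel")
for observed in ["menu_bar", "off_screen", "menu_bar"]:
    print(monitor.observe_gaze(observed))   # third call suggests backtracking
```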

6 The Power of Face-to-Face Interaction for Affective Interfaces

Face-to-face communication with an interactive digital virtual human is one of the most powerful methods for providing personalized and highly efficient information exchange with a built-in emotional component. As introduced in the preceding section, this dialog precisely governs the roles of the "talker" and the "listener", and these roles, which may shift periodically as the interaction progresses, adhere to a set of rules that are well documented in the communication literature [12-14]. The rules of this ongoing dialog can therefore be used very effectively to limit the possible interpretations of the user's behavior and to categorize them as appropriate or not in a given information exchange scenario. To explain this better, let us briefly discuss the "rules" of participating in dialogs. The typical gaze pattern when two people converse with one another is asymmetrical. It consists of the listener maintaining fairly long gazes at the speaker with short interruptions to glance away, while the speaker looks at the listener with frequent but much shorter glances. It has been estimated that when two people are talking, about 60% of the conversation involves gaze and 30% involves mutual gaze, i.e., eye contact [13,14]. People look nearly twice as much while listening (75%) as while speaking (41%), and they tend to look less when objects are present, especially if they are related to the conversation. These results suggest that eye gaze is a powerful mechanism for helping to control the flow of turn-taking in a human-computer interface dialog. Measuring and interpreting the gaze behavior of the user in the context of the face-to-face communication process (as opposed to treating it as a generic random variable) therefore has clear advantages. In particular, we can take advantage of the closed-loop nature of the dialog process in that the communication system no longer passively observes the user but rather, based on its current assessment of his or her state, subconsciously prompts the user in order to gauge the response. These prompts may occur in the form of multi-modal output, including visual or auditory cues. In conclusion, we believe that models of face-to-face communication are among the most important cornerstones of affective computing.
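As a rough illustration of how such gaze statistics could feed the dialog model, the sketch below estimates whether the user is currently behaving like a listener or a talker from the fraction of recent frames in which the gaze rests on the agent. The window length and thresholds (loosely echoing the 75% and 41% figures quoted above) are assumptions made for this example rather than parameters of the actual system.

```python
# Illustrative sketch: classify the user's conversational role from the
# proportion of recent frames with gaze on the agent. Thresholds and window
# size are assumptions, not the VHI's actual classifier.

from collections import deque


class GazeRoleEstimator:
    def __init__(self, window_frames: int = 150):   # roughly 5 s at 30 fps
        self.window = deque(maxlen=window_frames)

    def add_frame(self, gaze_on_agent: bool) -> None:
        self.window.append(gaze_on_agent)

    def estimate_role(self) -> str:
        if not self.window:
            return "unknown"
        ratio = sum(self.window) / len(self.window)
        if ratio >= 0.60:          # sustained gaze: typical of listening
            return "listener"
        if ratio <= 0.45:          # frequent glances away: typical of talking
            return "talker"
        return "transition"        # ambiguous region, e.g. a turn change


estimator = GazeRoleEstimator(window_frames=10)
for on_agent in [True] * 8 + [False] * 2:
    estimator.add_frame(on_agent)
print(estimator.estimate_role())   # prints "listener" (ratio 0.8 >= 0.60)
```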

7 Conclusion

In this paper we described a novel affective Human-Computer Interface that casts the process of interaction as a continuous, bi-directional dialog taking place between two parties in a closed-loop virtual environment. This two-way communication was implemented by combining interactive photo-real virtual human models, made possible by state-of-the-art animation and real-time rendering techniques, with advanced computer vision algorithms that process information obtained from the user. This combined solution successfully created a platform that utilizes the advanced capabilities of virtual humans to express subtle, non-verbal facial signals and subsequently modulate verbal content as a function of the user's actions and reactions. We argue that such a closed-loop interface, termed here Affective Intelligence, may serve as the foundation for the next generation of human-centered intelligent devices, interfaces, and agents. These systems will understand what we want from them and act accordingly to provide a wide range of future services.

References

1. Badler, N., M.S. Palmer, and R. Bindiganavale, Animation Control for Real-Time Virtual Humans, Comm. ACM, vol. 42, no. 8, 1999.
2. Terzopoulos, D. and K. Waters, Techniques for Realistic Facial Modeling and Animation, Computer Animation, M. Thalmann and D. Thalmann, eds., Springer-Verlag, 1991.
3. Pasquariello, S. and C. Pelachaud, Greta: A Simple Facial Animation Engine, Proc. 6th Online World Conf. Soft Computing in Industrial Applications, Springer-Verlag, 2001.
4. Morishima, S., Face Analysis and Synthesis, IEEE Signal Processing, vol. 18, no. 3, 2001.
5. Cassell, J., H. Vilhjálmsson, and T. Bickmore, BEAT: The Behavior Expression Animation Toolkit, Proc. SIGGRAPH, ACM Press, 2001.
6. Poggi, I., C. Pelachaud, and F. de Rosis, Eye Communication in a Conversational 3D Synthetic Agent, J. Artificial Intelligence, vol. 13, no. 3, 2000.
7. Rickel, J.S. et al., Toward a New Generation of Virtual Humans for Interactive Experiences, IEEE Intelligent Systems, vol. 17, no. 4, 2002.
8. Johnson, W.L., W. Rickel, and J.C. Lester, Animated Pedagogical Agents: Face-to-Face Interaction in Interactive Learning Environments, Int'l J. Artificial Intelligence in Education, vol. 11, 2000.
9. Takács, B. and B. Kiss, Virtual Human Interface: A Photo-realistic Real-Time Digital Human with Perceptive and Communicative Intelligence, IEEE Computer Graphics and Applications, Special Issue on Perceptual Multimodal Interfaces, September-October 2003.
10. Takács, B., Special Education & Rehabilitation: Teaching and Healing with Interactive Graphics, IEEE Computer Graphics and Applications, Special Issue on Computer Graphics in Education, 2005, under review.
11. Kiss, B., B. Benedek, G. Szijarto, and B. Takács, Closed Loop Dialog Model of Face-to-Face Communication with a Photo-Real Virtual Human, SPIE Electronic Imaging: Visual Communications and Image Processing, San Jose, California, 2004.
12. Short, J., E. Williams, and B. Christie, The Social Psychology of Telecommunications, Wiley, London, 1976.
13. Argyle, M., The Psychology of Interpersonal Behavior, Penguin Books, London, 1967.
14. Argyle, M. and M. Cook, Gaze and Mutual Gaze, Cambridge University Press, London, 1977.
15. Digital Elite Inc., www.digitalElite.net