
Visual Communications and Image Processing 2004 (EI25) SPIE Electronic Imaging, 18–22 January 2004, San Jose, California

Closed Loop Dialog Model of Face-to-Face Communication with a Photo-Real Virtual Human

Bernadette Kiss a, Balázs Benedek a, Gábor Szijártó a, Barnabás Takács* b

a VerAnim, Budapest, Hungary
b WaveBand / Digital Elite, Los Angeles, California, USA

* [email protected]; phone +1 310 312-0747; fax +1 310 312-1974; web www.digitalElite.net

ABSTRACT

We describe an advanced Human Computer Interaction (HCI) model that employs photo-realistic virtual humans to provide digital media users with information, learning services and entertainment in a highly personalized and adaptive manner. The system can be used as a computer interface or as a tool to deliver content to end-users. We model the interaction process between the user and the system as part of a closed loop dialog taking place between the participants. This dialog exploits the most important characteristics of a face-to-face communication process, including the use of non-verbal gestures and meta-communication signals to control the flow of information. Our solution is based on a Virtual Human Interface1 (VHI) technology that was specifically designed to create emotional engagement between the virtual agent and the user, thus increasing the efficiency of learning and/or absorbing any information broadcast through this device. The paper reviews the basic building blocks and technologies needed to create such a system and discusses its advantages over other existing methods.

Keywords: virtual human interface, HCI, non-verbal communication, dialog design, artificial intelligence, perceptive intelligence, animation, face recognition, computer vision

1. INTRODUCTION

Creating virtual digital humans, particularly animated faces, for use as a universal computer interface has been a key interest of computer graphics and communication research for many decades. It can be argued that photo-real virtual humans (as opposed to non-realistic characters or stylized humanoids) will form the foundation of a new generation of HCI, as they are capable of accessing both declarative and procedural memory functions within the brain. However, to mimic the quality of everyday human communication, future computer interfaces must combine the benefits of high visual fidelity with conversational intelligence and, above all, the ability to modulate the emotions of their users in a personalized manner. From a perceptual interface standpoint, research suggests that only HCI systems that create a powerful experience really work. While traditional devices and interfaces provide help on how to do things, virtually real digital environments, i.e. photo-real digital humans and synthetic 3D elements, engage the user in a process of emotional responses, which in turn opens a channel to "engrave" knowledge and modify behavior.

2. CLOSED-LOOP INTERACTION MODEL

The closed loop interaction model of human computer interaction, as described herein, combines the power of high fidelity facial animation with advanced computer vision and perception techniques to create a truly bi-directional user interface in which the actions and reactions of the person in front of the computer monitor directly affect the behavior of the interactive content delivery system. In particular, we model the interaction process between the user and the system as part of a closed loop dialog taking place between the participants. This dialog exploits the most important characteristics of face-to-face communication, including the use of non-verbal gestures and meta-communication signals to control the flow of information.



As such, the Closed Loop Dialog model draws on the characteristics of human face-to-face communication, specifically the gaze behavior that follows well researched patterns and rules of interaction. As an example, this mechanism, known as turn-taking, may appear during the course of a computerized lecture where the roles of listener and speaker shift periodically.

The principal idea behind the closed loop system is to treat the problem of recognizing a user's internal state as an active optimization problem, in contrast to a passive observer method. In this context the goal of the interface is to maximize the user's ability to absorb the information being presented. This goal is achieved by constantly measuring levels of interest, attention and perhaps fatigue. In this way the interface framework can consider how the user's most important resource, namely his or her attention, is allocated over time. To maximize the gain in knowledge we must minimize the cost of interaction by reducing the overall demand on attention. In other words, the system must be able to "read" the many telltale signs the user is projecting during the course of interaction. These signs, which all of us read during the course of our daily interactions with others, can be readily derived from visual cues made available to the training system using a simple web-camera.

The dialog model assumes that while being instructed, the user plays the role of the listener and, when active input is required, he or she plays the role of the talker. From an HCI point of view we are interested in gathering input for the main AI module that drives the application and in creating a symbolic, quantified representation on which appropriate actions and strategies can be planned. As an example, paying attention means that the guiding gestures (e.g. pointing to an object) presented by the animated digital human are followed by an appropriate shift of gaze and attention to that region of the screen. Thus, the user's performance can be gauged by forming expectations and comparing the measured reactions with the anticipated responses.

To measure the user's responses we enabled our virtual human with perception, specifically the ability to see. Specialized vision modules are responsible for detecting and recognizing one or multiple people in front of the display and analyzing their facial location, expressions, point of gaze and other telltale signs of their attentive and emotional states. Thus, the built-in face recognition and facial information processing module plays a critical role in understanding and appropriately reacting to the user's needs. Maintaining eye contact and the ability to turn away from or look at the user during the course of interaction believably mimics the everyday communication process taking place between real people and thus subliminally signals to the user when it is his or her turn.

The virtual human's ability to deliver non-verbal communication signals that support the information content by means of subtle facial gestures is of critical importance in implementing the closed-loop model. To achieve this goal we designed and implemented high fidelity digital copies of living people. These virtual face models can talk, act and deliver over 1000 different facial expressions seamlessly integrated into the communication process. Since people rarely express emotions in front of their computer screens, except when reading email, the virtual human interface system attempts to keep track of the user's internal state using psychological models of emotion and learning.
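To make the expectation mechanism concrete, the following is a minimal sketch, assuming a gaze estimate in normalized screen coordinates supplied by the vision module; the class names, weights and the backtracking threshold (GazeSample, AttentionMonitor, 0.4) are illustrative and are not taken from the VHI implementation.

```python
# Hypothetical sketch of the closed-loop expectation mechanism described above.
from dataclasses import dataclass
import math


@dataclass
class GazeSample:
    x: float          # normalized screen coordinates of the user's point of gaze
    y: float
    timestamp: float  # seconds


class AttentionMonitor:
    """Compares measured gaze against the gaze shift an on-screen gesture should evoke."""

    def __init__(self, radius: float = 0.15, decay: float = 0.9):
        self.radius = radius    # how close the gaze must land to the cued region
        self.decay = decay      # exponential smoothing of the attention estimate
        self.attention = 1.0    # 1.0 = fully attentive, 0.0 = disengaged

    def expect_gaze_at(self, cue_x: float, cue_y: float, samples: list[GazeSample]) -> bool:
        """Return True if any gaze sample after the cue lands near the cued region."""
        hit = any(math.hypot(s.x - cue_x, s.y - cue_y) < self.radius for s in samples)
        # Hits reinforce the running attention estimate, misses erode it.
        self.attention = self.decay * self.attention + (1 - self.decay) * (1.0 if hit else 0.0)
        return hit

    def should_backtrack(self) -> bool:
        """Low sustained attention suggests the presentation strategy should change."""
        return self.attention < 0.4
```

In use, the system would call expect_gaze_at with the screen location the virtual human just pointed to and the gaze samples collected in the following second or two, then consult should_backtrack before deciding how to continue.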
We then internally model and adapt to these user states and create a mechanism for the virtual agent to express its own feelings with the purpose of modulating the user's own mood. Body gestures add a further layer to this process. Specifically, the overall direction of the body, hand gestures and beats, transient motions and pointing gestures may be used to indicate action, request input or direct the user's attention to a particularly important piece of information. As a result, the virtual human interface solution is capable of "driving" the user's attention and expecting certain reactions in response. When those expectations fail to be realized, it may well be an indication that the user has lost interest or could not follow the instructions the digital human was presenting. In such situations, the system is capable of backtracking and adjusting its presentation strategy accordingly.

3. ADVANTAGES OF FACE-TO-FACE DIGITAL COMMUNICATION

Face-to-face communication with a digital interactive virtual human is one of the most powerful methods for providing personalized and highly efficient information exchange. At the same time learning, in the general sense, takes place in the form of a unique dialog between the system and its user. As described in the preceding section, this dialog very precisely governs the roles of the "talker" and the "listener" at every moment. These roles, which may shift periodically as the interaction progresses, adhere to a set of rules that are well documented in the communication literature2.


The rules of this on-going dialog can therefore be used very effectively to limit the possible interpretations of the user's behavior and to categorize them as appropriate or not in a given information exchange scenario. To explain this better, let us briefly discuss the "rules" of participating in dialogs. The typical gaze pattern when two people converse with one another is asymmetrical: the listener maintains fairly long gazes at the speaker with short interruptions to glance away, while the speaker directs frequent but much shorter looks at the listener. It has been estimated that when two people are talking, about 60% of the conversation involves gaze and 30% involves mutual gaze, i.e. eye contact3,4. People look nearly twice as much while listening (75%) as while speaking (41%), and they tend to look less when there are objects present, especially if these are related to the conversation. These results suggest that eye gaze is a powerful mechanism to help control the flow of turn taking in a human computer interface dialog. Measuring and interpreting the gaze behavior of the user in the context of the face-to-face communication process (as opposed to treating it as a generic random variable) therefore has clear advantages. In particular, we can take advantage of the closed-loop nature of the dialog process in that the communication system no longer passively observes the user but rather, based on its current assessment of his or her state, subconsciously prompts the user in order to gauge the response. These prompts may occur in the form of multi-modal output, including visual or auditory cues.

4. AFFECTIVE STATES IN LEARNING

The most important affective states in the context of human computer interaction, and specifically learning, are interest, boredom, excitement, confusion, and fatigue. These are frequently accompanied by so-called surface level behaviors such as different patterns of facial expressions, eye-gaze, head nods, hand movements, gestures and body posture. These behaviors can be interpreted as measures of paying attention or simply as indicators of being "on-task" and/or "off-task"5. The direction of eye gaze and head orientation, for example, are prime indicators of a user's focus of attention. In an on-task state, the focus of attention is mainly on the problem the person is working on, whereas in an off-task state the eye-gaze might wander off. Similarly, pupil dilation is also known to express a level of interest. When we find a particular subject fascinating, our pupils dilate unconsciously, as if opening up to the speaker. On the other hand, boredom can be detected by the withdrawal the user expresses through constricted pupils. Spontaneous facial expressions and head nods are also good indicators of motivational and affective states. In particular, approving head nods and facial actions such as a smile, tightening of the eyelids while concentrating, widening of the eyes or raising of the eyebrows suggest interest and excitement (on-task).
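As an illustration of how such surface-level cues might be combined, the sketch below maps a handful of them to a single on-task / off-task label; the cue set, weights and threshold are assumptions made for the example and are not values from the system or the cited studies.

```python
# Illustrative mapping from surface-level cues to an on-task / off-task label.
from dataclasses import dataclass


@dataclass
class SurfaceCues:
    gaze_on_screen: float   # fraction of recent frames with gaze on the display (0..1)
    pupil_dilation: float   # relative change versus the user's baseline (+ = dilated)
    head_nods: int          # approving nods detected in the observation window
    head_shakes: int        # head shakes detected in the observation window
    brow_lowered: bool      # e.g. an AU4-like lowering of the eyebrows
    smile: bool             # e.g. an AU12-like lip-corner pull


def on_task_score(c: SurfaceCues) -> float:
    """Combine the cues of Fig. 1 into a single score; > 0 suggests on-task behavior."""
    score = 0.0
    score += 2.0 * (c.gaze_on_screen - 0.5)        # looking at the screen vs. looking away
    score += 1.0 * c.pupil_dilation                # dilation ~ interest, constriction ~ boredom
    score += 0.5 * c.head_nods - 0.5 * c.head_shakes
    score += 0.5 if c.smile else 0.0
    score -= 0.5 if c.brow_lowered else 0.0
    return score


if __name__ == "__main__":
    cues = SurfaceCues(0.8, 0.1, 2, 0, False, True)
    print("on-task" if on_task_score(cues) > 0 else "off-task")
```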

Fig.1. Surface level behaviors as indicators of a user’s internal state2.


On the other hand, head shakes, lowering of the eyebrows, nose wrinkling and depressed lower lip corners indicate off-task behavior. Finally, on a larger scale, body postures convey specific meanings regarding the actions of the user in front of the computer terminal. Leaning forward towards the computer screen might be a sign of attention, while slumping in the chair or fidgeting suggests frustration or boredom5. Figure 1 illustrates the different on/off-task states as they relate to various non-verbal, meta-communication signals. [Note: The AU notation refers to the specific facial action units (FACS6) involved in the respective expressions.]

5. BUILDING A PHOTO-REAL VIRTUAL HUMAN

The success of the closed loop HCI dialog system partly depends on the system's ability to mimic a face-to-face communication process between user and computer in which the digital human acts and appears as believable as possible. Although much research in this area has been conducted using relatively low-resolution digital faces7 and video-based dialog systems8, the notion of using high resolution animated digital humans has not yet been explored. Building a photo-real virtual human for HCI involves multiple steps, including detailed skeletal modeling, body construction with underlying muscle deformation systems, hand gesture libraries, facial animation, hair and, finally, cloth, which is added last; one possible way to organize these layers is sketched below. Figure 2 demonstrates the various aspects of the digital human modeling process we use in the closed loop dialog system.
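As a rough illustration of how those layers could be organized, here is a minimal, assumed data structure; the field names and example values are invented for the sketch and do not reflect the actual VHI asset format.

```python
# Assumed, simplified representation of the modeling layers listed above.
from dataclasses import dataclass, field


@dataclass
class VirtualHumanModel:
    skeleton: dict[str, list[str]]              # joint hierarchy: joint -> child joints
    muscle_deformers: list[str]                 # deformation systems bound to the skeleton
    hand_gestures: dict[str, str]               # gesture name -> animation clip identifier
    facial_expressions: dict[str, list[float]]  # expression name -> morph-target weights
    hair: str = "geometry"                      # "geometry" or "render_effect"
    cloth: list[str] = field(default_factory=list)  # cloth meshes simulated on top


# Example: a made-up minimal character with one gesture and one expression.
model = VirtualHumanModel(
    skeleton={"root": ["spine"], "spine": ["head", "l_arm", "r_arm"]},
    muscle_deformers=["face_muscles", "torso_muscles"],
    hand_gestures={"point_left": "clip_point_left"},
    facial_expressions={"smile": [0.0, 0.7, 0.2]},
)
```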

Fig.2. Multiple steps involved in creating a believable real-time virtual human for HCI.

Perhaps the most challenging aspect of this process is the ability to model faces that can express the mimicry and small nuances characteristic of human non-verbal communication. To date, high fidelity facial modeling has remained a challenge, in part because the modeling process becomes exponentially more difficult as the resolution of the required output surface increases.


Creating models of a few hundred polygons, as most other systems do, is relatively easy by manually selecting a small set of corresponding reference points and/or using basic re-meshing algorithms. However, when one needs to determine dense correspondences of 30-100K points, these techniques fail to deliver reliable results. The technical challenge further intensifies when different facial expressions come into play, requiring advanced tracking solutions to be developed. The resolution of the face models in our system ranges from 30-100K polygons and readily includes the skull, eyes, lashes, teeth and tongue. The body and cloth typically comprise an additional 50-100K polygons, while hair may be added as geometry or as a render effect. In addition to high geometric and deformation fidelity, the latest advances in PC-based graphics hardware also allow us to model better textures and lighting effects, including time dependent sub-surface events such as blushing, pupil dilation or even changing skin reflection properties as a function of sweat or stress. Our Virtual Human Interface system supports the automatic animation and control of these subtle details, thereby further increasing the language of non-verbal signals at the virtual human's disposal to evoke the user's emotional reaction. Figure 3 shows an example of a high fidelity female character talking with the user. The content, i.e. "what is being said", comes from a text-to-speech engine or a prerecorded set of sentences. The emotional modulation, i.e. "how it is said", is dynamically rendered as a function of user interaction. The resulting output is a real-time animated human capable of looking at the user and providing him or her with the necessary piece of information they requested.
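The split between content ("what is said") and emotional modulation ("how it is said") could be sketched roughly as follows; the channel names, thresholds and the modulate_for_user policy are assumptions made for illustration, not the VHI system's actual parameters.

```python
# Hedged sketch of separating utterance content from its emotional modulation.
from dataclasses import dataclass, field


@dataclass
class EmotionalModulation:
    expression_weights: dict[str, float] = field(default_factory=dict)  # e.g. {"smile": 0.4}
    blush: float = 0.0            # 0..1 sub-surface blushing intensity
    pupil_dilation: float = 0.0   # 0..1 additional dilation
    speech_rate: float = 1.0      # multiplier applied to the speech engine


def modulate_for_user(attention: float, valence: float) -> EmotionalModulation:
    """Pick a modulation that tries to re-engage an inattentive user or mirror a positive one."""
    if attention < 0.4:
        # User seems disengaged: brighten the delivery and slow down slightly.
        return EmotionalModulation({"smile": 0.5, "brow_raise": 0.3}, blush=0.1, speech_rate=0.9)
    if valence > 0.5:
        # User appears positive: mirror the mood with a mild smile.
        return EmotionalModulation({"smile": 0.3}, pupil_dilation=0.2)
    return EmotionalModulation()  # neutral delivery


def render_utterance(text: str, mod: EmotionalModulation) -> dict:
    """Bundle content and modulation for a (hypothetical) animation / speech back end."""
    return {"text": text, "channels": mod}
```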

Fig. 3. Subtle changes of emotional display of a photo-real female character during interaction.

6. COMMUNICATIVE INTELLIGENCE & BEHAVIOR DESIGN

In the context of the closed loop dialog model of HCI, perceptive and communicative intelligence is defined as the system's capability to be aware of the number of users in front of the terminal, to maintain eye contact and look at them when necessary, to follow their motions as they lean sideways or backwards, and to deliver and understand non-verbal bi-directional communication signals, such as nodding or point of regard, that support the spoken and animated content. Our approach is to make the user believe that the virtual character is paying personalized attention or, in simpler terms, to "fake" intelligence by producing communicative signals that convince the user that the virtual human's reactions are actually in response to, and in accordance with, his or her actions in front of the computer terminal.
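One trivial way to picture this behavior is an event-to-reaction table that maps perceived user events to communicative signals; the event and reaction names below are purely illustrative and are not identifiers from the VHI system.

```python
# Illustrative event-to-reaction table for the perceptive behaviors described above.
REACTIONS = {
    "user_appears":        "greet_and_make_eye_contact",
    "user_leans_sideways": "shift_gaze_to_follow",
    "user_nods":           "acknowledge_and_continue",
    "user_looks_away":     "pause_and_reestablish_eye_contact",
    "user_leaves":         "suspend_presentation",
}


def react(event: str) -> str:
    """Return the communicative signal the virtual human should produce for an event."""
    return REACTIONS.get(event, "idle_gaze_at_user")
```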


The behavior of the virtual character can therefore be designed as a simple directed execution graph, much like a call center application. However, since each node branches according to the user's presence and actions in front of the web-camera, the overall result is a convincingly reactive dialog. To address the issue of creating new dialog-based situations we use the visual application designer shown in Figure 4. In this formulation the interaction process progresses from a start point towards an end or exit point, and the decision in each node about which way to branch is made based on the current input and the history. The directed graph describing the accessible information can be designed and produced off-line, before the interaction occurs, yet the actual execution depends on variables related to the user's actions. This simple methodology ensures the non-deterministic, yet controllable, execution of the original design since, unlike in classical graph-based systems, the variables that control the branching process are directly derived from the user's presence. The visual designer itself generates an ASCII script file that eventually controls the dynamics of the interaction process. In the VHI system, to ensure maximal compatibility with existing solutions, multiple layers of scripting are supported and available. These include TCL/TK and LUA, which can further be embedded into XML and HTML pages for other applications to use locally or remotely, over digital networks.
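A stripped-down sketch of such a directed execution graph follows; the node names, the variables (user_present, attention) and the Python form are assumptions made for illustration, whereas the actual designer emits an ASCII script interpreted by the VHI runtime.

```python
# Minimal sketch of a directed dialog graph whose branching depends on user-derived variables.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DialogNode:
    utterance: str                                       # what the virtual human says/does here
    branch: Optional[Callable[[dict], str]] = None       # chooses the next node from user variables


NODES = {
    "start": DialogNode(
        "Welcome. Let's begin the lesson.",
        branch=lambda v: "lesson" if v.get("user_present") else "wait",
    ),
    "wait": DialogNode(
        "Waiting for a user...",
        branch=lambda v: "start" if v.get("user_present") else "wait",
    ),
    "lesson": DialogNode(
        "Here is the first topic.",
        branch=lambda v: "recap" if v.get("attention", 1.0) < 0.4 else "exit",
    ),
    "recap": DialogNode("Let me repeat that more slowly.", branch=lambda v: "exit"),
    "exit": DialogNode("Goodbye."),
}


def run(variables: dict, start: str = "start", max_steps: int = 10) -> None:
    """Walk the graph; each branch decision is driven by variables derived from the user."""
    node_id = start
    for _ in range(max_steps):
        node = NODES[node_id]
        print(node.utterance)
        if node.branch is None:
            break
        node_id = node.branch(variables)


run({"user_present": True, "attention": 0.3})
```

The same graph can be replayed with different user-derived variables and follow a different path each time, which is the non-deterministic yet controllable execution described above.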

Fig. 4: Visual content and application designer used to create non-deterministic interactive dialogs.

7. VISUAL PERCEPTION

The facial analysis module employs a hierarchical 3D tracking methodology that uses classical face recognition algorithms and a 3D head model fitted to the user's image as captured by the video camera. In order to estimate what is happening on the user's face, we first decouple the global head motion from the non-rigid local deformations that represent expressions. The first step in this process is a head and face finder algorithm that verifies the presence and estimates the coarse orientation of the head in the web-camera's field of view. Subsequently we use a highly detailed facial expression model and a robust binary image metric to find the facial expression best matching the captured input. One of the most important advantages of 3D model-based tracking is that it can help interpret a wide range of facial expressions, even those that occur at the micro-expression level.


However, to take advantage of this technique one must have detailed facial deformation models for many people. To achieve this goal we first had to capture and analyze detailed 3D facial models of multiple people that included up to sixty different facial deformations. These facial deformation shapes were then used to define a person-independent face-space that could be matched to images of the user as captured by the web camera. This is demonstrated in Figure 5. For this figure we used a set of expressions that are virtually identical when examined in the classical tracked facial feature space. The model-based approach, however, is capable of finding the best matching expression and arriving at the correct interpretation of the input. The specific expressions, from left to right and top down, are "Full Blow", "Upper Blow", "Kiss" and "Mouth Left". In each sub-image the original input is shown on the left, while the 3D model (gray) fitted to the image is shown on the right.
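To give a rough feel for the final matching step, the sketch below scores candidate expressions from a person-independent bank against an observed binary face mask using a simple disagreement metric; the NumPy masks, expression names and the metric itself are simplifications assumed for the example, not the system's actual renderer or binary image metric.

```python
# Simplified, hypothetical version of the model-based expression matching step.
import numpy as np


def binary_image_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of pixels where two binary masks disagree (a simple, robust metric)."""
    return float(np.mean(a.astype(bool) ^ b.astype(bool)))


def match_expression(observed: np.ndarray, expression_bank: dict[str, np.ndarray]) -> str:
    """Return the expression whose rendered mask best matches the observed mask."""
    return min(expression_bank, key=lambda name: binary_image_distance(observed, expression_bank[name]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 64x64 masks standing in for renderings of "Full Blow", "Kiss", etc.
    bank = {name: (rng.random((64, 64)) > 0.5)
            for name in ["full_blow", "upper_blow", "kiss", "mouth_left"]}
    observed = bank["kiss"].copy()
    observed[:2, :2] ^= True  # a little observation noise
    print(match_expression(observed, bank))
```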

Fig. 5. Example of 3D model-based tracking and analysis of subtle expressions by fitting parametric 3D expression-space models to images captured by a web-camera.

8. CONCLUSION

In this paper we described a novel method of designing interactive human computer interfaces that treats the interaction process as a continuous and bi-directional dialog taking place between the user and his or her computer. This two-way communication was implemented by combining interactive photo-real virtual human models, made possible by state-of-the-art animation and real-time rendering techniques, with advanced computer vision algorithms that process facial information obtained from the user. This combined solution successfully created a platform that utilizes the advanced capabilities of the virtual human to express subtle non-verbal facial signals and to modulate verbal content as a function of the user's actions and reactions. We argue that such a closed loop interface may serve as the foundation for the next generation of human-centered devices that understand what we want from them and act accordingly.

9. REFERENCES

1. B. Takács, B. Kiss, "The Virtual Human Interface: A Photo-Realistic Digital Human", IEEE Computer Graphics and Applications, Special Issue on Perceptual Multimodal Interfaces, September-October 2003.
2. J. Short, E. Williams, B. Christie, The Social Psychology of Telecommunications, Wiley, London, 1976.
3. M. Argyle, The Psychology of Interpersonal Behavior, Penguin Books, London, 1967.
4. M. Argyle, M. Cook, Gaze and Mutual Gaze, Cambridge University Press, London, 1977.
5. A. Kapoor, S. Mota, R.W. Picard, "Towards a Learning Companion that Recognizes Affect", in Proceedings of Emotional and Intelligent Agents II: The Tangled Knot of Social Cognition, AAAI Fall Symposium, 2001.
6. P. Ekman, H. Oster, "Facial Expressions of Emotion", Annual Review of Psychology, 30, 527-554, 1979.
7. S. Morishima, "Face Analysis and Synthesis", IEEE Signal Processing Magazine, Vol. 18, No. 3, pp. 26-34, 2001.
8. W.G. Harless, M.A. Zier, M.G. Harless, R.C. Duncan, "Virtual Conversations: An Interface to Knowledge", IEEE Computer Graphics and Applications, Special Issue on Perceptual Multimodal Interfaces, September-October 2003.
