Building the databases needed to understand rich, spontaneous human behaviour

Roddy Cowie
Queen's University Belfast, BT7 1NN, Northern Ireland
[email protected]
Abstract

One of the motives for studying faces and gestures is the role that they play in spontaneous, socially rich interaction between humans. If computers are to interact with humans in that mode (or to analyse what they are doing in it), methods of interpreting the non-verbal signals that they use are critical. It is becoming clear that developing those methods requires databases whose complexity is of a very different order from those that are standard elsewhere. Samples cannot be generated to order, because acting does not reproduce the way features are distributed in spontaneous action. Data collections need to be very large, because there are extensive situational, individual, and cultural differences. It is a very large task to provide annotations that adequately capture the meaning of facial expressions or gestures. These problems are not insoluble, but sustained and well-directed efforts are needed to solve them.
1. Introduction

The traditional image of human-computer interaction involves a person consenting to behave (and think) in a special way, so that the computer will receive the special kind of input it needs. One of the aspirations of contemporary research is to break down that paradigm, by equipping computers to register the kinds of signal that humans find it natural to give when they are not pressed by circumstances to make deliberate choices about the channels they use to communicate. There is no agreed name for that kind of activity, but 'rich, spontaneous communication' poses fewer problems than most alternatives.

Dealing with rich, spontaneous communication is key to a wide range of applications. Some key examples are:

- Fluent discourse with people (a key to various ends);
- Facilitating access for people who cannot easily adopt a computer-friendly style - either because of their own enduring characteristics (age, disability, etc.) or because of the demands of the situation (carrying out a challenging task);
- Tasks that depend on adapting to the user's emotional and cognitive state - a key issue for in-car support systems, artificial companions, automatic tutors, etc.;
- Monitoring humans behaving in complex ways, for security, health, market research, etc.;
- Searching records that involve rich, spontaneous human behaviours (e.g. meetings) or feelings that are difficult to verbalise (e.g. music or photographs);
- 'Edutainment' (museum guides, etc.);
- Enhancing artistic performance;
- Immersive gaming.

Part of the challenge in these areas is to detect new kinds of pattern, with face and gesture prominent among them. But it is equally fundamental that the interpretation of the patterns is very unlike the interpretation of a keypress or a written character. An integral part of the task is to draw far-reaching conclusions about people from behaviours that are not deliberately framed to communicate that information. The conclusions involve issues such as what people are perceiving (or failing to perceive); which elements of the environment they are concerned with; their states of mind (in terms of mood, emotion, stress levels, etc); their stances towards key people, things, or issues (friendly, curious, dismissive, etc); their enduring relationships with key people (fondness, rivalry, etc); their intentions (communicatively, socially or physically); the difficulties that they are facing; their prospects; and so on.

Everyday human interaction is flexible, robust and effective because it is grounded in the ability to draw conclusions like those. Matching that ability is integral to equipping computers to register the kinds of signal that humans find it natural to give. It is an enormous, intellectually fascinating challenge.

Not all research on automatic analysis of face and gesture is directed to that challenge. Humans can and do adopt specialised ways of using face and gesture as well as other modalities. But the attraction of research on face and gesture is bound up with the intuition that it can reduce the need for humans to adapt their communicative styles to accommodate machines. The theme of this paper is that meaningful progress in that direction depends not only on advances in signal processing, but also on radical developments in the way we think about, construct, and distribute databases. Enough has been done in some of the relevant areas to suggest where the potential difficulties lie, and how the research communities concerned might go about addressing them. Articulating the issues at an early stage
has the potential to save a great deal of effort that would otherwise be wasted, or counterproductive.
2. Lessons from emotion databases

Emotion recognition is one of the best developed topics in the area. The topic is appealing because emotion in a broad sense - 'pervasive emotion' - is something that colours most of human life (being unemotional is a rare state); and it is central to quality of life. The problems surrounding emotion databases are a useful way to introduce the wider issues.
2.1. The Ekman paradigm and its problems

Paul Ekman and his collaborators provided two linked resources that had an enormous influence on automatic emotion recognition. One was a psychological theory proposing that behind the surface complexity lay a simple underlying set of six 'basic emotions' [8]. The other was a database showing photographs of faces simulating the basic emotions (the examples were chosen to fit the theory) [9]. The two together invite an appealing conception of the recognition task: that is, to assign photographs from the database to the categories defined by the theory. Work is still done in that mould, but it is fair to argue that its most significant contribution has been to expose problems that were not initially obvious, and that are central to modern research. Three stand out.

- The slightness of transfer. It is fair to hope that prototypical expressions generated by actors might be ideal training material, but the strategy depends on transfer to the situations where they are to be applied. In fact, transfer beyond the type of material used for training seems to be disappointingly slight - i.e. developing real applications requires realistic data.

- The depth of multimodality. The standard interpretation of a smile is happiness, but when it is combined with a catch in the voice and a harrowing story, it is probably 'putting on a brave face' - which is a signal, but not of happiness. Such effects are not at all uncommon in realistic data: information about feelings often lies in the way various elements relate to each other and to context, both synchronously and diachronically (over time). Hence, research on face and gesture will be considered from here on as part of a larger effort, not a self-contained one.

- The enormity of annotation. Recordings need to be annotated in ways that capture both the relevant physical movement patterns and their significance. Annotating the physical patterns is time-consuming even in the Ekman paradigm, but with naturalistic material the physical patterns to be considered extend over time and modalities. The effort required
becomes massive; and what is expressed is likely to be difficult to articulate at all, let alone to annotate. It is interesting that these problems are relatively intuitive. The rest of the section adds evidence and examples to back up the intuition.
2.2. The slightness of transfer

The issues surrounding naturalistic and acted data are complex and arouse strong feelings, making it difficult to form balanced judgments. The core attractions of acted material are linked to preferences that are deeply ingrained in the technological community.

Because the actor knows what emotion he or she is meant to convey at a given time, there is a kind of ground truth about the emotion being expressed. In contrast, labellings of naturalistic material typically express observers' judgments, which could quite possibly be wrong. Because actors can provide idealized examples, labellings can have high inter-observer agreement. In contrast, agreement is often quite low in labellings of naturalistic material. Actors can generate as many instances of a particular type as a balanced database requires. In contrast, naturalistic approaches are bound by frequency of occurrence, and that may cause major problems. For example, a landmark study by Ang et al [1] analysed 14h 36min of recordings. The commonest strong emotion was frustration, of which there were 42 unequivocal instances. Similarly, recordings of actors can use whatever lighting, camera angles, and microphone arrays suit the technology to be used for analysis. In contrast, the average person is very unlikely to show rich spontaneous behaviour when he or she is facing directly into a camera under bright lighting.

Stating the case in that way highlights the need to think carefully about the effect of transferring familiar standards into a new area. In particular, various aspects of what people reasonably think of as good practice draw research towards using actors rather than genuine samples of rich, spontaneous behaviour. There is no problem if acted material is actually an acceptable surrogate. But if it is not, then standard assumptions have to be re-evaluated. Later sections consider some of the shifts that may be called for.

The evidence on acted material is frustratingly incomplete, and some of the key sources deal with voice rather than face and gesture. But it is hard to think of any evidence that suggests acted material is a good basis for training systems to recognize emotion in an application context, and a good deal suggests the opposite. It is revealing that people find it easy to tell whether material is acted. A recent study in our laboratory took clips from TV channels, some showing spontaneous expressions of emotion, others showing extracts from dramas in which actors simulated comparable emotions.
As they watched, raters used a slider to indicate whether the material they were watching was acted or natural, and how confident they were. The distinction was not perceived instantly, but discrimination was statistically significant within a few seconds, and stabilised in about fifteen. The only enduring confusion involved an actress expressing (presumably) genuine emotion.

In the case of faces, at least one key variable is known. When individual frames from a sequence are rated for the emotion they show, natural sequences show more change from frame to frame than acted ones [19]. In speech, too, natural emotion samples are marked by more variety, in attributes such as pausing, timing, and laryngeal setting [12]. Those effects are presumably linked to the complexity that is typical of natural emotions [6]. An intuitive summary is that what actors do is to concentrate the behaviours that express emotion: the problem is that the result does not indicate how the same signs (or some of them) are likely to be distributed in the complex context of a normal interaction.

Batliner and his colleagues demonstrated the practical consequences [3]. They obtained data from successively closer approximations to interaction between a human and an artificial dialogue system, ranging from wholly acted material to genuine dialogue with a fully automatic telephone dialogue system, and trained systems to recognize the key emotions that occurred. When systems trained on acted material were applied to the natural material, performance was poor. On the basis of formal and informal evidence, most teams with a long-standing interest in emotion appear to agree with the conclusion of a recent panel that "If ... corpora are acted or contrived, then the resulting technology will be of little use" [4]. Many go further, and argue that training data needs to be not only realistic, but also derived from a close model of the application situation.

It is clear that these conclusions have major implications for database development. They are at two levels. First, there is probably no alternative to investing the effort needed to collect naturalistic data. Second, it may be necessary to develop databases that are not only naturalistic, but closely tailored to the applications envisaged.
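To make the transfer problem concrete, the following sketch shows the kind of check that findings like Batliner's imply: train a classifier on acted clips, test it on natural clips, and compare the result with a within-corpus baseline. It is a minimal illustration rather than a reconstruction of any of the studies cited; the feature matrices and label sets are hypothetical, and scikit-learn is used purely for convenience.

```python
# Illustrative cross-corpus check: train on acted material, test on natural material,
# and compare against a within-corpus baseline. Feature extraction is abstracted away:
# X_acted / X_natural are hypothetical per-clip feature matrices (e.g. prosodic or
# facial descriptors) and y_acted / y_natural are emotion labels shared by both corpora.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def transfer_report(X_acted, y_acted, X_natural, y_natural):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

    # Within-corpus baseline on the natural data (5-fold cross-validation).
    within = cross_val_score(clf, X_natural, y_natural, cv=5).mean()

    # Cross-corpus transfer: fit on acted clips, score on natural clips.
    clf.fit(X_acted, y_acted)
    across = clf.score(X_natural, y_natural)

    print(f"within-natural accuracy: {within:.2f}")
    print(f"acted-to-natural accuracy: {across:.2f}")
    return within, across
```

If the acted-to-natural score falls well below the within-natural baseline, the acted material is doing little of the work that matters for the application.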
2.3. The depth of multimodality

In the context of emotion, multimodality needs to be understood in a very broad sense. One level is that different expressive modalities, such as face and voice, may convey different impressions when they are considered in isolation [7]. Reaching a balanced assessment depends on taking all of these modalities into account. However, several other levels also need to be
considered. Linguistic content has already been mentioned. It has been shown that spotting the words a person speaks can improve emotion recognition [11]. However, qualitative analyses underscore the case for considering much deeper kinds of analysis. Gestures and facial expressions may function as a counterpoint to verbal expressions of irony, conspiracy, and so on; and to understand that kind of function, a deep understanding of the linguistic context needs to be available. More radically, expressions are related to physical, social, and intellectual context. Without contextual information, it may be clear that a person is reacting strongly to something, but not whether the reaction is positive or negative.

Again, these conclusions have major implications for database development. It is often crucial for databases to record not only the person expressing an emotion, but also the context in which the expression is occurring.
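One way to see what is at stake is a toy decision-level fusion rule. The sketch below assumes each modality has already produced a probability distribution over a shared label set; the labels, weights, and fusion rule are illustrative assumptions, and the point of the example is only that a reading taken from the face alone can be overturned once voice and words are weighted in. Real fusion, as argued above, would also need linguistic and situational context, which this toy rule ignores.

```python
# A minimal sketch of decision-level ("late") fusion, assuming each modality has
# already produced a probability distribution over the same emotion labels.
# The label set and weights are illustrative only.
import numpy as np

LABELS = ["positive", "negative", "neutral"]   # hypothetical label set

def fuse(face_probs, voice_probs, words_probs, weights=(0.4, 0.4, 0.2)):
    stacked = np.vstack([face_probs, voice_probs, words_probs])
    combined = np.average(stacked, axis=0, weights=weights)
    combined /= combined.sum()
    return LABELS[int(np.argmax(combined))], combined

# A smile (face channel says "positive") combined with a breaking voice can still
# yield a negative overall reading once the voice channel is weighted in.
label, probs = fuse([0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.6, 0.2])
print(label, probs)
```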
2.4. The enormity of annotation

It is apparent from what has already been said that it is a large task to annotate records of rich spontaneous behaviour in a way that captures all the information that is genuinely likely to be significant. The HUMAINE database [6] represents a sustained attempt to provide a model of the annotation that is needed. The database consists of 'clips' (audiovisual recordings lasting at most a few minutes). Each clip receives three levels of labeling. Global descriptors describe the clip as a whole. They have a key role in indexing as well as covering properties that are generally constant across a clip (features of recording, type of social context, etc). Sign labels describe the physical activities that convey relevant information. Emotion labels describe psychological states of the person concerned. The system of global descriptors is relatively underdeveloped, but considering the other two levels gives some sense of the scale of the task.

Sign labels cover three modalities: face, gesture (including posture), and speech. Two sublevels are distinguished (not always consistently). Low-level sign labels can be recovered by direct signal processing techniques. They include FAPs for the face; F0, pause and voicing information, and spectral (or cepstral) descriptors for speech; and so on. Mid-level descriptors for speech depend on human listeners. They include transliteration and descriptions of complex events such as hesitation, repetition, and paralinguistic elements. The database does not cover types of smile and nod, but they logically belong at this level. Descriptions of gestures blur the distinction: mid-level descriptors assigned by expert labelers specify the types and phases of gestures, but parameters (e.g.
speed and amplitude) are recovered automatically.

The set of emotion labels reflects efforts by several teams to capture the states that actually appear in naturalistic data in a tractable format [5], [7]. The main format consists of 'traces', which show an observer's instant-by-instant ratings of a specified psychological attribute. Seven basic traces are used, showing overall intensity of emotion; extent to which a genuine emotion is being masked; extent to which an emotion that is not felt is being simulated; the 'core affect' dimensions of valence and activation; and appraisal-related traces for perceived control of events and extent to which events are anticipated. These are supplemented by traces showing how strongly the person appears to be experiencing standard types of emotion (anger, happiness, etc): the types considered reflect what is present in the clip.

The point here is simply to convey the scale of the description that the HUMAINE database provides. In fact, a genuinely satisfying labeling would go well beyond what has been said. The database covers issues that have not been described here (e.g. descriptions of what a person is angry or sad about), and there are important areas that the database does not cover (e.g. gaze). One of the virtues of the database is that the clips themselves expose issues that need more work (they were deliberately chosen for that purpose).

Clearly, providing annotation on the scale that has been sketched here is a very different undertaking from describing the action units in a still photograph and assigning one of six category labels to it. But if the goal is genuine understanding of rich, spontaneous communication, it is difficult to imagine being satisfied with less.
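To give a more concrete sense of the structure described above, here is a minimal sketch of how a HUMAINE-style clip annotation might be organised in code. The field names and values are illustrative assumptions rather than the database's actual schema; what matters is the three-level organisation (global descriptors, sign labels, emotion traces) and the fact that emotion labels are time series rather than single categories.

```python
# Minimal sketch of a three-level clip annotation, loosely modelled on the structure
# described for the HUMAINE database. Field names and example values are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SignLabel:
    modality: str          # "face", "gesture" or "speech"
    level: str             # "low" (signal-derived) or "mid" (human-assigned)
    label: str             # e.g. an action unit, gesture phase, or transliteration
    start: float           # seconds from the start of the clip
    end: float

@dataclass
class ClipAnnotation:
    clip_id: str
    global_descriptors: Dict[str, str] = field(default_factory=dict)   # recording, context, ...
    sign_labels: List[SignLabel] = field(default_factory=list)
    # Each trace is a sequence of (time, rating) pairs from one observer,
    # e.g. "intensity", "valence", "activation", "masking".
    traces: Dict[str, List[Tuple[float, float]]] = field(default_factory=dict)

clip = ClipAnnotation("clip_042", {"context": "TV interview", "camera": "single, handheld"})
clip.sign_labels.append(SignLabel("speech", "mid", "hesitation", 3.2, 3.9))
clip.traces["valence"] = [(0.0, 0.1), (0.5, 0.0), (1.0, -0.4)]
```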
2.5. Lessons from emotion

Emotion, even in the broad sense of pervasive emotion, is part of rich, spontaneous communication, not the whole of it. It may be a particularly difficult area in some respects. However, it is one of the areas where there has been most sustained engagement with the issues surrounding rich, spontaneous communication, and it would be ill-advised to ignore the experience that has accumulated in the area. The most basic lesson is that developing suitable databases is a major challenge, both conceptually and in terms of scale.
3. Beyond straightforward emotion

Research on emotion is particularly well-developed, but there is enough work on related topics to provide some sense of what is needed to build a picture of rich, spontaneous behaviour in a broader sense. The scale of the task that emerges is very large.
3.1. The diversity of behaviour

Typically, emotion databases show individuals expressing an emotion-related state in what is assumed to be a reasonably normal way. However, research on rich, spontaneous communication should ideally have access to databases dealing with a much wider range of behaviours. One aim of this summary is to reflect the steady growth of resources that has taken place in many relevant areas. However, it also aims to highlight a recurring concern, which is an extension of one that has already been raised. The material often consists of relatively pure examples of a particular type of sign. However, that does not necessarily indicate what to expect when similar types of sign are embedded in a complex context. Hence the collections that are being developed may embody the simplest possible instances of a domain rather than exhausting it.

3.1.1 Expressive behaviours

There are various expressive uses of face, gesture and voice that are only indirectly related to emotion, and data is available or being collected for several of them. There are databases that consider both improvised and systematic forms of communication where arm and hand gesture is the primary medium. EmoTaboo [25] provides a substantial collection of improvised gestures (from a miming game). There are also databases for sign languages such as ASL, which bring together gesture and face in a linguistic structure [17]; and others that treat whole body movement as an expressive medium [24]. It remains to be tested how these resources relate to situations where facial and manual gestures are used to support primarily linguistic communication. However, some specialized ways of embedding gesture in conversation have been studied. They involve politeness [20] and backchannelling [13]. It is clear that they observe very specific rules of selection and timing.

3.1.2 Tasks involving emotion

The archetypal emotion database shows people whose main concern is expressing themselves. However, there is increasing interest in material where emotion-related signals are given by someone who is focused on carrying out a task. Emotion is crucial for special types of communication, such as persuasion, and the HUMAINE database includes samples from recordings of attempts to persuade people on an emotive topic (adopting a 'green' lifestyle). Cognitive tasks often elicit states which are neither purely emotional nor independent of emotion, such as being sure, interested, or thoughtful. Baron-Cohen's work on them has aroused strong interest in computational research [14] [10]. Shortening Baron-Cohen's description, they can be called epistemic-emotional. Emotion is also a significant factor in perceptual-motor
tasks. Driving is the most studied example [18]. It shows that signs of emotion may be radically different from standard lists: abrupt braking and swerving, impetuous overtaking, and so on.
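As a toy illustration of how such driving-specific signs might be captured, the sketch below flags abrupt braking and swerving from simple telemetry samples. The field names, sampling assumptions, and thresholds are invented for illustration and are not taken from the study cited above.

```python
# Toy sketch: flag emotion-related driving events from telemetry rather than from
# face or voice. Thresholds and field names are assumptions for illustration only.
BRAKE_THRESHOLD = -4.0    # m/s^2 of longitudinal deceleration, "abrupt braking" (assumed)
STEER_THRESHOLD = 60.0    # deg/s of steering-wheel rotation, "swerving" (assumed)

def flag_events(samples):
    """samples: list of dicts with 't' (s), 'accel_long' (m/s^2) and 'steer_rate' (deg/s)."""
    events = []
    for s in samples:
        if s["accel_long"] < BRAKE_THRESHOLD:
            events.append((s["t"], "abrupt_braking"))
        if abs(s["steer_rate"]) > STEER_THRESHOLD:
            events.append((s["t"], "swerving"))
    return events
```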
3.1.3 Structured social tasks
Tasks located in a structured social setting are both common and important. In that category, there has been particular interest in meetings, and the AMI databases are particularly influential [15]. Emotion in a strong sense is not salient, but there are abundant signs involving various kinds of gesture, politeness, backchannelling, and so on. More generally, there is a good deal of work on group processes, concerned with issues such as dominance, support, and so on [21]. Different sets of signals come to prominence there, involving issues such as time spent speaking and frequency of interrupting or being interrupted. An intriguing extreme has been considered by Hansen, who recorded soldiers being interrogated by a military tribunal [23]. It is most unlikely that emotion in that setting would be expressed in the same way as it might be in a fairground ride.
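A small sketch shows how two of the signals just mentioned, time spent speaking and interruption counts, could be computed from diarised turns. The turn format is an assumption made for illustration; corpora such as AMI provide much richer annotation than this.

```python
# Minimal sketch of group-level floor statistics (speaking time and interruptions)
# from diarised turns. The (speaker, start, end) format is an assumption.
from collections import defaultdict

def floor_statistics(turns):
    """turns: list of (speaker, start, end) tuples in seconds, sorted by start time."""
    speaking = defaultdict(float)
    interruptions = defaultdict(int)   # speaker -> times they started inside another's turn
    for i, (spk, start, end) in enumerate(turns):
        speaking[spk] += end - start
        for prev_spk, prev_start, prev_end in turns[:i]:
            if prev_spk != spk and prev_start < start < prev_end:
                interruptions[spk] += 1
    return dict(speaking), dict(interruptions)

turns = [("A", 0.0, 12.0), ("B", 10.5, 20.0), ("A", 19.0, 25.0)]
print(floor_statistics(turns))   # B interrupts A once; A interrupts B once
```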
3.1.4 Cultural and individual differences
In principle, all of the issues that have been discussed are affected by cultural and individual differences. The effects involve both behaviour and sensitivity. In practice, relatively few databases show how individual and cultural differences are expressed [22].

3.1.5 Overview of diversity

The question behind this section is how many dimensions of diversity databases need to reckon with, and how subtly those dimensions interact. It is disturbing, but entirely possible, that a genuinely satisfying set of databases needs to cover all combinations of all the variables implicit in the discussion.
3.2. Multiplying the enormity of labelling

Many of the topics considered in the previous section have distinctive descriptive schemes associated with them. For example, there are well-established ways of coding the way a dialogue is being conducted. Abelin has proposed an extensive scheme (called MUMIN) for coding dialogue that involves non-verbal elements [16]. Interaction Process Analysis [2] is a well-known system that codes the underlying goals and social effects of complex, sustained interactions at a more abstract level. More detailed schemes have been developed to code subtle but powerful communicative phenomena like irony. The list could easily be extended.

Clearly not all types of material need all of the types of coding that are relevant to the area as a whole. However, it seems fair to think in terms of a default set of issues that one would expect to be coded in order to capture the kind of information that is likely to be involved in rich,
spontaneous communication. Such a scheme would clearly cover far more than, for instance, either the HUMAINE database scheme or MUMIN. Implementing even a minimal version of that kind of scheme would take staggering amounts of effort.
3.3. The scale of the task

It is natural to imagine a situation where a team working on a new kind of detection mechanism could go to a collection of existing databases and rely on finding material that presented the relevant kinds of signals and the relevant kinds of psychological state, both interleaved with other elements in a way that does not seriously misrepresent the task that is likely to occur in real life. Thinking through the issues suggests that enormous amounts of effort would have to be invested in database construction to reach that point.

The alternative route is for each team to build databases that meet its own needs. That raises its own problems. It means that teams need to acquire the expertise needed to create databases that adequately represent the problem to be addressed, and invest large amounts of effort in constructing them. There is a clear risk that databases made in that way will not do full justice to the problems, and comparing results becomes all but impossible.
4. Towards a strategy

It would be serious for the health of the whole field if the development of databases were not appropriately directed and supported. This section outlines some of the issues that are central to appropriate development.

What to record? Several strategies for collecting material make sense, and complement each other. Collections like the HUMAINE database represent a breadth-first strategy: they deliberately try to represent the range of material that needs to be considered. They have an important role in developing widely applicable labeling schemes, for instance. Collections like the AMI database represent a depth-first strategy: they identify a promising application area, and assemble a large amount of relevant material. AMI's application area is meetings: others might be instruction, co-operative work settings, crowd behaviour. A third strategy which also makes sense might be called surface-first. It is directed towards detection of signs rather than complex states, and aims to provide balanced samples to train low-level detectors. Collections like EmoTaboo and MMI are broadly in that category.

How to record? Data collection is massively complicated by the need to cater to low- and mid-level analysis which, by human standards, is dismal. Recording is dominated by highly artificial lighting, camera angles and microphone arrays because information cannot be extracted from recordings that the human eye and ear
process easily; and the need to reproduce the exact configuration involved in an application multiplies the amount of recording needed. Addressing rich, spontaneous communication provides a very strong incentive to aim towards the kind of constancy that human sensory systems achieve, rather than placing the burden on the recording process.

Invest in automatic feature recognition. Related to the last point, it is important to maximize the amount of labeling that can be done automatically. It is highly inefficient to rely on human coders to extract head pose, gaze direction, action units, hand shape, and so on. A minimal sketch of that kind of automation is given after this list of issues.

Consolidate labeling tools and vocabulary. There are few, if any, accepted models, let alone standards, for the mid- and high-level descriptions relevant to coding rich, spontaneous communication. A coherent research effort is impossible without convergence at that level.

Establish genuine interdisciplinarity. Related to the last point, engaging with rich, spontaneous communication calls for skills that are not part of a standard engineering education. Psychologists and social scientists often do have the relevant skills, but rarely understand the constraints that technological research has to deal with. It is a core challenge to create career frameworks where the different sets of skills can interact effectively and prosper.

Establish gigabyte publishing. Work with genuinely large databases is beset with practical problems - funding them, making them available, searching them, and (not least) receiving academic credit for them. Progress on those issues is a sine qua non - without it, nothing.
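As an illustration of the 'invest in automatic feature recognition' point, the sketch below automates one low-level sign label, face position per frame, using OpenCV's stock Haar-cascade detector. It is a minimal example rather than a recommended pipeline; serious systems would add landmark, head-pose and action-unit estimation, but the principle of letting software produce the low-level annotation track is the same.

```python
# Minimal sketch: produce a low-level annotation track (face bounding box per frame)
# automatically with OpenCV, instead of relying on human coders. Illustrative only.
import cv2

def face_track(video_path):
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back to an assumed frame rate
    track = []                                # (time_in_seconds, [x, y, w, h] or None)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        box = list(faces[0]) if len(faces) else None
        track.append((frame_idx / fps, box))
        frame_idx += 1
    cap.release()
    return track
```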
5. Conclusion

Understanding rich, spontaneous communication is not just an engineering problem. It is hard to think of more fundamental challenges. Progress depends on resources; and no resource is more important than data.
References

[1] J. Ang et al. Prosody-based automatic detection of annoyance and frustration in human-computer dialog. Proc. ICSLP 2002, Denver, Colorado, Sept. 2002.
[2] R. Bales. Interaction Process Analysis: A Method for the Study of Small Groups. Addison-Wesley, Cambridge, 1950.
[3] A. Batliner, K. Fischer, R. Huber, J. Spilker & E. Noth. How to find trouble in communication. Speech Communication 40, 2003, 117-143.
[4] N. Campbell et al. Resources for the Processing of Affect in Interactions. Report of Panel on Resources for the Processing of Affect in Interactions, LREC 2006.
[5] R. Cowie & R. Cornelius (2003). Describing the Emotional States that are expressed in Speech. Speech Communication, 40, 5-32.
[6] E. Douglas-Cowie et al. The HUMAINE Database: Addressing the Collection and Annotation of Naturalistic and Induced Emotional Data. Proc. ACII 2007, pp. 488-500.
[7] L. Devillers et al. (2006). Real life emotions in French and English TV video clips: an integrated annotation protocol combining continuous and discrete approaches. Proceedings LREC 2006.
[8] P. Ekman (1993). Facial expression of emotion. American Psychologist, 48, 384-392.
[9] P. Ekman & W. Friesen (1975). Pictures of facial affect. Palo Alto, CA: Consulting Psychologists' Press.
[10] R. El Kaliouby & P. Robinson. Mind reading machines: automated inference of cognitive mental states from video. IEEE International Conference on Systems, Man and Cybernetics, 2004, vol. 1, pp. 682-688.
[11] N. Fragopanagos & J. Taylor. Emotion recognition in human-computer interaction. Neural Networks 18, 2005, 389-405.
[12] S. Frigo. The Relationship between Acted and Naturalistic Emotional Corpora. Proceedings of Workshop on Corpora for Research on Emotion and Affect, LREC 2006.
[13] D.K.J. Heylen, A. Nijholt and M. Poel. Generating Nonverbal Signals for a Sensitive Artificial Listener. In A. Esposito et al. (eds) Verbal and Nonverbal Communication Behaviours. Springer Verlag, Berlin, 2007, pp. 264-274.
[14] http://autismcoach.com/Mind%20Reading.htm
[15] http://corpus.amiproject.org/
[16] http://www.cst.dk/mumin/workshop200406/MUMIN-coding-scheme-v1.3.doc
[17] A.M. Martinez et al. The RVL-SLLL ASL Database. Proc. IEEE International Conference on Multimodal Interfaces, 2002.
[18] E. McMahon et al. Multimodal records of driving influenced by induced emotion. Proc. of Workshop on Corpora for Research on Emotion and Affect, LREC 2008.
[19] M. McRorie & I. Sneddon. Real Emotion is Dynamic and Interactive. Proc. ACII 2007, pp. 759-760.
[20] M. Rehm & E. Andre. More Than Just a Friendly Phrase: Multimodal Aspects of Polite Behavior in Agents. In T. Nishida (ed.) Conversational Informatics, pp. 69-84.
[21] http://www.idiap.ch/%7Evincia/papers/bravetopic.pdf
[22] http://www.isaga2007.nl/uploads/pp/p71-pres.pdf
[23] V. Varadarajan, J. Hansen, I. Ayako. UT-SCOPE: A corpus for Speech under Cognitive/Physical task Stress and Emotion. Proceedings of Workshop on Corpora for Research on Emotion and Affect, LREC 2006.
[24] M. Yinguang, H. Paterson & F. Pollick. A motion capture library for the study of identity, gender, and emotion perception from biological motion. Behavior Research Methods, vol. 38, 2006, pp. 134-141.
[25] A. Zara et al. Collection and Annotation of a Corpus of Human-Human Multimodal Interactions: Emotion. Proc. ACII 2007, pp. 464-475.