Audiovisual cues to finality

Pashiera Barkhuysen, Emiel Krahmer, Marc Swerts
Faculty of Arts, Tilburg University, Tilburg, The Netherlands
Abstract

This paper has two goals: first, to describe a paradigm that combines the production and perception of speech; second, to illustrate this paradigm with a case study. The question explored here is whether there are auditory and visual cues, such as intonation and facial expressions, that speakers exploit to signal whether or not they want to continue their turn. In a perception experiment, participants were able to predict the end of the utterance, and reacted fastest in the audiovisual condition.
Keywords: facial expressions, conversation, turn-taking, finality, intonation
1 Introduction

1.1 Problem statement

The main research question is how one can measure spontaneous facial expressions occurring in natural conditions. Different attempts have been made to develop human-observer-based coding systems, some of which are anatomically based (Cohn, Xiao, Moriyama, Ambadar, & Kanade, 2003). These methods, however, are often time-consuming. Moreover, it is often established in advance what to look for, using deliberate, sometimes posed visual expressions. Research suggests that deliberate visual expressions differ in a fundamental way from spontaneous ones: they are more intense, and they are inferior and less symmetric in appearance, the latter because they are controlled by different motor pathways (Cohn et al., 2003).

1.2 Production-Perception Paradigm

Our goal, however, is to study visual expressions occurring in natural conditions, expressions which themselves cannot be controlled. The conditions under which they occur, on the other hand, can be controlled. We have developed a paradigm in which we evoke visual expressions in a semi-spontaneous but controlled way, so that we know the exact context of these expressions (Barkhuysen, Krahmer, & Swerts, 2005; Krahmer & Swerts, in press; Swerts & Krahmer, in press). We take the role of the naive observer who does not know what to look for. By controlling the exact circumstances in which the expressions are evoked, we can at least estimate what the possible functions of the expressions are, so that we may be able to compare function to form at a later stage. This first part of the process, designed to evoke visual expressions, is described in section 2.1 on the production task.

The next step in the design is the perception task. Here we extract samples of these expressions from their context and present them in a different experimental setup to naive observers, who have to judge these samples. As the observers do not know the context of the samples, they can base their judgments solely on the cues available in the sample; in effect, they have to reconstruct the context in a controlled way. If the judgments match the context (i.e. the context can be reconstructed), there is reason to believe that the available cues carry information value. More about the production-perception paradigm can be found at http://foap.uvt.nl.

It is useful to consider the distinction between measurement and judgment of stimuli. On the one hand, it is possible to measure actual facial behavior with EMG (electromyography), MRI scans (magnetic resonance imaging) or 'objective', anatomically based coding systems (Ross, 1999; Wagner, 1997). The disadvantage of these methods, however, is that they do not shed light on the information that is conveyed by the facial behavior. To address that question, judgment studies have been widely used (Scherer, 2003; Wagner, 1997). Within judgment studies, different types can be distinguished, such as the forced-choice method, in which observers decide to which response category (e.g. finality) a certain stimulus (e.g. a film fragment) belongs (Wagner, 1997).

1.3 Labeling and Cue Masking

Our paradigm can be regarded as an implementation of judgment studies combined with controlled elicitations, on the basis of which we can work towards a more exact measurement of individual expressions. Once it has been shown that informative features are available in the chosen film fragments, and it has been established in which samples they occur (i.e. the samples that turn out to be statistically relevant), we can start to search for individual features. This refinement can be achieved in various ways.

One way is to use inter-observer reliability, e.g. the kappa statistic (Carletta, 1996; Fridlund, 1994; Scherer, 2003; Wagner, 1997); a minimal sketch of this computation is given at the end of this introduction. A number of observers look for specific features in the data. Whenever a feature is encountered, it is marked on a binary or gradual scale, and the individual scores are compared statistically. If agreement is sufficiently high, the feature can be regarded as present. These features can then be compared with the judgment scores in the perception task (Scherer, 2003): the correlations will tell whether the features that are present serve the functions under investigation.

Another way to refine the judgment scores is to extract parts of the original samples that seem promising. For example, if there is possible cue value in the eyebrows, one can extract the upper part of the face from a sample and display it in a judgment experiment as well. This procedure can be regarded as a form of cue masking (Scherer, 2003). In this example, correlations between the new and old judgment scores will reveal whether the judgments are mainly based on the upper part of the face, and thus possibly on the eyebrows. In this way the model can be refined.

1.4 Case Study

The remainder of this article focuses on a case study in which we apply the paradigm to audiovisual cues to finality. During spoken interactions, conversation participants are able to adequately detect when their partner finishes a turn, so that they can elegantly take over without much overlap or delay. Previous research showed that speakers partly base such decisions on intonational cues, e.g. low-ending contours being reserved for turn-final position (Swerts, Bouwhuis, & Collier, 1994). The question explored here is whether there also exist visual cues, such as facial expressions, that speakers exploit to signal whether or not they want to continue their turn.
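As announced in section 1.3, the sketch below illustrates the inter-observer reliability step with Cohen's kappa for two observers who each mark, per film fragment, whether a feature (e.g. an eyebrow movement) is present. The function name and the toy annotations are our own illustrative assumptions, not material from the study itself.

from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: chance-corrected agreement between two observers."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: proportion of fragments both observers labelled identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement, from each observer's marginal label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n)
              for lab in set(ratings_a) | set(ratings_b))
    return (p_o - p_e) / (1 - p_e)

# Toy example: binary "feature present" marks for 10 film fragments.
observer_1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
observer_2 = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(round(cohens_kappa(observer_1, observer_2), 2))  # prints 0.58

The same scores could equally be fed into a standard implementation; the point is only that agreement is corrected for chance before a feature is accepted as present.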
2 Case Study: Finality

2.1 Production task

We collected utterances from 8 speakers, who were asked simple questions that elicited sequences of nouns during an interview in front of a camera. The questions were intended to evoke lists of words: questions requiring general knowledge, like "What are the colors of the Dutch flag?", or questions eliciting a set of numbers, like "What are the odd numbers between five and fifteen?". The lists that were asked for varied in length from 3 to 5 words.

2.2 Perception task

We further implemented the design by taking a selection of the utterances collected in the production task, consisting of the speakers' answers without the preceding question of the interviewer. We subsequently presented these to 30 participants in a perception experiment, in one of the following modes: the original film containing audio and video (audio-visual or AV), only the visual part of the material (vision-only or VO), or only the auditory part while the visual channel depicted a static black screen (audio-only or AO). The participants' task was to indicate as fast as possible when they felt a speaker's utterance had reached its end, by pressing a dedicated button at that exact moment. All subjects participated in all three conditions (and none of them had participated as a speaker in the data collection phase).

To obtain a baseline performance, the actual experiment was preceded by a test in which subjects had to respond to stimuli without finality cues. The aim of this baseline session was to find out how long it took subjects on average to respond to a simple stimulus presented in a certain modality, and to control for inter-individual differences. The subjects had to press the button as soon as the end of the stimulus was reached. These stimuli consisted of (bimodal or unimodal) presentations of a video still (a single frame of one of the speakers) and/or a monotonous sound (a stationary /m/ uttered by a male or female voice matching the sex of the pictured speaker), creating the impression of a speaker uttering a prolonged "mmm".
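To illustrate how responses in such a setup can be scored, the sketch below computes, per modality, the mean offset of the button press relative to the actual end of the utterance, and subtracts the modality-specific baseline reaction time. All data structures and numbers here are invented placeholders; the sketch shows one plausible scoring scheme, not the analysis actually used in the experiment.

from statistics import mean

# Hypothetical response logs: for each trial, when (ms from stimulus onset)
# the button was pressed and when the utterance actually ended. Placeholder values.
trials = {
    "AV": [(2950, 2500), (3100, 2650)],   # (press_time, utterance_end)
    "VO": [(3300, 2500), (3350, 2650)],
    "AO": [(3050, 2500), (3180, 2650)],
}

# Hypothetical baseline: mean simple reaction time per modality from the "mmm" stimuli.
baseline = {"AV": 430.7, "VO": 343.8, "AO": 399.6}

for modality, presses in trials.items():
    # Raw reaction time: how long after the actual end the button was pressed.
    raw_rt = mean(press - end for press, end in presses)
    # Subtracting the baseline removes modality-specific simple reaction time.
    corrected = raw_rt - baseline[modality]
    print(f"{modality}: raw {raw_rt:.0f} ms, baseline-corrected {corrected:.0f} ms")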
Figure 1. The speakers SS and BB while uttering the first and middle word, and just after the final word, of a three-word answer such as "red, blue, white".
3 Results

The results are displayed in Table 1. Participants were able to predict the end of the utterance, and showed the fastest reactions in the audiovisual condition. In the baseline session, however, reaction times were slower for the audiovisual stimuli.

Table 1. Reaction time in milliseconds for the different conditions in the baseline session and the experiment.

        Baseline    Experiment
AV      430,713     508,757
VO      343,817     688,216
AO      399,567     532,751
4 Discussion and conclusion

The findings of the baseline session suggest that cognitive load is higher when subjects need to attend to two modalities at the same time. In the actual experiment, however, the pattern is reversed, suggesting that perception may be facilitated when two different information sources (visual and auditory) are congruent in their cue value. This case study provides a good basis for further, more detailed analysis, which can be achieved via independent labeling or cue-masking experiments, as described in section 1.3. We believe that the production-perception paradigm can be easily implemented and provides clear and well-interpretable results.
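One way to make the baseline/experiment comparison above concrete is to compute, per modality, how much slower the experimental reaction times are than the corresponding baseline times from Table 1. The short sketch below does this, assuming the decimal comma in the table is read as a decimal point; the smallest increase relative to baseline occurs in the AV condition.

# Means from Table 1 (milliseconds; decimal comma read as decimal point).
baseline   = {"AV": 430.713, "VO": 343.817, "AO": 399.567}
experiment = {"AV": 508.757, "VO": 688.216, "AO": 532.751}

# Per-modality slowdown of the experiment relative to the baseline session.
for modality in ("AV", "VO", "AO"):
    diff = experiment[modality] - baseline[modality]
    print(f"{modality}: +{diff:.1f} ms")
# Prints: AV: +78.0 ms, VO: +344.4 ms, AO: +133.2 ms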
5 References

Barkhuysen, P. N., Krahmer, E., & Swerts, M. G. J. (2005). Problem detection in human-machine interactions based on facial expressions of users. Speech Communication, 45(3), 343-359.

Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), 249-254.

Cohn, J. F., Xiao, J., Moriyama, T., Ambadar, Z., & Kanade, T. (2003). Automatic recognition of eye blinking in spontaneously occurring behavior. Behavior Research Methods, Instruments, & Computers, 35(3), 420-428.

Fridlund, A. J. (1994). Human facial expression: An evolutionary view. San Diego, CA: Academic Press.

Krahmer, E., & Swerts, M. G. J. (in press). How children and adults produce and perceive uncertainty in audiovisual speech. Language and Speech.

Ross, E. D. (1999). Affective prosody and the aprosodias. In M.-M. Mesulam (Ed.), Principles of behavioral and cognitive neurology (pp. 316-331).

Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40(1-2), 227-256.

Swerts, M. G. J., Bouwhuis, D. G., & Collier, R. (1994). Melodic cues to the perceived finality of utterances. Journal of the Acoustical Society of America, 96(4), 2064-2075.

Swerts, M. G. J., & Krahmer, E. (in press). Audiovisual prosody and feeling of knowing. Journal of Memory and Language.

Wagner, H. L. (1997). Methods for the study of facial behavior. In J. A. Russell & J. M. Fernández-Dols (Eds.), The psychology of facial expression (pp. 31-54). Cambridge: Cambridge University Press.