Using head movement to detect listener responses during multi-party dialogue

Stuart A. Battersby & Patrick G.T. Healey
Queen Mary University of London, Interaction, Media & Communication Research Group,
School of Electronic Engineering & Computer Science, London, E1 4NS
[email protected], [email protected]

Abstract

Multi-party interactions present some unique problems for the analysis of non-verbal behaviour: the potential complexity of non-verbal cues increases as the number of participants increases. Our analytical approach involves a hybrid of human annotation of video data and machine analysis of motion capture data. The literature often cites eye gaze and head orientation as the critical cues for managing dialogue. In multi-party dialogue the relative importance of these cues changes; participants can only monitor the eye gaze of one person at a time. In previous work we identified the relative contributions of changes in speaker head orientation and gesture orientation to listener responses, using data from a corpus of 3-person interactions. Here we detail the methodology used to analyse the motion capture data and combine it with human annotations, and we discuss two approximations of recipient responses designed to highlight interactional patterns: head reorientations and head nods.

1. Introduction

Face-to-face dialogue is an open-ended activity, and this presents challenges for the analysis of non-verbal behaviour. One way to overcome this is to impose constraints upon participants. A common approach is to ask participants to retell a story, typically a scene from a cartoon, to camera. Whilst such an approach provides a narrative in a controlled environment in which the relationship between gesture and speech can be examined, it masks the interactive nature of dialogue (see Goodwin (1979) for evidence that the verbal production of an utterance is dynamic and changes depending upon the actions of the addressee(s)). A demonstration of interactive non-verbal behaviour can be found in Furuyama (2000), who shows that, during an origami tuition task, interlocutors are able to produce gestures that are a product of their collaboration. Similarly, Morency et al. (2008) show that interlocutors' movements, such as nodding, are influenced by the local interactional context, and Bavelas et al. (1992) discuss the notion of interactive gestures: a class of gesture which refers to the participants rather than to the topic of discussion and which also helps to manage the dialogue.

Multi-party dialogue provides a more complex environment in which interlocutors must coordinate their dialogue activity than dyadic dialogue does. The literature often cites eye gaze and head orientation as critical cues to activities such as turn taking (see e.g. Argyle (1975)). In a study of listener gaze patterns, Gullberg (2003) found that listeners fixate the speaker's face 96% of the time, with only 0.5% of the time spent looking at the speaker's gestures. The concern here is that, whilst these cues may be representative of dyadic dialogue, they do not necessarily generalise to multi-party dialogue. Interlocutors can only monitor the eye gaze of one person at a time, and Loomis et al. (2008) demonstrate that in situations such as small group conversations we are only able to reliably judge another's eye gaze within a 4° rotation, whereas head orientation can be effectively judged up to rotations of 90°. The eye gaze issue is compounded because, when there are more than two people in the dialogue, who is looking at whom matters more: it can, for example, help to determine the intended addressee of the word you (Frampton et al., 2009; Purver et al., 2009). Utterances and actions can have different interpretations depending upon both speaker and listener orientations.

Battersby and Healey (submitted) compare the significance of changes in speaker head orientation and gesture orientation during 3-way conversations by measuring the responses of both recipients. Two measures of response are used: head reorientations (e.g. does the recipient turn their head from the speaker to the third person in the interaction?) and head nods. The contributions of hand movements and head movements are compared and, in contrast to previous literature, it is shown that changes in speaker hand orientation elicit more and faster responses from recipients than changes in head orientation. Motivated by the need to analyse these data, we adopted an analytical approach which uses both human annotation of video and machine analysis of motion capture data (a similar hybrid approach has been employed in Jokinen (2010)). Human annotations were used for identifying the speaker's target events, whilst machine analysis was used for detecting recipient responses. The critical consideration in designing the machine analysis was to create measures that allow us to approximate changes of orientation and head nods, and hence to draw comparisons between people or conditions, rather than to detect events precisely.

In this paper we focus on the methodology used in Battersby and Healey (submitted), starting with a description of the corpus data, followed by a discussion of the details of the techniques; finally we present brief results and consider potential extensions and modifications to our approach.

2. Details of the corpus

The corpus used for the study was collected in the Augmented Human Interaction (AHI) lab at Queen Mary. This lab houses an optical motion capture system and video recording equipment. 33 participants took part in the study in groups of 3, meaning the corpus consists of 11 triads. Six tuition tasks were developed, each consisting of either a short Java program or a description of a system of government. The material was all text based, with no graphical representations. Each group performed three rounds of tuition. On the first round one member of the triad was assigned the role of 'learner' and the remaining two members were assigned the role of 'instructor'. On subsequent rounds these roles were rotated so that each person was a learner once. On each round the instructors were given printed tuition material which they were asked to teach collaboratively to the learner. They were allowed to familiarise themselves with the material prior to the tuition and then returned it to the experimenter; during this time the learner was removed to another room. The learner and the instructors then sat on stools in the AHI lab and the tuition commenced. There was no time limit and no restrictions other than that the participants were not allowed to use pen and paper. To motivate the participants to teach and learn, a post-completion test was used.

3. Methodology

Figure 1: Three participants during a round of tuition

3.1. Coding video data

Each round of tuition is recorded by three video cameras: one above and one to either side of the participants (see Figure 1). These cameras are software-synchronised via networked computers so that they start and stop together. The videos are imported into ELAN, where they are coded for the target events. A code was made for every visible change in speaker head or gesture orientation relative to the recipients. Each of these codes was then sub-coded using the following classification:

• Head Moves: Here the head orientation changes but the gesture remains stationary. For example, the speaker may be gesturing towards the primary recipient and glance (by turning their head) towards the other, secondary, recipient.

• Hand Moves: Here the gesture moves but the head orientation remains stationary. For example, the speaker could be making a palm-up gesture towards the primary recipient and, whilst continuing to look at them, turn the gesture so that it is oriented to the secondary recipient.

• Both Move: Here both the gesture and the head shift orientation. For example, the speaker could be pointing towards the primary recipient and then turn their point, along with their head orientation, towards the secondary recipient.

All coding was performed by the first author. A random sample of 25 events, including target events and control events, was coded by a second rater. The inter-rater reliability was good (Kappa = 0.78, p < 0.001).
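To illustrate how an agreement figure of this kind can be computed, the following is a minimal sketch using scikit-learn's cohen_kappa_score; the label lists shown are placeholders rather than the actual codings from the corpus.

```python
# Minimal sketch: computing inter-rater agreement (Cohen's kappa) for the
# event classification. The label lists below are placeholders, not the
# actual codings from the corpus.
from sklearn.metrics import cohen_kappa_score

# Hypothetical codes assigned by the two raters to the same sample of events.
rater_1 = ["head", "hand", "both", "head", "control", "hand"]
rater_2 = ["head", "hand", "both", "hand", "control", "hand"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")
```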

3.2. Machine Analysis

3.2.1. Overview of the motion capture system

The motion capture system used is an optical system supplied by Vicon. For this study the system was set to record data at 60 frames per second. Each participant wears an upper-body motion capture suit and a baseball cap. 27 reflective markers are attached to these, which are then tracked by an array of 12 infra-red cameras. The software supplied with the system uses the data from each camera to reconstruct the 3D scene, providing the coordinates of each marker on each frame of data. Once the data has been labelled it can be exported as a tab-delimited file and used in other software. We have written analysis software in Python that reads the data files for the whole corpus and performs the quantitative analysis.
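As an illustration of this step, the sketch below shows one way the exported per-frame marker coordinates could be read into Python. The column names (frame, marker, x, y, z) and the file name are assumptions for illustration; the actual layout of the Vicon export may differ.

```python
# Minimal sketch of loading an exported motion capture file. The exact column
# layout of the export is not specified here, so the names below
# (frame, marker, x, y, z) are illustrative assumptions.
import csv
from collections import defaultdict

def load_mocap(path):
    """Return {marker_name: [(frame, x, y, z), ...]} from a tab-delimited export."""
    trajectories = defaultdict(list)
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            trajectories[row["marker"]].append(
                (int(row["frame"]), float(row["x"]), float(row["y"]), float(row["z"]))
            )
    return trajectories

# Example usage (hypothetical file and marker names):
# head_front = load_mocap("triad01_round1.tsv")["P1_HeadFront"]
```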

3.2.2. Extracting Head Orientation

Figure 2: Head orientation detection

We extract head orientation data for each participant by using the markers on the head to create a vector which extends forwards from their forehead. Whilst it is impossible to determine the exact focus of attention for a person, we approximate attention by comparing the current head orientation to a centre line down the interaction space. For any one person, this centre line is a vector that originates at the rear centre of the head and extends midway between the other two people based upon their current positions (see Figure 2). This line forms a bisection of the interaction space, indexed by the rear of the head. We update this line on every frame of data and use it to determine who is orienting to whom. If a head orientation is within 2° of the centre line we exclude that data. In addition to using the head orientation as an indicator of focus of attention, we also use it to detect turning of the head, or reorientations: we look for times at which a participant's head angle crosses over the centre line and class these as instances of a reorientation.
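The sketch below illustrates the centre-line comparison and reorientation detection described above, simplified to the horizontal plane. The marker inputs, the 2D simplification and the function names are illustrative assumptions rather than the exact implementation used in the study.

```python
# Minimal sketch of the centre-line comparison described above, working in the
# horizontal (x, y) plane. Marker names and the simplification to 2D are
# assumptions for illustration.
import math

def signed_angle(v1, v2):
    """Signed angle in degrees from vector v1 to vector v2 (2D)."""
    return math.degrees(math.atan2(v1[0] * v2[1] - v1[1] * v2[0],
                                   v1[0] * v2[0] + v1[1] * v2[1]))

def head_side(rear_head, forehead, other_a, other_b, dead_zone_deg=2.0):
    """Classify which of the two others the head is oriented towards.

    Returns 'a', 'b', or None when within the +/- 2 degree dead zone of the centre line.
    """
    # Vector extending forwards from the forehead.
    gaze = (forehead[0] - rear_head[0], forehead[1] - rear_head[1])
    # Centre line: from the rear of the head to the midpoint of the other two people.
    mid = ((other_a[0] + other_b[0]) / 2.0, (other_a[1] + other_b[1]) / 2.0)
    centre = (mid[0] - rear_head[0], mid[1] - rear_head[1])
    angle = signed_angle(centre, gaze)
    if abs(angle) < dead_zone_deg:
        return None
    # Person 'a' lies on one side of the centre line; compare signs to classify.
    return "a" if (signed_angle(centre, (other_a[0] - rear_head[0],
                                         other_a[1] - rear_head[1])) > 0) == (angle > 0) else "b"

def reorientations(sides):
    """Frames at which the classified side flips, i.e. the head crosses the centre line."""
    events, prev = [], None
    for frame, side in enumerate(sides):
        if side is not None and prev is not None and side != prev:
            events.append(frame)
        if side is not None:
            prev = side
    return events
```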

3.2.3. Extracting Head Nods

As with measuring head angle, it is impossible to identify a nod exactly. However, given a set of parameters we are able to create a measure that can be used for comparison.

The data used for the analysis come from a front head marker in the vertical axis. This coordinate data can be interpreted as a signal. As the signal is global to the whole 3D scene, along with nodding it will contain a variety of movements, including prosodic body sways, gross shifts in posture, other head movement and unintentional body movements. We apply signal processing techniques in order to filter out some of this unwanted information. We first shift the zero position of the signal to the mean position of the head marker (see Figure 3). This gives values that show movement above and below the mean position.

Figure 3: Zeroed-to-mean head marker signal showing all movement. Y axis: amplitude (-40 to 30), X axis: frame number (time)

For the next step we exclude any low-frequency movements that could account for body sway etc. by removing frequencies below 2Hz. We then remove high-frequency movements that could be caused by body shakes, camera error etc. by removing frequencies above 8Hz. This filtering leaves a range of frequencies that could plausibly be the frequency of a nod (see Figure 4).

Figure 4: Filtered head marker data. Y axis: amplitude (-4 to 4), X axis: frame number (time)

We then detect all the peaks and troughs in the signal (which could represent the top and bottom of a nod) that have a movement greater than 1.5cm and produce a new signal of these values. We remove from this signal any movement greater than 5cm from the mean position, as this is more likely to be the result of a posture shift than a head nod. Taking the resulting signal, we invert the troughs so that we have only a positive signal representing the motion that we have narrowed down (see Figure 5).

Figure 5: Peaks and inverted troughs. Y axis: amplitude (0 to 4), X axis: frame number (time)

By applying a finite smoothing filter to this signal (using a window of circa 0.5 seconds) we create a signal which now represents areas of motion (see Figure 6). We define a cut-off point of 0.1, above which we code for the presence of a 'nod'.

Figure 6: Smoothed final signal which represents periods of nodding. Y axis: amplitude (0 to 0.35), X axis: frame number (time)
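The following is a minimal sketch of the nod-detection pipeline described above, using SciPy. The choice of a Butterworth band-pass filter and a moving-average smoother is an assumption (the exact filter design is not specified here); the parameter values follow the text (a 2-8Hz band, 1.5cm and 5cm amplitude limits, a roughly 0.5 second window and a 0.1 cut-off).

```python
# Minimal sketch of the nod-detection pipeline. The Butterworth band-pass and
# the moving-average smoother are assumptions; parameter values follow the text.
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

FPS = 60  # motion capture frame rate

def nod_regions(vertical_pos_cm):
    signal = np.asarray(vertical_pos_cm, dtype=float)
    signal = signal - signal.mean()                       # zero the signal to the mean position

    # Band-pass 2-8 Hz to keep plausible nodding frequencies.
    b, a = butter(4, [2.0, 8.0], btype="bandpass", fs=FPS)
    band = filtfilt(b, a, signal)

    # Peaks and troughs with amplitude between 1.5cm and 5cm.
    peaks, _ = find_peaks(band, height=1.5)
    troughs, _ = find_peaks(-band, height=1.5)
    extrema = np.zeros_like(band)
    for idx in np.concatenate([peaks, troughs]):
        amp = abs(band[idx])                              # troughs are inverted to positive values
        if amp <= 5.0:
            extrema[idx] = amp

    # Smooth with a ~0.5s moving-average window and threshold at 0.1.
    window = int(0.5 * FPS)
    smoothed = np.convolve(extrema, np.ones(window) / window, mode="same")
    return smoothed > 0.1                                 # boolean mask of 'nodding' frames
```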

3.3. Detecting responses

Our main focus was to compare response rates for particular classes of events. For this we need to judge whether or not a response has occurred. The ELAN transcription files contain the time information for each target event that has been annotated. These are imported into the software and combined with the times of head reorientations and head nods. For each target event we judge the primary recipient to be the recipient that the speaker is oriented to at the start of the event; the remaining person is the secondary recipient. This is judged using the head orientation data. When a target event occurs, a 5 second bracket following the event is used to identify any change in orientation and any nod from either of the recipients. The result is the frequency and delay time of responses for each class of target event.

In order to compare the target events to a baseline of head orientation and head nodding throughout the dialogues, we select random times during a speaker's utterance (ensuring that these do not coincide with an actual target event). This collection of baseline events is put through the same response detection process, and the two groups of events are compared for statistical significance.
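A minimal sketch of this response-detection step is given below: for each target event time we look for the first recipient response within a 5 second bracket, and baseline times are sampled at random from within speaker utterances while avoiding target events. The data structures used here are illustrative assumptions.

```python
# Minimal sketch of response detection and baseline sampling. Event and
# response representations (times in seconds) are illustrative assumptions.
import random

RESPONSE_WINDOW = 5.0  # seconds

def first_response(event_time, response_times, window=RESPONSE_WINDOW):
    """Return the delay to the first response inside the bracket, or None."""
    delays = [t - event_time for t in response_times
              if 0.0 <= t - event_time <= window]
    return min(delays) if delays else None

def baseline_times(utterances, target_times, n, min_gap=RESPONSE_WINDOW, max_tries=10000):
    """Pick n random times inside speaker utterances that avoid target events.

    `utterances` is a list of (start, end) pairs in seconds.
    """
    samples, tries = [], 0
    while len(samples) < n and tries < max_tries:
        tries += 1
        start, end = random.choice(utterances)
        t = random.uniform(start, end)
        if all(abs(t - tt) > min_gap for tt in target_times):
            samples.append(t)
    return samples
```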

4. Results

Full details of the results are reported in Battersby and Healey (submitted). Over 2 hours and 54 minutes of dialogue, 287 target events were identified. Recipient head orientation was shown to be significant: primary recipients orient to the speaker 68% of the time, whereas secondary recipients' orientation is split 50-50 between the speaker and the primary recipient (χ² = 16.9, p < 0.001). The target events were consequential for the interaction: they elicited a 48.6% head reorientation response rate compared with the 41.3% baseline rate (χ² = 5.75, p < 0.05). We found that changes of orientation of the speaker's hand reliably produced faster and more frequent recipient responses (a significant head reorientation response rate of 63%) than movements of the head (for which there was no reliable difference from the baseline).

In addition to these results we can also report on the effect of recipient entry orientation and dialogue role on the response rate. For both the primary recipient and the secondary recipient, we compare their response rates for head reorientations and nodding, depending on whether or not they are looking at the speaker. When measuring head reorientations, we observe that the primary recipient responds 56.4% of the time when looking at the speaker (and hence turns to the secondary recipient), but 73.5% of the time when looking at the secondary recipient (and hence turns back to the speaker). This difference is significant (χ² = 8.65, p = 0.003). The difference is not observed for the secondary recipient, who is equally likely to respond regardless of entry orientation.

When looking at head nods these patterns are not observed: there is no significant difference between response rates for the two entry orientations for either participant.

5. Discussion

Given that we see interactionally significant and statistically reliable patterns emerging, it is clear that our simple approach is effective: it is indexing aspects of the participants' behaviour that are meaningful for the interaction. To a certain extent it does not matter whether the analysis really identified 'nods', or whether the orientation measure was accurately identifying focus of attention. What matters is that it indexes movements that are clearly visible to the interlocutors and that have a systematic and marked effect on the interaction. We may be able to refine our technique further. In our analysis of head nodding we make use of a single marker's movement in the vertical plane; using the head as a complete segment with rotation data may remove the need for some of the signal processing which excludes unwanted body movements. We also make some assumptions as to the frequencies that should be identified in the head motion signals. A measure which allows us to tease apart the frequencies of movement within the signals as the interaction unfolds would lead to more informed decisions when selecting parameters. A technique which is being used on human interaction data with interesting results is cross-wavelet analysis (see Issartel et al. (2006) for details).

We also need to consider an analysis of hand movement, as we have shown that it is a highly significant cue for the management of dialogue. This movement is less constrained than the movement of the head, and is more susceptible to missing data due to occlusion in the tracking environment, so any approach to automated analysis must be very robust. Our current work tries to integrate this hand movement data by using the speed of the hands to create discrete events (i.e. periods of motion) and also for regression analysis. As we are examining interaction data rather than narratives, it is necessary to understand how people move together. As Morency et al. (2008) suggest, a listener's movements are influenced by the actions of the other participants in the dialogue. One speculation is that people may coordinate their motion (e.g. jointly coordinate their head movements). One approach to this possibility would be to examine cross-correlations of head movement and speech between the individuals in the group (see Lavelle et al. (2009) for similar work on interpersonal coordination involving a patient with a diagnosis of schizophrenia).
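As an illustration of that suggestion, the sketch below computes a simple normalised cross-correlation between two participants' head movement signals over a range of lags; it is an example of the proposed direction of analysis rather than the authors' implementation.

```python
# Minimal sketch of a lagged cross-correlation between two participants' head
# movement signals (e.g. frame-to-frame vertical speed). Illustrative only.
import numpy as np

FPS = 60

def cross_correlation(sig_a, sig_b, max_lag_s=1.0):
    """Normalised cross-correlation of two equal-length signals over +/- max_lag_s."""
    a = (np.asarray(sig_a) - np.mean(sig_a)) / (np.std(sig_a) + 1e-9)
    b = (np.asarray(sig_b) - np.mean(sig_b)) / (np.std(sig_b) + 1e-9)
    max_lag = int(max_lag_s * FPS)
    lags = range(-max_lag, max_lag + 1)
    corr = []
    for lag in lags:
        if lag < 0:
            corr.append(np.mean(a[:lag] * b[-lag:]))
        elif lag > 0:
            corr.append(np.mean(a[lag:] * b[:-lag]))
        else:
            corr.append(np.mean(a * b))
    # The location of the peak indicates who leads whom, and by how much.
    return list(lags), corr
```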

6. Conclusion

We have made initial steps towards an analysis of multi-party non-verbal behaviour. By using real face-to-face dialogues to measure listener responses we have respected the interactive nature of dialogue. The technique that we have employed gives us comparable measures, so that we are able to see patterns that exist naturally within interaction. These patterns now require experimental analysis to determine when they occur and when they do not. A promising approach for this is to use virtual avatars as an experimental platform.

The findings are significant not only for understanding the processes of communication, but also for systems which make use of computer vision techniques (such as automated meeting analysis). Techniques are already available for gaining access to the movement data of a participant without the need for motion capture equipment. Our findings can guide these techniques so that they are applied to interactionally significant phenomena.

7. References

Michael Argyle. 1975. Bodily Communication. Methuen & Co. Ltd, Bristol.

Stuart A. Battersby and Patrick G. T. Healey. submitted. Head and Hand Movements in the Orchestration of Dialogue. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society.

Janet Beavin Bavelas, Nicole Chovil, Douglas A. Lawrie, and Allan Wade. 1992. Interactive Gestures. Discourse Processes, 15(4):469–489.

Matthew Frampton, Raquel Fernández, Patrick Ehlen, Mario Christoudias, Trevor Darrell, and Stanley Peters. 2009. Who is "You"? Combining Linguistic and Gaze Features to Resolve Second-Person References in Dialogue. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 273–281, Athens, Greece, March. Association for Computational Linguistics.

Nobuhiro Furuyama. 2000. Gestural interaction between the instructor and learner in origami instruction. In Language and Gesture. Cambridge University Press.

Charles Goodwin. 1979. The interactive construction of a sentence in natural conversation. In G. Psathas, editor, Everyday Language: Studies in Ethnomethodology, pages 97–121. Irvington Publishers.

Marianne Gullberg. 2003. Eye movements and gesture in human face-to-face interaction. In J. Hyönä, R. Radach, and H. Deubel, editors, The Mind's Eye: Cognitive and Applied Aspects of Eye Movements, pages 685–703. Elsevier, Oxford.

Johann Issartel, Ludovic Marin, Thomas Bardainne, Philippe Gaillot, and Marielle Cadopi. 2006. A Practical Guide to Time-Frequency Analysis in the Study of Human Motor Behavior: The Contribution of Wavelet Transform. Journal of Motor Behavior, 38(2):139–159.

Kristiina Jokinen. 2010. Gestures and Synchronous Communication Management. Springer.

Mary Lavelle, Rosemarie McCabe, Patrick G. T. Healey, Christopher Frauenberger, and Fabrizio Smeraldi. 2009. Interpersonal Coordination in Schizophrenia: The first 3D analysis of social interaction in schizophrenia. In Proceedings of the International Society for the Psychological Treatments of Schizophrenias and Other Psychoses Conference 2009, Copenhagen.

Jack M. Loomis, Jonathan W. Kelly, Matthias Pusch, Jeremy N. Bailenson, and Andrew C. Beall. 2008. Psychophysics of perceiving eye and head direction with peripheral vision: Implications for the dynamics of eye gaze behaviour. Perception, 37:1443–1457.

Louis-Philippe Morency, Iwan de Kok, and Jonathan Gratch. 2008. Context-based Recognition during Human Interactions: Automatic Feature Selection and Encoding Dictionary. In ICMI 08, Chania, Crete.

Matthew Purver, Raquel Fernández, Matthew Frampton, and Stanley Peters. 2009. Cascaded Lexicalised Classifiers for Second-Person Reference Resolution. In Proceedings of the 10th Annual SIGDIAL Meeting on Discourse and Dialogue (SIGDIAL 2009 Conference), pages 306–309, London, UK, September. Association for Computational Linguistics.
