Behavior Research Methods 2008, 40 (3), 830-839 doi: 10.3758/BRM.40.3.830
Understanding intention from minimal displays of human activity

PHIL MCALEER AND FRANK E. POLLICK
University of Glasgow, Glasgow, Scotland

The impression of animacy from the motion of simple shapes typically relies on synthetically defined motion patterns, resulting in pseudorepresentations of human movement. Thus, it is unclear how these synthetic motions relate to actual biological agents. To clarify this relationship, we introduce a novel approach that uses video processing to reduce full-video displays of human interactions to animacy displays, thus creating animate shapes whose motions are directly derived from human actions. Furthermore, this technique facilitates the comparison of interactions in animacy displays from different viewpoints—an area that has yet to be researched. We report two experiments in which animacy displays were created showing six dyadic interactions from two viewpoints, incorporating cues altering the quantity of the visual information available. With a six-alternative forced choice task, results indicate that animacy displays can be created via this naturalistic technique and reveal a previously unreported advantage for viewing intentional motion from an overhead viewpoint.
If we were to open our eyes to any crowded scene, it would be possible to understand the actions and intentions of those around us. However, the effortless nature with which we reach this understanding belies the complex processing of visual information that appears necessary to perform this task. In order to begin the study of the underlying mechanisms of action understanding, it is useful to have simplified scenarios that provide a tractable amount of information. One source of these scenarios has been provided by what are known as animacy displays. Animacy displays involve the motion of simple geometric shapes that evoke spontaneous attributions of life and social intent. In their classic work, Heider and Simmel (1944) investigated how observers would describe the movement of such shapes (i.e., a large triangle, a small triangle, and a small circle) as they navigated around a large open square. Their findings showed that people ascribe meaning to the movements, personifying the shapes and attributing emotions and goals. Further work focused on the temporal and spatial dynamics of these displays (Bassili, 1976), with various authors proposing that these properties, rather than other physical qualities of the displays, are key to the attribution of intention (Bloom & Veres, 1999; Scholl & Tremoulet, 2000; Tremoulet & Feldman, 2000; Zacks, 2004). Despite research investigating the attribution of intention in animacy displays via spatial and temporal dynamics, there have been few studies that point to specific categories of actions that facilitate the ascription of intention. Research that has focused on this aspect is that of Blythe, Todd, and Miller (1999) and Barrett, Todd, Miller, and Blythe (2005). Blythe et al. established six categories of
intentional motion that they claimed were basic to all animacy displays: chasing, evading, courting, being courted, fighting, and playing. The authors argued that these basic categories make up a large proportion of human motion and are fundamental to the development and survival of people. Blythe et al. created representations of these intentions by instructing participants to control animated ants on a computer screen and to move them as though the ants were performing the intentions. On showing these displays to a second population of participants, the authors found that the observers were able to distinguish one intention from another at levels above chance. Blythe et al. found only minimal confusion between displays, with a small bias toward calling displays play. The authors claimed that this bias was due to an underlying belief that play is the more common intention and that we learn other intentions via play. Barrett et al. (2005), using a modified set of intentions—chasing, courting, fighting, following, guarding, and playing—showed results similar to those in Blythe et al., obtaining high accuracy in judgments of intention. The authors also showed a response bias for play and again showed slight confusions—for example, between following and chasing. In the present study, we propose to continue the investigation of the attribution of social intention in animacy displays, while presenting a new practice for the creation of animacy displays that permits sophisticated manipulation of the available visual information. Typically, the production of animacy displays has relied on a variety of synthetic means ranging from clever animations and parametric variations of simple motion patterns to complex
computer game scenarios similar to those in Blythe et al. (1999). These methods result in pseudorepresentations of human movement, and it is unclear how these synthetic motions relate to actual biological agents. To clarify this relationship, we introduce a novel approach involving a subtractive, automatic technique for progressively reducing video recordings of human actions to animacy displays with geometric shapes, the motions of which are directly derived from human actions. Furthermore, this technique has the added advantage of facilitating the comparison of the same action/interaction, in animacy displays, from different viewpoints, and with varying levels of visual information. Although previous research has made use of more than one viewpoint to present stimuli to participants, there is no empirical research that has directly compared the ability to judge and perceive intentions in animacy displays from different viewpoints. In animacy research, a few studies have made use of displays from a side view (Csibra, Gergely, Bíró, Koós, & Brockbank, 1999; Gergely, Nádasdy, Csibra, & Bíró, 1995; Kuhlmeier, Wynn, & Bloom, 2003); however, the majority of studies have incorporated animacy displays viewed from overhead (Bassili, 1976; Blakemore et al., 2003; Bloom & Veres, 1999; Blythe et al., 1999; Castelli, Frith, Happé, & Frith, 2002; Gelman, Durgin, & Kaufman, 1995; Heider & Simmel, 1944; Tremoulet & Feldman, 2000), with other studies leaving the viewpoint ambiguous (Szego & Rutherford, 2007). It would appear that no previous research has made a direct comparison of ability to judge and perceive intentions across these two viewpoints. In this report, we present two experiments that made use of the subtractive method for the creation of the animacy displays discussed. In the first experiment, we compared the ability of participants to discriminate between intentions in both the original video displays and the resultant animacy displays. Participants were asked to discriminate between the six intentions from Barrett et al. (2005): chasing, courting, fighting, following, guarding, and playing. Furthermore, we compared the ability to discriminate between these intentions, in both the original video and the animacy displays, across the two viewpoints of overhead and side view. It was expected that the participants would be able to successfully discriminate these intentions in all the conditions; however, the ability to do so would decrease in the animacy displays, in accordance with the decreased visual information available. No predictions were made regarding the change in ability to discriminate intentions across viewpoints. In the second experiment, we investigated the effect of increasing the available visual information in the animacy displays on the attribution of intention, by incorporating occlusion and contextual cues. We hypothesized that the participants would distinguish between the six intentions of chasing, courting, fighting, following, guarding, and playing at levels above chance, in displays depicting either an overhead or a side viewpoint. Furthermore, we predicted that increasing the available visual information would improve this ability. Results consistent with these hypotheses would validate this new procedure for the creation of animacy displays and would improve our knowledge of the role of
viewpoint and other visual cues in the understanding and attribution of social intention.

EXPERIMENT 1

The aim of this experiment was to introduce a new technique for the creation of animacy displays from video displays of human motion and to show that this is a valid technique for the creation of such displays. The relevance of this new method is that it bridges the gap between the attribution of intentional motion by people on the basis of observations of pseudorepresentations of human movement and the attribution of intention to displays of actual human activity. In order to test the validity of this technique, we explored observers' abilities to recognize and categorize the intentions that are being portrayed by actors in video displays and in the animacy displays derived from these videos. It was hypothesized that the participants would be able to determine intentions accurately in both display conditions: video and animacy displays. Furthermore, we made use of an advantage of this subtractive technique to explore the perception of intention from two differing viewpoints: overhead and side view.

Method

Participants. Sixteen participants took part in the experiment. All were naive as to the purpose of the study, had normal or corrected-to-normal vision, and were paid for their participation. The rights of all the participants were protected, and all procedures conformed to the British Psychological Society code of conduct and the standards of the University of Glasgow Faculty of Information and Mathematical Sciences Ethics Committee.

Stimuli. Two actors performed five examples of the six social interactions stipulated by Barrett et al. (2005): chasing, courting (see note 1), fighting, following, guarding, and playing. For each intention, the actors were provided with basic instructions as to how the actions were to be performed. The actors were filmed on a 5-m square stage with a black floor and three black walls, one on either side and one at the back. The actors were dressed in white body suits, including hoods. Two video cameras were used to film the scenes: One camera, a Sony DCR-TRV950E with a 12×, f/1.6 optical zoom lens (3.6–43.2 mm), fitted with a 0.3× magnification wide-angle lens adaptor, was positioned directly above the center of the stage at a height of 6 m and captured the entire stage; the second camera, a JVC GR-DV700EK with a 10×, f/1.2 optical zoom lens (3.8–38 mm), was positioned on a tripod approximately 5 m from the center of the front edge of the stage and captured the horizontal span of the stage. The footage was extracted from the cameras and clipped into the relevant segments, using Adobe Premiere Pro 1.5. These segments were converted into black and white, to minimize any lighting effects in the original displays. To create the animacy displays, first the coordinates for each actor, representing the center of mass of the silhouette of the actor, were extracted using the EyesWeb open platform for multimedia analysis (www.eyesweb.org) (see the Appendix; Camurri, Krumhansl, Mazzarino, & Volpe, 2004; McAleer et al., 2004). MATLAB (The MathWorks, Natick, MA) was then used to apply a digital filter to the coordinates, using a dual pass of a fourth-order Butterworth low-pass filter with a cutoff frequency of 0.8 Hz. Filtering effectively eliminated the vertical component of gait motion from the side view and, thus, made the side and overhead views more equivalent.
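As an illustration of this filtering step, the following MATLAB sketch applies the same dual-pass, fourth-order Butterworth low-pass filter (0.8-Hz cutoff at the 25-fps frame rate) to a matrix of tracked coordinates. It is our reconstruction of the procedure described above, not the original analysis code; it assumes the Signal Processing Toolbox, and the synthetic trajectory is only a placeholder for the EyesWeb output.

% Dual-pass (zero-phase) low-pass filtering of tracked coordinates.
% Placeholder input: a noisy T-by-2 matrix of (x, y) positions for one
% actor, sampled at the 25-fps frame rate of the video footage.
fps = 25;
t = (0:1/fps:33)';                          % about 33 sec of samples
rawXY = [cos(0.2*t) sin(0.2*t)] + 0.05 * randn(numel(t), 2);

cutoff  = 0.8;                              % low-pass cutoff frequency (Hz)
nyquist = fps / 2;
[b, a]  = butter(4, cutoff / nyquist);      % fourth-order Butterworth low-pass

% filtfilt runs the filter forward and then backward (a dual pass),
% giving zero phase distortion; it filters each column independently.
smoothXY = filtfilt(b, a, rawXY);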
MATLAB was used to create the animacy displays from the filtered coordinates, depicting white circles, representing the actors, on a uniform gray background. One display from each intention was randomly chosen as an experimental stimulus. An example of a single frame from an original video display and its resultant animacy display, for both viewpoints, can be seen in Figure 1.
Figure 1. Examples of the original footage, on the left, and a representation of the displays after being processed through EyesWeb and MATLAB, on the right.
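The rendering stage shown on the right of Figure 1 can be sketched as follows. This is our illustration rather than the original stimulus code: it writes an AVI with VideoWriter instead of the QuickTime/ShowTime route used in the study, and the frame size, circle radius, and placeholder trajectories are arbitrary values.

% Render two filtered trajectories as white circles on a uniform gray
% background, one frame per coordinate sample, at 25 fps.
imgW = 640; imgH = 480; radius = 8; fps = 25;

nFrames  = 200;                                          % placeholder length
actor1XY = [linspace(100, 500, nFrames)' repmat(200, nFrames, 1)];
actor2XY = [linspace(500, 100, nFrames)' repmat(300, nFrames, 1)];

[X, Y] = meshgrid(1:imgW, 1:imgH);
v = VideoWriter('animacy_display.avi');
v.FrameRate = fps;
open(v);
for t = 1:nFrames
    frame = uint8(128 * ones(imgH, imgW));               % uniform gray
    for xy = [actor1XY(t, :); actor2XY(t, :)]'            % one column per actor
        mask = (X - xy(1)).^2 + (Y - xy(2)).^2 <= radius^2;
        frame(mask) = 255;                                % white circle
    end
    writeVideo(v, repmat(frame, [1 1 3]));                % write as an RGB frame
end
close(v);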
The filtered coordinates were rendered as QuickTime movies at a frame rate of 25 fps, matching the recording rate of the original footage, with a mean duration of 33 sec. Online examples of the experimental displays can be found at www.psy.gla.ac.uk/~phil/movies.html, with the coordinates available at www.psy.gla.ac.uk/~phil/co-ordinates.html.

Procedure. The experiment was run on a G4 Apple Macintosh (OS 9.2) using a combination of MATLAB 5, the Psychophysics Toolbox, Version 2.5 (Brainard, 1997; Pelli, 1997), and ShowTime (Watson & Hu, 1999). The experiment consisted of 48 trials split into four blocks of 12 trials, with 3 practice trials to familiarize the participants with the task. Using a 6 (intention) × 2 (display condition) × 2 (viewpoint) design, the participants saw each intention twice, at both viewpoints, for the two experimental display conditions. After each display, the participants selected the intention that they thought had been portrayed in the display, via a six-alternative forced choice (6AFC) task. The participants were randomly allocated into one of two groups, in which they would see either all the animacy display trials first or all the video display trials first. Therefore, the presentation of display condition (animacy vs. video) was pseudorandomized, with all trials within blocks being completely randomized.
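The design can be made concrete with a short sketch that builds the 6 (intention) × 2 (viewpoint) trial set for each display condition, blocks the display conditions, and randomizes trials within blocks. This is our own reconstruction of the procedure described above (the variable names, including the assumed participantID counterbalancing switch, are ours), not the original Psychophysics Toolbox script.

% Build the 48-trial list: 6 intentions x 2 viewpoints x 2 repetitions per
% display condition (video, animacy), with display condition blocked, its
% order counterbalanced across participants, and trials randomized within
% blocks of 12.
participantID = 1;                                     % placeholder participant number
nInt = 6; nView = 2; nRep = 2; blockSize = 12;

[intIdx, viewIdx] = ndgrid(1:nInt, 1:nView);
condTrials = repmat([intIdx(:) viewIdx(:)], nRep, 1);  % 24 trials per display condition

dispOrder = {'video', 'animacy'};
if mod(participantID, 2) == 0                          % alternate block order
    dispOrder = fliplr(dispOrder);
end

blocks = {};
for d = 1:numel(dispOrder)
    shuffled = condTrials(randperm(size(condTrials, 1)), :);
    for b = 1:size(shuffled, 1) / blockSize            % two blocks of 12 per condition
        rows = (b - 1) * blockSize + (1:blockSize);
        blocks{end + 1} = struct('display', dispOrder{d}, ...
                                 'trials',  shuffled(rows, :)); %#ok<AGROW>
    end
end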
Results and Discussion

We investigated people's ability to discriminate the intentions of actors in video displays and in the equivalent animacy displays derived from the original footage. We proposed that the participants would be able to successfully discriminate between intentions in the two display conditions—animacy and video—but that the ability to do so would decrease in the animacy condition, due to the removal of visual information. We also tested the ability to discriminate intentions across the two viewpoints: overhead and side view. No prediction was made as to the effect of viewpoint.

Initially, a four-way repeated measures ANOVA was run using order (video first or animacy first) as a between factor and, as within factors, viewpoint (overhead or side), display condition (video or animacy), and intention (chasing, fighting, flirting, following, guarding, or playing). No overall main effect of order was found (62.5% vs. 58.3% correct for viewing video or animacy displays first, respectively). The results were therefore collapsed across order, and a three-way repeated measures ANOVA was run, using only the within factors of viewpoint, display condition, and intention. The overall ability to differentiate intentions for both display conditions, across the two viewpoints, can be seen in Figure 2. In addition, the overall confusion matrices for each intention, at both viewpoints, can be seen in Table 1 for the video displays and in Table 2 for the animacy displays.

The participants were clearly able to differentiate the intentions at levels above chance (16.67% correct for a 6AFC task) at both viewpoints across the two display conditions of video and animacy. The results suggest that the ability to discriminate intentions is better when the original video footage is viewed than when the derived animacy displays are viewed, and also when overhead displays are viewed than when side view displays are viewed. The ANOVA revealed a significant main effect of viewpoint [F(1,15) = 17.55, p < .01], which indicated that intention discrimination was better for overhead displays (67.4%) than for side view displays (53.4%). A significant main effect of display condition [F(1,15) = 32.95, p < .01] was found, with Fisher's LSD post hoc comparison showing that the participants were better at judging intention in the video displays (72.4%) than in the animacy displays (48.4%). Furthermore, a significant main effect of intention was shown [F(5,75) = 7.86, p < .05]. Fisher's LSD revealed that the participants were significantly better at categorizing the intentions of chasing (68%), fighting (61.7%), flirting (69.5%), and following (75.8%) than at categorizing the intentions of guarding (44.5%) and playing (43%) and were better at categorizing following than at categorizing fighting. The ANOVA revealed an interaction between viewpoint and intention [F(5,75) = 2.63, p < .05].

Figure 2. Experimental results showing the ability of observers to differentiate intentions across the two experimental display conditions (video and animacy) at both viewpoints. Error bars indicate standard errors, and the bold dashed line indicates the chance level of .1667.
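The quantities plotted in Figure 2 follow directly from per-trial accuracy. A minimal sketch, assuming a hypothetical participants-by-trials accuracy matrix for one cell of the design, is:

% Group mean, standard error, and 6AFC chance level for one condition.
acc = rand(16, 12) < 0.7;                    % placeholder: 16 participants x 12 trials

pCorrect  = mean(acc, 2);                    % per-participant proportion correct
groupMean = mean(pCorrect);
groupSEM  = std(pCorrect) / sqrt(numel(pCorrect));
chance    = 1 / 6;                           % 6AFC chance level, approx. .1667

fprintf('Proportion correct = %.3f (SEM = %.3f), chance = %.4f\n', ...
        groupMean, groupSEM, chance);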
Table 1
Confusion Matrices for Both Viewpoints in the Video Displays Showing Proportion Correct for Each Intention (n = 16)

Overhead View
                              Response
Intention    Chasing  Fighting  Flirting  Following  Guarding  Playing
Chasing        .97      .03       .00        .00        .00       .00
Fighting       .19      .75       .00        .00        .06       .00
Flirting       .00      .00       .69        .03        .28       .00
Following      .00      .00       .00        .97        .00       .03
Guarding       .22      .03       .00        .00        .59       .16
Playing        .00      .00       .00        .34        .03       .63

Side View
                              Response
Intention    Chasing  Fighting  Flirting  Following  Guarding  Playing
Chasing        .72      .06       .00        .00        .22       .00
Fighting       .31      .53       .00        .00        .16       .00
Flirting       .00      .00       .84        .00        .16       .00
Following      .00      .00       .00        .97        .00       .03
Guarding       .13      .00       .00        .03        .63       .22
Playing        .00      .00       .00        .56        .03       .41
A Tukey HSD post hoc analysis showed that the participants were better at categorizing following from the overhead view (89.1%) than at categorizing fighting (43.8%), guarding (42.2%), and playing (37.5%) from the side view and guarding (46.9%) and playing (48.4%) from overhead. The participants were better at categorizing the overhead fighting displays (79.7%) than at categorizing the side views of fighting and playing and at categorizing the overhead chasing displays (73.4%) and side view flirting displays (71.9%) than at categorizing the side view play displays. The ANOVA also revealed a significant interaction between display condition and intention [F(5,75) = 3.51, p < .05]. Again, Tukey's HSD was used, revealing that the video displays of following (96.8%) were recognized better than the video displays of guarding (60.9%) and playing (51.6%) and than all animacy displays except flirting (62.3%): chasing (51.5%), fighting (59.4%), following (54.7%), guarding (28.1%), and playing (34.4%). It was also shown that the chasing video displays (84.4%) were recognized significantly better than the animacy displays of guarding and playing and that the fighting and flirting video displays
(64.1% and 76.6%, respectively) were recognized significantly better than the animacy guarding display. Finally, a three-way interaction was found between viewpoint, display condition, and intention [F(5,75) = 4.09, p < .05]. Tukey's HSD revealed numerous significant differences, showing that the video display of following (from both viewpoints), the overhead video chasing display, the side view video flirting display, and the overhead animacy displays of fighting and following were all categorized significantly better than the majority of the other displays.

The results from the video display conditions, for both viewpoints, show little confusion, since all the intentions were categorized with a large degree of success, the lowest being playing from the side view (41%), although this was still well above chance (16.67%). The only noticeable confusions in the overhead displays were a slight tendency to confuse flirting as guarding, guarding as playing, and playing as following.
Table 2
Confusion Matrices for Both Viewpoints in the Animacy Displays Showing Proportion Correct for Each Intention (n = 16)

Overhead View
                              Response
Intention    Chasing  Fighting  Flirting  Following  Guarding  Playing
Chasing        .50      .19       .06        .00        .25       .00
Fighting       .09      .84       .03        .00        .03       .00
Flirting       .00      .16       .66        .00        .19       .00
Following      .00      .00       .00        .81        .03       .16
Guarding       .13      .25       .13        .00        .34       .16
Playing        .00      .00       .00        .66        .00       .34

Side View
                              Response
Intention    Chasing  Fighting  Flirting  Following  Guarding  Playing
Chasing        .53      .06       .13        .00        .28       .00
Fighting       .28      .34       .03        .00        .34       .00
Flirting       .03      .00       .59        .00        .38       .00
Following      .06      .06       .09        .28        .34       .16
Guarding       .56      .09       .09        .00        .22       .03
Playing        .22      .00       .06        .25        .13       .34
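Confusion matrices of the kind shown in Tables 1 and 2 can be tallied directly from the trial records; the sketch below is our own illustration, assuming integer intention codes 1-6 (in the same order as the table rows) for the displayed and the chosen intention on each trial.

% Build a 6 x 6 confusion matrix of response proportions from trial data.
nInt = 6;
shownIntention = randi(nInt, 384, 1);               % placeholder stimulus codes
respIntention  = randi(nInt, 384, 1);               % placeholder response codes

counts = accumarray([shownIntention respIntention], 1, [nInt nInt]);

% Convert counts to proportions within each displayed intention (row);
% the diagonal then holds the proportion correct for each intention.
confusion   = counts ./ repmat(sum(counts, 2), 1, nInt);
propCorrect = diag(confusion);                      % compare against chance, 1/6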
In the side view displays, there was an increased degree of confusion, in comparison with the overhead viewpoint displays, although, again, all the intentions were categorized well above chance level. Of note, confusions included categorizing chasing as guarding, fighting as chasing, playing as following, and again, guarding as playing.

In contrast to the results from the video displays, we saw larger degrees of confusion in the matrices for the animacy displays, which would coincide with an overall reduced ability to perceive and categorize the intentions appropriately in this display condition. The overhead animacy results showed confusion particularly between playing and following and between guarding and all the other displays except following. In the side view animacy displays, as in the video displays, the overall level of confusion increased, again supporting the reduced ability to correctly categorize intentions in this condition. Large confusions existed between chasing and guarding, between fighting, chasing, and guarding, between flirting and guarding, and between playing, chasing, and following. Overall, in the animacy displays, we witnessed a bias toward calling displays guarding, especially in the side view displays, which is inconsistent with biases toward playing, as reported by Blythe et al. (1999) and Barrett et al. (2005). No other systematic biases were found.

The outcomes of this first experiment made it evident that people could accurately attribute the correct intention to the movements of human actors in video displays and to the same movements in displays in which the visual information was markedly reduced—that is, depicted via two animate circles. These results would appear to support the use of the subtractive technique, introduced above, that allows for the creation of animacy displays from tracking actual human movement. The importance of this method is that these animacy displays can, in turn, be analyzed to provide insight as to the motion and kinematic parameters that are used for the determination of intention by observers of human motion. That said, overall categorization of intentions could perhaps have been better, in both display conditions. Regarding the video display conditions, the varying degrees of confusion reveal in which displays the intended intention is most salient and in which displays the intent can be confused for another. An explanation would be that the movements in the displays with large confusions contained elements that made the overall intention ambiguous. Although these displays could have been removed via careful piloting, it must be noted that they also provided a first step toward discovering the boundaries between these intentions, in terms of motion properties and movement trajectories.

The confusion matrices showed that the ability to categorize intentions was reduced when we diminished the available visual information. We found a drop in overall ability to perceive the appropriate intention in the animacy displays, as compared with the video displays, and this ability was reduced further when we compared side view animacy displays with overhead animacy displays. This reduction may have been due to difficulties that the participants had in adopting the appropriate viewpoint or to inaccurate tracking of the white circles in these animate displays. This tracking problem would, indeed, be most
problematic in the side view animacy displays and may point to the relevance of establishing ordinal depth for successful intention categorization. We therefore introduced a second experiment in which we sought to address these proposed causes of reduced performance in establishing intentions in animacy displays. In the following experiment, we explored cues to occlusion and context, which increased aspects of the visual information, in order to test whether reintroducing some information could improve the ability to perceive the correct intention in animacy displays.

EXPERIMENT 2

Experiment 1 introduced a technique for the extraction of coordinates from video displays of human movement. From these coordinates, we were able to create animacy displays that were direct representations of the original footage, although with largely diminished visual information available. With these displays, we examined observers' abilities to correctly recognize and categorize the intentions displayed in both the video displays and the animacy displays. Furthermore, we carried out a direct comparison of the same ability across the two viewpoints of overhead and side views. Although both of these viewpoints have been readily used in the animacy literature, to the best of our knowledge, this is the first time that such a comparison has been investigated. It was shown that the participants were able to successfully categorize intentions in both video displays and animacy displays across the two available viewpoints. The results indicated that the ability to categorize intention was better when the video displays were shown. This would be expected given that the animacy displays contained only a subset of the information in the videos. The experiment also indicated that in regard to animacy displays, categorization was better for displays shown from overhead, with possible explanations revolving around the establishing of the appropriate viewpoint stance and of the ordinal depth of the protagonists.

The purpose of this second experiment was to continue to explore the attribution of intention to animacy displays and to investigate the effect of the addition of cues designed to increase the visual information available. Two cues were explored: (1) an occlusion cue and (2) a contextual cue. The results from Experiment 1 indicated that performance was better with overhead animacy displays, as compared with side view displays, and it is suggested that this may have been due to the difficulty in the task of tracking two identical white circles and determining which one was which. Coloring one of these circles black should ease this problem, allowing the observers to quickly separate the circles. Furthermore, although this cue would appear to have no influence on the overhead displays, since, in theory, there should be no occlusion, it does provide a cue to identity that may have some influence on the ability to categorize intention in overhead displays. The second cue, context, has relevance in both viewpoints.
Figure 3. Four experimental conditions introducing additional visual cues. The top panel shows the side view displays, and the bottom panel shows the overhead displays. The four conditions are (A) no occlusion, no context (NONC); (B) no occlusion, context present (NOC); (C) occlusion present, no context (ONC); and (D) occlusion and context present (OC).
In the previous experiment, the animacy displays showed two white circles on a uniform gray background, and it is possible that the participants may have had trouble distinguishing which viewpoint to take on certain displays. This, in turn, may have hampered the ability to correctly perceive intentions. The contextual cue in this experiment added boundaries to appropriately indicate whether the observer should view the display from the side or from above. It was hypothesized that by adding cues of occlusion/identity and context to the basic animacy displays in Experiment 1, we would improve the ability of the participants to correctly categorize the intentions displayed in these animacy displays derived from scenes of human activity.

Method

Participants. Seventeen new participants took part in the experiment. All were naive as to the purpose of the study, had normal or corrected-to-normal vision, and were paid for their participation. The rights of all the participants were protected, and all procedures conformed to the British Psychological Society code of conduct and to the standards of the University of Glasgow Faculty of Information and Mathematical Sciences Ethics Committee.

Stimuli. The starting stimuli were the six overhead and side view video displays that were created in Experiment 1. The experimental stimuli—that is, animacy displays showing an overhead view and a side view—were created in the same manner as in Experiment 1. In addition to these basic displays, cues were added via MATLAB. In order to investigate the effect of increasing the available visual information, two cues to viewpoint were added—a contextual cue involving boundaries (C), thus increasing the viewpoint saliency, and an occlusion cue (O) giving ordinal depth information to side view displays (see note 2)—resulting in four experimental display conditions, which are shown in Figure 3. These conditions included (1) no occlusion, no context (NONC), in which actors were represented as white circles on a gray background; (2) no occlusion, context present (NOC), which was the same as Condition 1, except that each display had a boundary surrounding it to suggest viewpoint (overhead displays had four surrounding white lines; side view displays had three surrounding lines, one beneath the circles and one on either side); (3) occlusion present, no context (ONC), in which one actor was depicted as a white circle and one actor was depicted as a black circle on a gray background; and (4) occlusion and context present (OC), which was the same as Condition 3, but with the boundaries from Condition 2.
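To make the four display conditions concrete, the sketch below adds the two cues to a single rendered frame: context as white boundary lines and occlusion/identity as one circle drawn in black, in front of the other. It is our illustration of the manipulation described above, not the original stimulus code, and the geometry values and positions are placeholders.

% Add the context (boundary) and occlusion/identity (black circle) cues.
imgW = 640; imgH = 480; radius = 8; lineW = 3;
[X, Y] = meshgrid(1:imgW, 1:imgH);
frame = uint8(128 * ones(imgH, imgW));          % uniform gray background
xy1 = [300 240]; xy2 = [310 250];               % placeholder actor positions

useContext = true; useOcclusion = true; sideView = true;

if useContext                                   % NOC and OC conditions
    if sideView                                 % one line below, one on either side
        frame(end-lineW+1:end, :) = 255;
        frame(:, 1:lineW) = 255;  frame(:, end-lineW+1:end) = 255;
    else                                        % overhead: four surrounding lines
        frame(1:lineW, :) = 255;  frame(end-lineW+1:end, :) = 255;
        frame(:, 1:lineW) = 255;  frame(:, end-lineW+1:end) = 255;
    end
end

mask1 = (X - xy1(1)).^2 + (Y - xy1(2)).^2 <= radius^2;
mask2 = (X - xy2(1)).^2 + (Y - xy2(2)).^2 <= radius^2;
frame(mask1) = 255;                             % first actor: white circle
if useOcclusion                                 % ONC and OC conditions
    frame(mask2) = 0;                           % second actor: black, drawn in front
else
    frame(mask2) = 255;                         % both actors white (NONC, NOC)
end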
Procedure. The experiment was run on a G4 Apple Macintosh (OS 9.2) using a combination of MATLAB 5, the Psychophysics Toolbox, Version 2.5 (Brainard, 1997; Pelli, 1997), and ShowTime (Watson & Hu, 1999). The experiment consisted of 96 trials split into four blocks of 24 trials, with 3 practice trials to familiarize the participants with the task. Using a 6 (intention) × 4 (display condition) × 2 (viewpoint) design, the participants saw each intention twice, at both viewpoints, for the four experimental conditions. After each display, using a 6AFC task, the participants selected the intention that they thought had been portrayed in the display.
Results and Discussion

We investigated people's ability to discriminate social intentions using animacy displays derived from human motion. We proposed that observers would be able to clearly differentiate the correct intention for each display from both a side view and an overhead view, at levels greater than chance. Furthermore, we examined whether increasing the visual information via cues of context and occlusion would increase participants' ability to differentiate intentions. The overall ability to differentiate intentions for each viewpoint, across the four experimental conditions, is summarized in Figure 4. Again, the overall confusion matrices for each intention, collapsed across all four display conditions, can be seen in Table 3 for the overhead and side view displays.

As in Experiment 1, the participants were clearly able to differentiate the intentions at levels above chance (16.67% correct for a 6AFC task) at both viewpoints across all display conditions. Furthermore, this ability was again improved when the displays showed an overhead view, rather than a side view. The ANOVA revealed a significant main effect of viewpoint [F(1,16) = 22.14, p < .01], indicating that the participants were better at judging intention in overhead displays (52.1%) than in side view displays (36.9%).
Figure 4. Experimental results showing the ability of observers to differentiate intentions across the four experimental conditions at both viewpoints. Error bars indicate standard errors, and the bold dashed line indicates the chance level of .1667.
A significant main effect of display condition was found [F(3,48) = 3.35, p < .05], with Fisher's LSD revealing that the participants were better at categorizing intentions when shown displays in Experimental Display Condition 4 (OC; 49.3%) than in the three other experimental display conditions (NONC, 44.1%; NOC, 41.2%; and ONC, 43.9%). A significant main effect of intention was found [F(5,80) = 3.33, p < .05], with Fisher's LSD revealing that the participants were significantly better at categorizing the intentions of flirting (54.5%) and following (58.5%) than at categorizing the intentions of guarding (35.7%), playing (37.1%), and chasing (38.2%). Finally, the ANOVA revealed an interaction between display condition and intention [F(15,240) = 2.37, p < .05]. A Tukey HSD post hoc analysis showed that the participants were better at categorizing the intention of following shown in Experimental Display Condition 4 (OC; 73.5%) than they were at categorizing guarding (29.4%), playing (36.7%), and chasing (30.9%) in Experimental Condition 1 (NONC); guarding (33.8%), playing (36.8%), and chasing
(41.2%) in Experimental Condition 2 (NOC); fighting (35.3%), guarding (38.2%), playing (39.7%), and chasing (30.8%) in Experimental Condition 3 (ONC); and finally, fighting (38.2%), guarding (41.2%), and playing (35.3%) in Experimental Condition 4 (OC). There were no other significant interactions.

The results from the overhead displays show confusion patterns similar to those described by Blythe et al. (1999) and Barrett et al. (2005). From Table 3, it can be seen that for overhead displays, we find clear confusions between fighting and playing and between chasing and following, whereas the displays of following, flirting, and guarding had less systematic confusions. The confusion patterns for the side view displays show greater overall confusion and, hence, lower percent correct scores.
Table 3
Confusion Matrices for Both Viewpoints Showing Proportion Correct for Each Intention Collapsed Across the Four Experimental Display Conditions (n = 17)

Overhead View
                              Response
Intention    Chasing  Fighting  Playing  Flirting  Guarding  Following
Chasing        .54      .01       .06       .01       .01        .38
Fighting       .03      .40       .32       .14       .10        .01
Playing        .13      .30       .43       .04       .05        .05
Flirting       .01      .01       .14       .59       .23        .02
Guarding       .01      .18       .14       .16       .50        .01
Following      .24      .01       .02       .04       .03        .67

Side View
                              Response
Intention    Chasing  Fighting  Playing  Flirting  Guarding  Following
Chasing        .22      .05       .21       .04       .11        .38
Fighting       .02      .46       .35       .10       .03        .03
Playing        .07      .44       .32       .10       .04        .04
Flirting       .01      .05       .26       .49       .17        .02
Guarding       .00      .33       .29       .15       .21        .01
Following      .05      .05       .20       .10       .12        .49
Chase–follow confusions and fight–play confusions similar to those for the overhead displays were found, although guarding and flirting were also confused with playing. Finally, most misidentifications were called playing, suggesting a bias toward this intention. This is in contrast to Experiment 1, in which no systematic bias toward playing was found; instead, a slight bias toward guarding was found.

The participants were, indeed, able to distinguish intentions at levels greater than chance in both viewpoints, with an overall advantage for displays showing the overhead (52.1%) versus the side (36.9%) view. This ability was consistent over all intentions, except for the fighting display, where the ability was similar for both viewpoints. It was also found that the context and occlusion cues added to increase visual information did not provide an overall boost to perceiving the correct intention. The advantage for perceiving intentions in the overhead displays in this experiment was in keeping with the findings of the first experiment, where a decreased ability was seen when side view displays were compared with overhead displays, in both the animacy and the video display conditions. The difference in the ability to categorize intentions between overhead and side views of the animacy displays in the present experiment is in line with the results of Experiment 1, with only a marginal overall reduction in the present experiment. This suggests that the pattern of the overhead view's being categorized better than the side view is robust, since two independent groups of participants consistently produced the same relationship.

It has been shown that adding contextual cues to indicate viewpoint has no overall effect on the ability to determine the correct intention. However, an inability to determine the veridical viewpoint of the displays could explain the lower performance in the side view displays, and the mere addition of the contextual cue gives no clear indication that participants do, indeed, perceive the displays from the appropriate viewpoint. Thus, a subsequent experiment was performed to directly examine participants' abilities at determining the viewpoint of each display (see note 3). The results showed that people were slightly better at determining the overhead viewpoint, as compared with the side view (83.6% and 80%, respectively), but that this difference was not statistically significant (p > .05). Therefore, it is improbable that the difference in performance between viewpoints is a result of people's not perceiving the displays with the veridical viewpoint.

That people are better at attributing intentions to displays shown from an overhead viewpoint is surprising, given that the side view is the more common viewpoint from which we observe the actions of others. One possible explanation is that in overhead displays, more information is available as to the location of one protagonist with respect to that of another. However, the occlusion cue provided information in the side view displays that gave ordinal depth to the circles, yet the results showed only a small, but statistically significant, increase in ability when this cue was combined with a boundary cue. This suggests that the additional information utilized in the overhead view exceeds the ordinal distance information provided from the occlusion cue. Further research might explore, in more precise terms, which particular depth or distance relationships are important for distinguishing an intention,
by adapting a procedure similar to that in Zacks (2004) or Blythe et al. (1999), where kinematic properties of the motion are correlated with the behavioral responses. This would allow discussion as to what motion properties, such as speed, acceleration, and distance, are most effective for the distinguishing of intentions. In turn, this would clarify the confusions witnessed in the presented studies, particularly in the displays in which the motion would appear to be ambiguous and fall near the boundaries of what determines different intentions.

GENERAL CONCLUSIONS

In conclusion, this report validates a new and more naturalistic method for stimulus creation in animacy research, which allows for a clearer understanding of the relationship between the movement of the geometric shapes and actual human motion. The results were consistent with previous findings suggesting that the perception of animacy can be powerfully determined by motion alone—even when the motions of real people are represented by single points. Moreover, this new technique enables exploration of viewpoint, a factor that appears to significantly influence the ability to understand such minimal displays of human activity. This research has bearing on our understanding of the visual cues involved in the attribution of social intention to the movement of others.

AUTHOR NOTE

This work was sponsored by EPSRC Grant GR/P02899/01. Thanks to Kerri Johnson for helpful advice. Thanks also to Antonio Camurri, Gualtiero Volpe, and Barbara Mazzarino of the InfoMus Lab, University of Genoa, Italy. Work related to the EU TMR MOSART (2002–2003) grant, held by A. Camurri, was the impetus for much of the programming of EyesWeb used in this report. Correspondence concerning this article should be addressed to P. McAleer, Department of Psychology, University of Glasgow, 58 Hillhead St., Room 408, Glasgow G12 8QB, Scotland (e-mail:
[email protected]). REFERENCES Barrett, H. C., Todd, P. M., Miller, G. F., & Blythe, P. W. (2005). Accurate judgments of intention from motion cues alone: A crosscultural study. Evolution & Human Behavior, 26, 313-331. Bassili, J. N. (1976). Temporal and spatial contingencies in the perception of social events. Journal of Personality & Social Psychology, 33, 680-685. Blakemore, S.-J., Boyer, P., Pachot-Clouard, M., Meltzoff, A., Segebarth, C., & Decety, J. (2003). The detection of contingency and animacy from simple animations in the human brain. Cerebral Cortex, 13, 837-844. Bloom, P., & Veres, C. (1999). The perceived intentionality of groups. Cognition, 71, B1-B9. Blythe, P. W., Todd, P. M., & Miller, G. F. (1999). How motion reveals intention: Categorizing social interactions. In G. Gigerenzer, P. M. Todd, & the ABC Research Group (Eds.), Simple heuristics that make us smart (pp. 257-285). New York: Oxford University Press. Brainard, D. H. (1997). The Psychophysics Toolbox. Spatial Vision, 10, 433-436. Camurri, A., De Poli, G., Leman, M., & Volpe, G. (2001). A multilayered conceptual framework for expressive gesture applications. In Proceedings of the International MOSART Workshop on Current Directions in Computer Music (pp. 29-34). Barcelona: Pompeu Fabra University, Audiovisual Institute. Camurri, A., Krumhansl, C. L., Mazzarino, B., & Volpe, G. (2004). An exploratory study of anticipating human movement in dance. In Proceed-
ings of the 2nd International Symposium on Measurement, Analysis and Modeling of Human Functions (pp. 57-60). Piscataway, NJ: IEEE Press. Camurri, A., Lagerlöf, I., & Volpe, G. (2003). Recognizing emotion from dance movement: Comparison of spectator recognition and automated techniques. International Journal of Human–Computer Studies, 59, 213-225. Camurri, A., Mazzarino, B., & Volpe, G. (2004). Analysis of expressive gestures: The EyesWeb expressive gesture processing library. In A. Camurri & G. Volpe (Eds.), Gesture-based communication in human–computer interaction (pp. 460-467). Berlin: Springer. Camurri, A., Mazzarino, B., Volpe, G., Morasso, P., Priano, F., & Re, C. (2003). Application of multimedia techniques in the physical rehabilitation of Parkinson’s patients. Journal of Visualization & Computer Animation, 14, 269-278. Camurri, A., Trocca, R., & Volpe, G. (2002). Interactive systems design: A KANSEI-based approach. In Proceedings of the 2002 Conference on New Interfaces for Musical Expression (pp. 155-162). Limerick, Ireland: University of Limerick, Department of Computer Science and Information Systems. Castelli, F., Frith, C., Happé, F., & Frith, U. (2002). Autism, Asperger syndrome and brain mechanisms for the attribution of mental states to animated shapes. Brain, 125, 1839-1849. Csibra, G., Gergely, G., Bíró, S., Koós, O., & Brockbank, M. (1999). Goal attribution without agency cues: The perception of “pure reason” in infancy. Cognition, 72, 237-267. Gelman, R., Durgin, F., & Kaufman, L. (1995). Distinguishing between animates and inanimates: Not by motion alone. In D. Sperber, D. Premack, & A. J. Premack (Eds.), Causal cognition: A multidisciplinary debate (pp. 150-184). Oxford: Oxford University Press, Clarendon Press. Gergely, G., Nádasdy, Z., Csibra, G., & Bíró, S. (1995). Taking the intentional stance at 12 months of age. Cognition, 56, 165-193. Heider, F., & Simmel, M. (1944). An experimental study of apparent behavior. American Journal of Psychology, 57, 243-259. Kuhlmeier, V., Wynn, K., & Bloom, P. (2003). Attribution of dispositional states by 12-month-olds. Psychological Science, 14, 402-408.
McAleer, P., Mazzarino, B., Volpe, G., Camurri, A., Smith, K., Paterson, S., & Pollick, F. E. (2004). Perceiving animacy and arousal in transformed displays of human interaction. In Proceedings of the 2nd International Symposium on Measurement, Analysis and Modeling of Human Functions (pp. 67-71). Piscataway, NJ: IEEE Press. Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spatial Vision, 10, 437-442. Scholl, B. J., & Tremoulet, P. D. (2000). Perceptual causality and animacy. Trends in Cognitive Sciences, 4, 299-309. Szego, P. A., & Rutherford, M. D. (2007). Actual and illusory differences in constant speed influence the perception of animacy similarly. Journal of Vision, 7(12): 5, 1-7. Timmers, R., Marolt, M., Camurri, A., & Volpe, G. (2006). Listeners' emotional engagement with performances of a Skriabin etude: An explorative case. Psychology of Music, 34, 481-510. Tremoulet, P. D., & Feldman, J. (2000). Perception of animacy from the motion of a single object. Perception, 29, 943-951. Watson, A. B., & Hu, J. (1999). ShowTime: A QuickTime-based infrastructure for vision research displays. Perception, 28(Suppl.), 45. Zacks, J. M. (2004). Using movement and intentions to understand simple events. Cognitive Science, 28, 979-1008.

NOTES

1. Courting was changed to flirting after discussion. It was felt that flirting was a term people would be more familiar with and could easily describe. 2. An occlusion cue provides ordinal depth information only in side view displays; however, it provides identity to the circles in the displays of both viewpoints. 3. Fifteen additional participants were recruited and shown the six intentions in all the experimental conditions. Response was a two-alternative forced choice of viewpoint, overhead versus side.
APPENDIX

The EyesWeb Open Platform for Multimedia Application and Motion Analysis

Fundamental to the creation of animate stimuli in this study was the extraction of positional coordinates of actors from video displays across time. To achieve this, we made use of the EyesWeb open platform for multimedia application and motion analysis (www.eyesweb.org), developed by the InfoMus Lab at the University of Genoa, Italy (Camurri, De Poli, Leman, & Volpe, 2001; Camurri, Trocca, & Volpe, 2002). EyesWeb was designed with the intention of creating a tool that could perform real-time analysis of full-body gestures and movements of one or more persons at a time. Of particular interest in the original design of the program was the extraction of high-level parameters of expressive intentions in a performance; for example, the developers wanted to create a system that was capable of distinguishing between two performances of the same movement that differed on the emotion expressed in the movement. Hitherto, the dominant use of EyesWeb has been to look at the success of musical performers or dancers in modern dance at expressing emotional content to audiences. The EyesWeb system has thus far been successfully implemented in numerous theater and museum exhibits (Camurri, Mazzarino, & Volpe, 2004), and research is now underway with regard to assessing the validity of this system in assisting the treatment of people with disorders of the motor system, such as Parkinson's disease (Camurri, Mazzarino, Volpe, et al., 2003). EyesWeb is a visual programming language that consists of a development area with an accessible set of libraries containing software modules that can be used repeatedly and can be interconnected with each other to create a processing patch, or series of modules, for motion analysis. A screen shot of an EyesWeb patch can be seen in Figure A1. The image shows an EyesWeb patch that is designed to track two actors in an excerpt of video footage. The video file is imported at Point A, with the background being removed at Point B. Between Points B and C, the coordinates are tracked for each actor, and a red square is drawn around the center of each actor, so that the people are now represented by these shapes, accurately following the movement. In the present experiments, we extracted the coordinates from the patch at the parts circled in red. We then made use of MATLAB (The MathWorks, Natick, MA) for greater control over the final animacy display. The remainder of the patch, between the red circles and Point C, was not used in the present research; however, it can be used for outputting video displays in EyesWeb. EyesWeb has previously been successfully implemented in studies of human motion. Timmers, Marolt, Camurri, and Volpe (2006) and Camurri, Lagerlöf, and Volpe (2003) employed EyesWeb to examine the emotional percepts of audiences of piano and dance performances, respectively. Camurri, Lagerlöf, and Volpe explored the cues that are important for the recognition of emotions in dance and compared the results for human spectators with those of automatic techniques for recognizing emotions. Finally, Camurri, Krumhansl, et al. (2004) performed a study looking at anticipating human movement in dance. They examined whether stopping a dance display at the midpoint of a segment would affect participants' ability to accurately judge where the dancer would finish
the movement. Another aspect of the study was to look at the saliency of the barycenter of the dancer as a cue to movement. The barycenter is described as the first-order moment of the 2-D silhouette of an actor and is an approximation of the center of mass. It is a means of obtaining a single measure of the combined locations of the torso and limbs. The results indicated that the participants' ability to judge the end position of the barycenter did improve. The authors concluded that this experiment validated the use of the barycenter as a point of information in movement analysis, suggesting that observers can make use of the barycenter to judge the motion of a person.
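Outside EyesWeb, a barycenter of this kind can be approximated in a few lines of MATLAB by background subtraction followed by the first-order moment (mean pixel position) of the thresholded silhouette. The sketch below is our own single-actor illustration, with placeholder images and an arbitrary threshold; the EyesWeb patch additionally segments and labels the two actors before tracking.

% Approximate an actor's barycenter as the centroid of the background-
% subtracted silhouette. Grayscale images in the range 0-1 are assumed.
background = zeros(240, 320);                        % placeholder empty stage
frame = background;
frame(100:180, 150:170) = 0.9;                       % placeholder actor silhouette

thresh = 0.15;                                       % arbitrary foreground threshold
silhouette = abs(frame - background) > thresh;       % binary silhouette mask

[rows, cols] = find(silhouette);
if isempty(rows)
    barycenter = [NaN NaN];                          % no foreground detected
else
    barycenter = [mean(cols) mean(rows)];            % (x, y) in pixel coordinates
end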
Figure A1. EyesWeb screen shot showing the interlinked software modules that complete a patch for the conversion of video footage of two actors into moving geometric shapes. The original video enters the process at Position A, where it is quickly converted to the silhouette image via background subtraction (Position B). From there, the number of silhouettes is counted and tracked at the barycenter, and the resultant image is created in which the actors are represented by squares (Position C). The points in the patch at which the coordinates for each actor are extracted, as in the present research, are circled. The section between the circles and Point C is not used in the present experiments but does show the output capabilities of EyesWeb.
In the present study, EyesWeb was used to extract the positional coordinates of actors across time from video recordings of their movements. The extracted coordinates represented the barycenter of the actors; thus, when the actors were viewed from the overhead position, this was approximately the center of the heads of the actors, and when they were viewed from the side view, this was approximately the stomachs of the actors. We then used these positional coordinates to create animacy displays in which the actors were represented by geometric shapes, via MATLAB. EyesWeb has shown itself to be a useful tool for the extraction of coordinates of human actors from video recordings of movements, particularly in the research presented. It continues to be developed, not only by its creators but also by its expanding user group. Although this report deals only with dyadic interactions, it is now possible to use EyesWeb to track multiple actors at one time. It is clear that future research will benefit from such an adaptable program and, in turn, will reveal the further potential of this program in the field of understanding intentions from human movement.

(Manuscript received November 23, 2007; accepted for publication February 1, 2008.)