Interactions of language and vision restrict" visual world" interpretations

27 downloads 0 Views 358KB Size Report
Martin J. Pickering1, Brian McElree*, & Simon Garrod3. 1 Dept of Psychology, University of Edinburgh, 7 George Square, Edinburgh, UK EH8 9JZ. * Dept of ...
Interactions of language and vision restrict "visual world" interpretations Martin J. Pickering1, Brian McElree*, & Simon Garrod3

1 Dept of Psychology, University of Edinburgh, 7 George Square, Edinburgh, UK EH8 9JZ
* Dept of Psychology, New York University, 6 Washington Place, NY, NY, USA 10003
3 Dept of Psychology, University of Glasgow, 58 Hillhead Street, Glasgow, UK G12 8QQ

Address correspondence to: Brian McElree, Cognition and Perception Program, Department of Psychology, New York University, 6 Washington Place, Room 860, New York, NY 10003, USA

Abstract: The “visual-world” paradigm has had an enormous impact on recent language processing research. Although we welcome the new method, we argue that it does not provide a transparent “window” on language processing. Using examples of recent visual-world studies, we show how information in the visual scene inevitably influences the processing of utterances about that scene. We conclude that findings from visual-world experiments may not always generalize to language processing in the absence of visual support.

Language is often used to talk about matters present at hand, as when people discuss and refer to objects or events in their field of vision. Much of the power of language, however, derives from the fact that it also enables one to transcend the immediate environment by talking about displaced objects and events, ones that might be in the future, the past, or somewhere distant. When language refers exclusively to matters at hand, comprehenders face a different set of challenges than when language concerns displaced objects and events.

Psychological investigations of language comprehension have typically examined the processing of isolated expressions or expressions in a purely linguistic context, using either speech or text and one of a wide range of techniques, including priming, word and phoneme monitoring, eye-tracking, and event-related brain potentials [1]. Recently, however, investigators have begun to use contexts that are both linguistic and visual. In a seminal paper, Tanenhaus et al. [2] introduced a paradigm in which participants listened to instructions that were initially consistent with two syntactic analyses while they viewed a scene containing a small number of objects. For example, they were asked to Put the frog on the napkin in the box when there were either one or two frogs in the scene. Analyses of eye movements demonstrated that participants tended to look more at the frog that was on a napkin when there was also a frog that was not on a napkin than when no other frog was present. From this, Tanenhaus et al. inferred that the visual context could drive syntactic disambiguation, and they argued for a highly interactive model of comprehension, in which non-linguistic factors play a major role in the early stages of processing [3]. This paradigm contrasts dramatically with “traditional” eye tracking, in which gaze patterns on written words are used to draw conclusions about processing [4].

Since the publication of Tanenhaus et al. [2], an enormous number of influential studies have used this “visual-world” paradigm to evaluate models of auditory word recognition [5-13], syntactic processing [14-19], anaphoric resolution [20, 21], referential processing [22-24], disfluencies [25], prosody [26], and dialogue [27, 28]. It has also been applied to topics in language acquisition, including word learning [10], early sentence comprehension [29, 30], and the development of dialogue skills [31].

This paradigm provides a welcome new methodology, but it does not provide a transparent “window” on language comprehension. This is because the presence of visual objects, particularly a few salient ones, fundamentally changes the computations involved in comprehension. Consider the following simple example. If people encounter the passage in (1), they tend to interpret a starving beast as referring to Fido, presumably because they prefer coreference to positing a new discourse entity.

1. In the morning Harry let out his dog Fido. In the evening he returned to find a starving beast.

Would the same be true if people were instructed to look at Figure 1 before encountering (1)? Now readers may be much more likely to interpret a starving beast as referring to the lion rather than to Fido. The situation is different because people do not need to posit a new entity – the entity is present at hand.

Figure 1. Objects that might be presented to participants in a visual-world experiment prior to hearing the utterance in (1).

Pictures or their real-world counterparts make an important contribution to understanding in such situations. They do not merely serve as an index of what has taken place during language comprehension but combine with the language to determine interpretation, and in doing so they modify the nature of the computations involved. If eye movements were monitored while (1) was spoken, we would probably find that people tended to look at the lion in Figure 1 after hearing a starving beast, perhaps even more than at the picture of the dog. Following the standard reasoning employed in visual-world experiments, we would conclude that listeners considered a starving beast as referring to the lion, and that they perhaps eschewed treating the phrase as coreferential with Fido. Clearly, we would be wrong to assume that this was generally true of the text in (1) presented without the accompanying visual objects. Instead, introducing the picture changes the computation.


The visual-world paradigm

More generally, the visual-world paradigm requires the following. The experiment must involve objects (or pictures of objects) related in some way to the utterance used in the experiment. Participants must know about these objects and their locations in advance of encountering the critical point in the utterance, so that eye movements do not reflect general exploratory behavior and so that participants can rapidly fixate objects in response to the utterance. Researchers draw inferences about the time-course of understanding an utterance by assuming that gazes to different objects simply index the processing that underlies the understanding of the utterance. To take findings from this paradigm as broadly representative of how language is processed, researchers must further assume that gaze patterns index processing that would have taken place were the visual objects not present. However, as the lion example above indicates, and as we illustrate in Boxes 1-3, this assumption is often dubious.

Participants interpret the objects, either visually or via linguistic coding, before interpreting the relevant part of the utterance. Crucially, the presence of the objects is likely to transform the understanding process. Participants do not simply comprehend the utterance, as in traditional paradigms, but integrate their interpretation of the utterance with the relevant objects in the visual environment. Comprehension in the presence of an object is not the same as comprehension in its absence.

Clear support for this claim comes from research on reading and on language production during dialogue. Reading is radically affected by pictures presented with the text [32, 33]. When interlocutors share a visual scene, the linguistic content of the interaction differs radically from a situation in which the scene is visible to only one interlocutor [34], presumably because the visual context provides much of the “grounding” that would otherwise have to be conveyed in words [35]. For example, speakers typically do not describe objects but rather point to them while using deictic expressions like this one or that one.

The paradigm is not, of course, invalidated because the presence and nature of the visual objects transforms the understanding process. However, it does entail that researchers must be acutely aware of its limitations, and that caution needs to be exercised in drawing general conclusions from gaze patterns. To illustrate, we take three examples of topics addressed with the method, chosen because of their representativeness and influence: anaphoric resolution (Box 1), syntactic processing (Box 2), and lexical access (Box 3). These examples illustrate concerns that the visual-world paradigm, when applied to core issues in language comprehension, may in some cases engender unrepresentative findings.

We note three specific reasons why this may be the case. First, visual stimuli can combine with linguistic stimuli in a manner that effectively constrains the referential domain in ways that may be unrepresentative of the challenges that readers and listeners face when dealing with language about displaced objects and events (Boxes 1 and 2). Second, the limited response set in visual-world experiments enables participants to form strategies that are optimal for the task demands of the experiment but that may be suboptimal or inappropriate for general communicative demands (Box 2). Finally, the dependent measure – gazes to target objects – can be driven by factors that are unrelated to the facet of the language comprehension system at issue (Boxes 1 and 3). Interested readers are encouraged at this point to consider the examples in Boxes 1-3.
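Before turning to the boxes, it may help to make the dependent measure concrete. The following is a minimal sketch of how fixation proportions per object per time bin might be computed from sample-level eye-tracking records; the data format, object labels, and bin size are illustrative assumptions, not a description of any particular study's analysis pipeline.

```python
from collections import defaultdict

# Hypothetical gaze samples: (trial, time in ms relative to the onset of
# the critical word, object currently fixated). Format and labels are
# assumed purely for illustration.
samples = [
    (1, 0, "target"), (1, 50, "target"), (1, 100, "competitor"),
    (2, 0, "distractor"), (2, 50, "target"), (2, 100, "target"),
]

def fixation_proportions(samples, bin_ms=50):
    """Proportion of samples in each time bin that fall on each object."""
    counts = defaultdict(lambda: defaultdict(int))   # bin -> object -> n
    totals = defaultdict(int)                        # bin -> total samples
    for _trial, t, obj in samples:
        b = (t // bin_ms) * bin_ms
        counts[b][obj] += 1
        totals[b] += 1
    return {b: {obj: n / totals[b] for obj, n in objs.items()}
            for b, objs in sorted(counts.items())}

print(fixation_proportions(samples))
# {0: {'target': 0.5, 'distractor': 0.5}, 50: {'target': 1.0},
#  100: {'competitor': 0.5, 'target': 0.5}}
```

Plotting such proportions against time yields the fixation curves from which time-course claims are read off; the inferential step at issue in this article is the further assumption that these curves index computations that would also occur without the display.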

Concerns, limitations, and unresolved issues

These examples illustrate the general point that researchers cannot use the results of visual-world experiments to make straightforward inferences about the comprehension of all types of language. The technique is of course wholly appropriate if the goal of the study is to investigate language use in highly constrained circumstances, as when investigating the processes underlying interactive dialogues about objects in the interlocutors' environment [27]. However, additional research using alternative methods will be needed to determine whether such findings generalize to dialogues about absent or abstract objects and events.

We suspect that intrinsic properties of the paradigm may make it ill suited to investigating certain facets of language use. For example, comprehenders routinely engage in backward or bridging inferencing to establish coherence between old and new information [36] but less often engage in forward or elaborative inferencing, where coherence is not at stake [37]. However, because the paradigm introduces a picture before the text is presented, it turns what would otherwise be a type of (optional) forward-inferencing process into a backward-inferencing process, because participants attempt to make the picture and the text cohere. For example, in Altmann and Kamide’s study [14; Box 2], participants hear The boy will eat and then attempt to make it cohere with the picture of the cake.

Additionally, we raise two general methodological concerns that require further investigation. First, the observations themselves (i.e., the patterns of eye gazes) may be driven by low-level relations between the words in the utterance [e.g., beast in (1)] and objects in the visual world, and they may not reflect the deeper computations involved in interpreting the utterance that the researcher intended to investigate (e.g., the anaphoric processes involved in processing In the evening he returned to find a starving beast). Many effects observed in this paradigm may be partly the result of participants’ regularly naming the objects covertly. When this happens, there may be a particularly direct route between the linguistic form of the object’s name and the utterance. Further research is needed to determine the extent to which visual-world results depend upon linguistic recoding (covert naming) of the objects (see Box 4 for other suggestions for future research).

Second, the effects of the objects on processing an utterance may be exacerbated by the small number of objects presented (typically about four). Presumably, small sets are necessary because larger sets would increase memory demands (both for the retention of the objects and their locations) and would dilute effects. However, small sets enable particular strategies that may be optimal for the experimental task but not necessarily representative of the general operations involved in processing language. For instance, participants may circumvent standard processing operations by developing strategies based on a limited number of representations held in working memory. Immediate effects that depend on properties of the paradigm cannot be used to provide strong support for interactive theories of comprehension.

One of the clear strengths of the visual-world paradigm is its potential to measure precisely the time-course of processing an utterance. If participants see related objects before a point of interest in the utterance, the procedure can provide fine-grained measures of how listeners incrementally integrate information from the utterance with information from the objects. However, no matter how robust, reliable, or precise those measures might be, we cannot necessarily use such results to infer directly what happens during comprehension of an utterance in isolation or in a less constrained environment. As with any behavioral measure, convergent methods are essential to establish the generality of findings and, in this case, to determine specifically whether the conclusions drawn from this paradigm are applicable to language about displaced objects and events.


Box 1: Anaphoric Resolution

Arnold et al. [20] asked participants to judge whether a text like (2) matched a picture in which either Donald or one of the mice was carrying the umbrella:

2. Donald is bringing some mail to Mickey/Minnie. He’s sauntering down the hill while a violent storm is beginning. He’s/She’s carrying an umbrella and it looks like they’re both going to need it.

When either Donald or Minnie was carrying the umbrella, participants looked at that character immediately after encountering the second pronoun (he/she). Arnold et al. argued that the pronoun was immediately resolved if it referred to the linguistically prominent entity (Donald) or to a non-prominent entity that was gender-differentiated (Minnie). But when Mickey was carrying the umbrella, participants did not look at Mickey in preference to Donald, which suggests that reference resolution tended not to occur under such conditions (and over 40% of the time participants reported that the text did not match the picture). Arnold et al. concluded that the non-prominent entity was extremely hard to identify, but that a disambiguating gender cue entirely removed this difficulty.

In contrast, text-comprehension experiments find that resolving pronominal reference to non-prominent characters is difficult even when gender disambiguates the referent. In a representative study, a text mentioning Elizabeth (prominent because she is referred to by name and mentioned first) and a male lifeguard, then referring to Elizabeth as she (thereby reinforcing her prominence), caused immediate processing difficulty if it continued with he [38]. This is not surprising, as it corresponds to a basic principle of good writing, namely that pronouns should only be used to refer to prominent entities.

Why did Arnold et al. fail to find a prominence effect? Presumably, participants heard she and looked at the picture of Minnie, because Minnie was the only female character in the picture. They used the picture to guide their interpretation of she and may well not have interpreted she anaphorically (i.e., in relation to the word Minnie). This study illustrates how use of the visual-world paradigm can obscure effects of linguistic processing.


Box 2: Syntactic Processing

Several studies of syntactic processing suggest that the visual-world paradigm may induce anticipatory strategies that would not occur without the displayed objects and that can even be interpreted as guessing behavior.

Altmann and Kamide [14; also 16] had participants listen to sentences like the boy will eat the cake or the boy will move the cake while viewing a scene containing a boy, a cake, and three other entities, and judge whether the sentence could apply to the entities. The verb eat requires an edible grammatical object, whereas move does not. Participants looked at the cake picture more with the verb eat than with the verb move, well before the onset of the word cake. Altmann and Kamide argued that participants anticipated the semantic characteristics of the grammatical object on the basis of the verb. However, the picture appeared at the start of the sentence, so participants could encode the scene before they heard the verb. Hence, participants were not inferring that the grammatical object would be edible on the basis of the boy will eat alone. Rather, they were likely to have guessed this on the basis of this sentence fragment plus the scene, which critically includes the picture of the cake. It is unclear what these findings tell us about the semantic information associated with the verb itself.

Sussman and Sedivy [18] had participants view displays containing four objects while listening to narratives mentioning those objects. Participants then answered crucial “WH-questions” (e.g., What did Jody squash the spider with?), which contained a long-distance syntactic dependency (here, between what and with), or control questions. Because participants looked disproportionately at the spider picture after What did Jody squash, Sussman and Sedivy argued that they initially interpreted this fragment as a question about what was squashed, which in this case was a spider. This interpretation is reasonable because the verb licenses possible syntactic roles for the WH-element and because results from other paradigms indicate that people anticipate the form of the question at the verb [39, 40]. However, increased looks to the spider were actually evident well before the verb, which suggests that participants guessed the form of the question from the highly constrained visual environment and limited set of referents, not from crucial properties of the linguistic input. Consequently, it is unclear what these results tell us about how long-distance dependencies are generally resolved.


Box 3: Lexical access

The visual-world paradigm has been productively applied to several issues in spoken word recognition [5-13]. For example, Allopenna et al. [5] tracked participants' looks to computer displays in which some of the pictured objects had similar names (e.g., beaker, beetle, speaker), as participants followed spoken instructions to move the objects with a mouse. Between 200 and 300 ms after the onset of the name, participants began looking at the named object (e.g., beaker), at competitors that shared an initial segment (e.g., beetle), and, to a lesser extent, at rhymes of the name (e.g., speaker). Shortly after the onset of the name, looks to the competitor were as frequent as looks to the target, but these looks diminished later in processing. This suggests that alternative representations of the stimulus initially compete, with all but the most appropriate eventually being inhibited. The form of these fixation functions accords with spoken word recognition models like TRACE [41] (see the sketch at the end of this box).

Two studies [8, 10] have demonstrated that looks to the visual target are delayed when another word temporarily matches the spoken word, even when this competitor was not displayed or mentioned in the experiment. This is an important finding because it demonstrates that the latency of looks to the target item is responsive to factors related to the listeners' mental lexicon in general, rather than just to the names of objects in the display. Analogous demonstrations are essential in other domains in which researchers have applied the visual-world paradigm [14-31] if the results of such studies are to advance our understanding of general features of the language processing system.

Results like these suggest that the visual-world paradigm may be particularly well suited to exploring some of the key issues in spoken word recognition. Nevertheless, they do not legitimize all of the conclusions researchers have drawn in this domain. Notable among these are the competitor effects described above. When the competitor is visually present, processing may be affected by interpretation of the competitor, and priming of the competitor’s representation, rather than cohort competition, may be largely responsible for the high number of looks to the competitor object in early phases of processing. Support for cohort competition is found in other paradigms [42], but so is the idea that visual representations prime lexical representations [43, 44]. As yet, it is unclear what proportion of the looks to competitor objects is due to cohort competition and what proportion to priming resulting from processing of the visual array.
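Allopenna et al. [5] linked model and data by converting TRACE's lexical activations into predicted fixation probabilities with the Luce choice rule over exponentiated activations. The sketch below illustrates that linking hypothesis; the activation values and the scaling constant k are invented for illustration and are not parameters from the original study.

```python
import numpy as np

# Invented activations for the four displayed items at a few moments
# after word onset: target "beaker", cohort "beetle", rhyme "speaker",
# unrelated "carriage". Values are illustrative only.
activations = np.array([
    [0.10, 0.10, 0.05, 0.02],  # onset: little evidence yet
    [0.45, 0.42, 0.10, 0.02],  # early: target and cohort both match /bi/
    [0.75, 0.20, 0.30, 0.02],  # later: cohort drops out, rhyme rises
    [0.95, 0.05, 0.10, 0.02],  # word offset: target dominates
])

def fixation_probs(act, k=7.0):
    """Luce choice rule over exponentiated activations:
    p_i = exp(k * a_i) / sum_j exp(k * a_j)."""
    strength = np.exp(k * act)
    return strength / strength.sum(axis=1, keepdims=True)

for step, row in enumerate(fixation_probs(activations)):
    print(f"step {step}:", np.round(row, 2))
```

On this linking hypothesis, early looks to beetle fall out of cohort competition alone; the worry raised above is that visually driven priming of the displayed competitor could produce a similar early rise for reasons unrelated to lexical competition.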


Box 4: Questions for future research

• What are the precise differences between language processing with and without a visual context?

• To what extent is the visual-world paradigm like a memory probe paradigm?

• To what extent do visual-world results depend upon linguistic recoding (covert naming) of the objects?

• Until now, visual-world studies have concentrated on the initial stages of language processing. To what extent does the visual context affect recovery from misanalysis?


References

1 Garrod, S. and Pickering, M.J. (1999) Language Processing, Psychology Press.
2 Tanenhaus, M.K. et al. (1995) Integration of visual and linguistic information in spoken language comprehension. Science, 268, 1632-1634.
3 MacDonald, M.C. et al. (1994) The lexical nature of syntactic ambiguity resolution. Psychological Review, 101, 676-703.
4 Rayner, K. (1998) Eye movements in reading and information processing: twenty years of research. Psychological Bulletin, 124, 372-422.
5 Allopenna, P.D. et al. (1998) Tracking the time course of spoken word recognition: evidence for continuous mapping models. Journal of Memory and Language, 38, 419-439.
6 Dahan, D. et al. (2000) Linguistic gender and spoken-word recognition in French. Journal of Memory and Language, 42, 465-480.
7 Dahan, D. et al. (2001) Subcategorical mismatches and the time course of lexical access: evidence for lexical competition. Language and Cognitive Processes, 16, 507-534.
8 Dahan, D. et al. (2001) Time course of frequency effects in spoken-word recognition: evidence from eye movements. Cognitive Psychology, 42, 317-367.
9 Dahan, D. et al. (2002) Accent and reference resolution in spoken-language comprehension. Journal of Memory and Language, 47, 292-314.
10 Magnuson, J.S. et al. (2003) The time course of spoken word learning and recognition: studies with artificial lexicons. Journal of Experimental Psychology: General, 132, 202-227.
11 McMurray, B. et al. (2002) Gradient effects of within-category phonetic variation on lexical access. Cognition, 86, B33-B42.
12 McMurray, B. et al. (2003) Probabilistic constraint satisfaction at the lexical/phonetic interface: evidence for gradient effects of within-category VOT on lexical access. Journal of Psycholinguistic Research, 32, 77-97.
13 Spivey, M.J. and Marian, V. (1999) Cross talk between native and second languages: partial activation of an irrelevant lexicon. Psychological Science, 10, 281-284.
14 Altmann, G.T.M. and Kamide, Y. (1999) Incremental interpretation at verbs: restricting the domain of subsequent reference. Cognition, 73, 247-264.
15 Eberhard, K.M. et al. (1995) Eye-movements as a window into real-time spoken language comprehension in natural contexts. Journal of Psycholinguistic Research, 24, 409-436.
16 Kamide, Y. et al. (2003) The time-course of prediction in incremental sentence processing: evidence from anticipatory eye movements. Journal of Memory and Language, 49, 133-156.
17 Spivey, M.J. et al. (2002) Eye movements and spoken language comprehension: effects of visual context on syntactic ambiguity resolution. Cognitive Psychology, 45, 447-481.


18 Sussman, R.S. and Sedivy, J.C. (2003) The time-course of processing syntactic dependencies: evidence from eye movements. Language and Cognitive Processes, 18, 143-163.
19 Tanenhaus, M.K. et al. (2000) Eye movements and lexical access in spoken-language comprehension: evaluating a linking hypothesis between fixations and linguistic processing. Journal of Psycholinguistic Research, 29, 557-580.
20 Arnold, J.E. et al. (2000) The rapid use of gender information: evidence of the time course of pronoun resolution from eyetracking. Cognition, 76, B13-B26.
21 Runner, J.T. et al. (2003) Assignment of reference to reflexives and pronouns in picture noun phrases: evidence from eye movements. Cognition, 89, B1-B13.
22 Chambers, C.G. et al. (2002) Circumscribing referential domains during real-time language comprehension. Journal of Memory and Language, 47, 30-49.
23 Hanna, J.E. et al. (2003) The effects of common ground and perspective on domains of referential interpretation. Journal of Memory and Language, 49, 43-61.
24 Sedivy, J.C. et al. (1999) Achieving incremental semantic interpretation through contextual representation. Cognition, 71, 109-147.
25 Arnold, J.E. et al. (2003) Disfluencies signal theee, um, new information. Journal of Psycholinguistic Research, 32, 25-36.
26 Snedeker, J. and Trueswell, J. (2003) Using prosody to avoid ambiguity: effects of speaker awareness and referential context. Journal of Memory and Language, 48, 103-130.
27 Brown-Schmidt, S. et al. (in press) Real-time reference resolution in a referential communication task. In Processing World-Situated Language: Bridging the Language as Product and Language as Action Traditions (Trueswell, J.C. and Tanenhaus, M.K., eds), MIT Press.
28 Metzing, C. and Brennan, S.E. (2003) When conceptual pacts are broken: partner-specific effects on the comprehension of referring expressions. Journal of Memory and Language, 49, 201-213.
29 Hurewitz, F. et al. (2000) One frog, two frog, red frog, blue frog: factors affecting children's syntactic choices in production and comprehension. Journal of Psycholinguistic Research, 29, 597-626.
30 Trueswell, J.C. et al. (1999) The kindergarten-path effect: studying on-line sentence processing in young children. Cognition, 73, 89-134.
31 Nadig, A.S. and Sedivy, J.C. (2002) Evidence of perspective-taking constraints in children's on-line reference resolution. Psychological Science, 13, 329-336.
32 Rayner, K. et al. (2001) Integrating text and pictorial information: eye movements when looking at print advertisements. Journal of Experimental Psychology: Applied, 7, 219-226.
33 Underwood, G. et al. (2004) Inspecting pictures for information to verify a sentence: eye movements in general encoding and in focused search. Quarterly Journal of Experimental Psychology: Human Experimental Psychology, 57, 165-182.
34 Clark, H.H. and Krych, M.A. (2004) Speaking while monitoring addressees for understanding. Journal of Memory and Language, 50, 62-81.
35 Clark, H.H. and Marshall, C. (1981) Definite reference and mutual knowledge. In Elements of Discourse Understanding (Joshi, A.K. et al., eds), pp. 10-63, Cambridge University Press.

36 Clark, H. and Haviland, S. (1977) Comprehension and the given-new contract. In Discourse Production and Comprehension (Freedle, R., ed.), pp. 1-40, Erlbaum.
37 Garrod, S. et al. (1990) Elaborative inferencing as an active or passive process. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 250-257.
38 Garrod, S. et al. (1994) The role of different types of anaphor in the on-line resolution of sentences in a discourse. Journal of Memory and Language, 33, 39-68.
39 Stowe, L.A. (1986) Parsing WH-constructions: evidence for on-line gap location. Language and Cognitive Processes, 1, 227-245.
40 Traxler, M.J. and Pickering, M.J. (1996) Plausibility and the processing of unbounded dependencies: an eye-tracking study. Journal of Memory and Language, 35, 454-475.
41 McClelland, J.L. and Elman, J.L. (1986) Interactive processes in speech perception: the TRACE model. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vol. 2) (Rumelhart, D.E. and McClelland, J.L., eds), pp. 58-121, MIT Press.
42 Luce, P.A. and Pisoni, D.B. (1998) Recognizing spoken words: the neighborhood activation model. Ear and Hearing, 19, 1-36.
43 Glaser, W.R. and Düngelhoff, F.-J. (1984) The time course of picture-word interference. Journal of Experimental Psychology: Human Perception and Performance, 10, 640-654.
44 Bloem, I. and La Heij, W. (2003) Semantic facilitation and semantic interference in word translation: implications for models of lexical access in language production. Journal of Memory and Language, 48, 468-488.

