Exp Brain Res (2007) 183:399–404 DOI 10.1007/s00221-007-1110-1

RESEARCH NOTE

Attention to touch weakens audiovisual speech integration

Agnès Alsius · Jordi Navarra · Salvador Soto-Faraco

Received: 29 January 2007 / Accepted: 14 August 2007 / Published online: 25 September 2007 © Springer-Verlag 2007

Abstract One of the classic examples of multisensory integration in humans occurs when speech sounds are combined with the sight of corresponding articulatory gestures. Despite the longstanding assumption that this kind of audiovisual binding operates in an attention-free mode, recent findings (Alsius et al. in Curr Biol, 15(9):839–843, 2005) suggest that audiovisual speech integration decreases when visual or auditory attentional resources are depleted. The present study addressed the generalization of this attention constraint by testing whether a similar decrease in multisensory integration is observed when attention demands are imposed on a sensory domain that is not involved in speech perception, such as touch. We measured the McGurk illusion in a dual task paradigm involving a difficult tactile task. The results showed that the percentage of visually influenced responses to audiovisual stimuli was reduced when attention was diverted to a tactile task. This finding is attributed to a modulatory effect on the audiovisual integration of speech mediated by supramodal attention limitations. We suggest that the interactions between the attentional system and crossmodal binding mechanisms may be much more extensive and dynamic than was suggested by previous studies.

Keywords Attention · Multisensory integration · Speech perception · Touch

A. Alsius · S. Soto-Faraco
Departament de Psicologia Bàsica, Universitat de Barcelona, Pg. Vall d'Hebrón 171, Barcelona 08035, Spain
e-mail: [email protected]

A. Alsius · J. Navarra · S. Soto-Faraco (corresponding author)
Parc Científic de Barcelona, Hospital Sant Joan de Déu (Edifici Docent), c/ Santa Rosa, 39-57, Planta 4ª, 08950 Esplugues de Llobregat (Barcelona), Spain
e-mail: [email protected]

J. Navarra
Department of Experimental Psychology (Crossmodal Research Laboratory), University of Oxford, Oxford, UK
e-mail: [email protected]

S. Soto-Faraco
Institució Catalana de Recerca i Estudis Avançats, Barcelona, Spain

Introduction

One of the most cited examples of the automatic, attention-free nature of multisensory integration is audiovisual speech processing. The ability to match heard vocalizations with seen facial gestures appears early in life (e.g., Burnham and Dodd 2004; Kuhl and Meltzoff 1982), is displayed by nonhuman primates (Ghazanfar and Logothetis 2003), and is even observed across species (Lewkowicz and Ghazanfar 2006). This, together with the striking phenomenology of illusions such as the McGurk effect (McGurk and MacDonald 1976), whereby the view of a speech gesture leads to an altered acoustic experience, suggests that multisensory integration might be largely immune to top-down modulations. Previous studies supporting the pre-attentive nature of audiovisual speech integration relied on explicit instructions directing observers' attention to one or another sensory modality while assessing the prevalence of the McGurk effect (e.g., Massaro 1987), on diverting attention from the audiovisual event indirectly (Soto-Faraco et al. 2004), or on measuring the electrophysiological correlates of audiovisual speech (e.g., Colin et al. 2002). In stark contrast with claims of automaticity, however, Alsius et al. (2005) found that the percentage of illusory McGurk responses decreased dramatically when participants were concurrently performing an unrelated, demanding visual or auditory task. This result casts doubt on previous assumptions of automatic, attention-free multisensory integration, and conforms well to perceptual load theory (Lavie 1995), whereby the extent of audiovisual integration would depend on the available attentional resources.

Outside the domain of speech, it has often been claimed that, under the appropriate stimulus conditions, crossmodal binding arises independently of the observer's attentional state (Bertelson et al. 2000; Bertelson and Radeau 1981; Pick et al. 1969; Vroomen et al. 2001a, b; see de Gelder and Bertelson 2003, for a review). However, recent behavioural (Fujisaki et al. 2006), electrophysiological (Talsma and Woldorff 2005) and functional neuroimaging studies (Van Atteveldt et al. 2007; Degerman et al. 2007) have started to suggest that audiovisual integration can be modulated by top-down task demands, adding weight to the evidence that attention might indeed play a role in multisensory integration.

The present study was designed to further explore the nature of the role played by attention in audiovisual speech integration. In particular, we addressed, for the first time, whether attentional modulation of multisensory integration could be observed when attention demands were imposed on a sensory modality not involved in the integration process. Attention to touch was manipulated while the degree of audiovisual integration (i.e., the prevalence of the McGurk effect) was measured. This experimental design allowed us to investigate the generality of the competition for attentional resources potentially involved in multisensory integration. If the attentional resources modulating the audiovisual integration of speech are specific to the modalities involved in the binding process, the McGurk effect should occur regardless of the demands imposed on touch. Alternatively, if McGurk responses are reduced when participants perform the attention-demanding tactile task, it will be possible to conclude that the competition for attentional resources occurs at a more central, hetero-modal level.

Experimental procedures

Participants

Thirty-two undergraduates (6 males; mean age = 21.5 years, range 19-38) from the University of Barcelona participated in the study in exchange for course credit. All of them reported normal hearing, normal or corrected-to-normal vision, and normal tactile sensitivity.

Fig. 1 a Stimuli and task. All participants were required to repeat back verbally any words the actor on the screen said. In the dual task condition participants were also asked to attend to the stimuli in the tactile stream in order to respond as fast as possible whenever they detected that two successive pairs of fingertips had been stimulated in a symmetrical fashion (see Experimental procedures). b Display condition. Each participant performed both tasks (Single and Dual) under two different conditions: half of the participants were presented with the audiovisual and the auditory displays (Group AV_A), whereas the other half were presented with the audiovisual and the visual displays (Group AV_V)

Stimuli

The audiovisual stimuli consisted of a video clip featuring a female speaker (frontal view of the face; see Fig. 1) uttering a list of Spanish words at unpredictable times (every 21 s on average, ±16 s). The audio and video channels had been cross-dubbed, with the acoustic and visual words (both meaningful) being identical except for one single phoneme, which was selected to give rise to the McGurk illusion when dubbed. Words differed in the type of phonemes leading to the McGurk effect (which could be [g], [k] or [n] visually and /b/, /p/ or /m/ auditorily) and in the position of the mismatching phoneme within the word (beginning or middle). The expected result of the audiovisual combination, in case observers fused the audiovisual information, was a meaningful word (in some pairs it was a new word, as in /bait/ + [gate] = "date," whereas in others it matched the visual word, as in /met/ + [net] = "net"; see Kaiser et al. 2004; Massaro 1998; Tuomainen et al. 2005 for similar examples of McGurk stimuli in which the perceived acoustic illusion matches the visual component).1 The difference in frequency between the words of each pair (and the expected fused response) was matched across the four word lists used (see below). The selection of the expected fused words was based on previous results using the same materials (Alsius et al. 2005) and on the most plausible phoneme that, according to previous literature, should arise from the integration of both inputs (see McGurk and MacDonald 1976; Green and Kuhl 1991).

We built four audiovisual sequences containing 17 different randomly ordered words each (13 McGurk combinations plus 4 fillers with matching visual and audio words), interspersed among video recordings of the speaker silently looking at the camera. Three different display conditions (visual-only, auditory-only, and audiovisual; see Fig. 1b) of each of the four equivalent word sequences were created, leading to a total of 12 sequences (4 auditory, 4 visual, 4 audiovisual; each approximately 6 min in duration). The visual-only condition was produced by adding white noise to the soundtrack at a signal-to-noise ratio of -26 dB (thus effectively rendering the auditory words unintelligible). In the auditory-only condition, a video quantization effect was applied to degrade the image so that a detailed view of the lip movements was prevented whereas the overall features of the video display were preserved. The audiovisual condition contained both signals intact. Sequence versions were counterbalanced across participants, so that each participant was presented with each word combination only once.

Participants viewed the video-clip sequences, converted to uncompressed AVI digital files, from a distance of 30 cm on a 15" TFT computer monitor that showed the speaker's face at 50% of the full screen. The audio channel was played through two loudspeakers located at either side of the monitor, at a comfortable intensity (64 dB[A]). Tactile stimulation was delivered by four tappers (Solenoid Tactile Tapper, M&E Solve, UK) of 8 mm diameter, driven by a 9 V signal of 30 ms duration that produced a tap of supra-threshold intensity. The tappers were arranged in a square (1 cm side) on a foam support located behind the computer monitor where the speech stimuli were presented. Participants looked in the direction of, but could not directly see, the stimulators or their own hands. To mask any sounds made by the tactile tappers, white noise (40 dB[A]) was presented throughout the experiment from two loudspeakers placed behind the flat screen.

1 Due to intrinsic constraints of the lexicon, there were just a few exemplars in our lists where a new word could be created by the given pair of dubbed words. For most of the pairs, the expected combination of the acoustic and the visual stimulus would correspond to the visual word. For this reason, fusion and visual responses were treated together in the analyses.
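As an illustration of the auditory masking step described above, the sketch below scales white noise to a target signal-to-noise ratio of -26 dB before mixing it with a speech waveform. This is a minimal sketch, not the procedure actually used to build the stimuli; the function name mix_at_snr and the dummy waveform are hypothetical, and a mono floating-point signal is assumed.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, target_snr_db: float, rng=None) -> np.ndarray:
    """Add white noise to a mono speech waveform at a given SNR (in dB).

    A strongly negative SNR (e.g. -26 dB) makes the noise much louder than
    the speech, the kind of masking described for the visual-only displays.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(speech.shape)

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power)
    # equals the requested SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))
    mixed = speech + scale * noise

    # Normalize to avoid clipping when the result is written back to a file.
    return mixed / np.max(np.abs(mixed))

# Example: mask a 1-s, 44.1-kHz dummy waveform at -26 dB SNR.
dummy_speech = np.sin(2 * np.pi * 220 * np.arange(44100) / 44100)
masked = mix_at_snr(dummy_speech, target_snr_db=-26.0)
```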


Procedure

All participants were tested in both the dual and the single task conditions.2 In order to include Task as a within-participant variable in the design while keeping the experiment at a reasonable duration, all participants were tested under the audiovisual condition and just one of the two possible unimodal conditions (half of the participants were tested with the auditory displays, whereas the other half were tested with the visual displays). Each participant was thus tested in four blocks, with the order of task (single, dual) counterbalanced across participants, and the order of display condition (AV vs. unimodal) counterbalanced within each task block. All participants were instructed to fixate the speaker on the monitor while placing the index and middle fingertips of their left and right hands on the tactile tappers, and to repeat back verbally any words the speaker said. In the dual task blocks participants were asked, in addition, to respond (via a foot pedal placed under their dominant foot) to tactile targets interspersed amongst a stream of tactile events (one every 1.2 s). Each tactile event consisted of two of the four tappers (chosen at random) activated for 30 ms, the targets (one every 7.2 s on average) being tactile events that were spatially symmetrical with the preceding stimulus; that is, when the stimulation of two fingertips was followed by the stimulation of their opposite counterparts. Participants were given 2 min of practice before starting the first block of each task condition (dual or single). Altogether, the experimental session lasted around 40 min.

2 A within-subjects design was used on the basis of the evidence provided by a previous pilot experiment in which, as in Alsius et al. (2005), the effects of task were tested in a between-participants design. That pilot study showed a trend towards the same attentional effects reported here, but these effects did not reach significance. As attentional competition has been shown to be weaker between than within sensory modalities (e.g., Eimer and Van Velzen 2002; Hillyard et al. 1984; McDonald and Ward 2000), a within-participants design was implemented in the present experiment in order to gain statistical power and detect any small but reliable effects of attention potentially affecting audiovisual integration.

Results

Tactile task

The hit rate (responses within 250 and 3,250 ms from target onset) was 0.41, and the false alarm rate (any other response) was 0.06. No differences were found between groups [hits: t(31) = 1.28, P = 0.209; false alarms: t(31) = 0.81, P = 0.424] or between display conditions [AV vs. A, hits: F(1,15) = 0.892, P = 0.36; false alarms: F(1,15) = 0.434, P = 0.52; AV vs. V, hits: F(1,15) = 2.00, P = 0.177; false alarms: F(1,15) = 0.492, P = 0.492].
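To make the target definition and the scoring window concrete, here is a minimal sketch of how the symmetry rule and the hit criterion could be implemented. It is not the authors' code: the fingertip coding, the OPPOSITE mapping, and the names is_symmetric_target and score_hits are assumptions introduced for illustration; only the 250-3,250 ms response window and the paired-fingertip events come from the description above.

```python
import random

# Assumed fingertip codes: 0 = left index, 1 = left middle,
#                          2 = right index, 3 = right middle.
OPPOSITE = {0: 2, 1: 3, 2: 0, 3: 1}  # assumed left/right mirror mapping

def is_symmetric_target(previous: frozenset, current: frozenset) -> bool:
    """A tactile event is a target when it mirrors the preceding event."""
    return current == frozenset(OPPOSITE[f] for f in previous)

def score_hits(target_times, response_times, lo=0.25, hi=3.25):
    """Count responses falling 250-3,250 ms after a target onset as hits."""
    hits = sum(any(lo <= r - t <= hi for r in response_times)
               for t in target_times)
    return hits / len(target_times) if target_times else 0.0

# Example: a random stream of paired-fingertip events; symmetric pairs are targets.
events = [frozenset(random.sample(range(4), 2)) for _ in range(10)]
targets = [i for i in range(1, len(events))
           if is_symmetric_target(events[i - 1], events[i])]
print(targets)
```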



Table 1  Mean proportion of each response type (SE within parentheses) as a function of task and display condition

                          Group AV_A                                                Group AV_V
                          Single                      Dual                          Single                      Dual
Response type             AV           Audio          AV           Audio            AV           Visual         AV           Visual
Auditory                  0.23 (0.04)  0.75 (0.03)    0.33 (0.04)  0.74 (0.03)      0.25 (0.06)  0.00 (0.00)    0.31 (0.06)  0.00 (0.00)
Visual/fusion (McGurk)    0.67 (0.04)  0.20 (0.04)    0.56 (0.03)  0.17 (0.02)      0.68 (0.06)  0.10 (0.03)    0.57 (0.06)  0.12 (0.02)
Other                     0.10 (0.02)  0.05 (0.02)    0.10 (0.01)  0.08 (0.02)      0.07 (0.02)  0.90 (0.03)    0.12 (0.02)  0.88 (0.02)

Each average is based on 16 subjects × 13 trials, for a total of 208 observations per cell. In an additional ANOVA using the proportion of auditory responses as the dependent measure, the findings reported in the main text were confirmed. In particular, there was a significant interaction between task (single vs. dual) and display condition [F(1,30) = 5.19, P < 0.05]. The interaction was caused by a significant task effect in the bimodal (AV) display condition [t(31) = 2.32, P < 0.05] but not in the unimodal (A and V) display conditions (|t| < 1). When each unimodal display condition (Group) was analysed separately, still no effects of the task manipulation were observed in the unimodal auditory or visual display conditions (both |t| < 1)

Word recall task

Responses were scored (see Table 1) as auditory when they matched the auditorily recorded word. Visual and fusion responses were pooled under one category (given that the expected fused response coincided with the visually presented word in many of the stimuli). Any word that did not match either of these alternatives was scored as "other". Given that the object of interest was to measure visual effects on heard speech, the proportion of visually influenced responses in the word recall task was used as the dependent variable in all conditions, including the ones serving as baseline (in the auditory-only condition this measure provided an index of the visually induced responses that could be expected just by chance; note that, as the word combinations used to produce the McGurk illusion differ in just one phoneme and were acoustically similar, the distribution of errors might be far from random). These data were submitted to an analysis of variance (ANOVA) with two within-participants factors, Display Condition (bimodal vs. unimodal) and Task (single vs. dual), and one between-participants factor, Group (Group AV_A vs. Group AV_V). The effect of Display Condition was significant [F(1,30) = 227.25, P < 0.001], whereas the main effect of Task fell short of significance [F(1,30) = 3.42, P = 0.075]. Crucially, the interaction between Task and Display Condition was significant [F(1,30) = 6.86, P < 0.05], indicating that Task had an effect in the audiovisual (bimodal) condition (56% visually influenced responses under the dual task condition vs. 67% under the single task condition; t(31) = -2.6, P < 0.02), but not in the unimodal condition [t(31) = -0.25, P = 0.80; see Fig. 2]. Neither the main effect of Group [F(1,30) = 1.33, P = 0.26] nor the Task by Group or Group by Display Condition interactions were significant [F(1,30) = 0.12, P = 0.73 and F(1,30) = 1.55, P = 0.22, respectively].


Fig. 2  Average proportion of fusion/visual responses (McGurk effect), shown separately for the bimodal and unimodal display conditions and for each task condition (white bars single task, grey bars dual task). Error bars represent the SE of the mean. The asterisk denotes a significant difference between the dual and single task
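For illustration, the sketch below tests the critical Task by Display Condition interaction as a paired comparison of single-minus-dual difference scores, which for a 2 x 2 within-participant design is equivalent to the interaction F-test. It is a hedged reconstruction rather than the original analysis script: the paper's full ANOVA also included the between-participants Group factor, which this sketch omits, and the array of proportions is a random placeholder, not the reported data.

```python
import numpy as np
from scipy import stats

# Placeholder data: one row per participant, proportion of visually influenced
# responses in each within-participant cell.
# Columns: AV-single, AV-dual, unimodal-single, unimodal-dual.
rng = np.random.default_rng(0)
props = rng.uniform(0.0, 1.0, size=(32, 4))  # stand-in for the real scores

av_effect = props[:, 0] - props[:, 1]    # single minus dual, bimodal displays
uni_effect = props[:, 2] - props[:, 3]   # single minus dual, unimodal displays

# Task x Display interaction: does the task effect differ between display types?
t_interaction, p_interaction = stats.ttest_rel(av_effect, uni_effect)

# Simple effects of Task within each display type (as in the follow-up tests).
t_av, p_av = stats.ttest_rel(props[:, 0], props[:, 1])
t_uni, p_uni = stats.ttest_rel(props[:, 2], props[:, 3])

print(f"interaction:          t(31) = {t_interaction:.2f}, p = {p_interaction:.3f}")
print(f"AV task effect:       t(31) = {t_av:.2f}, p = {p_av:.3f}")
print(f"unimodal task effect: t(31) = {t_uni:.2f}, p = {p_uni:.3f}")
```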

Given that the different unimodal conditions were run in different groups, and in order to rule out the possibility that the null task effect obtained in the unimodal condition was caused by opposing task effects in the two (auditory and visual) unimodal groups,3 we then tested the reliability of the critical Task by Condition interaction in each group separately. Mirroring the pattern found in the main analysis, in Group AV_A the task manipulation had an effect in audiovisual displays [t(15) = -2.27, P < 0.05] but not in auditory-alone displays [t(15) = -0.758, P = 0.46]. Similarly, in Group AV_V the effect of Task was significant in bimodal displays [t(15) = -2.2, P < 0.05] but not in visual-alone displays [t(15) = 0.39, P = 0.67].

3 Note that, if the demands of the concurrent task prevented both auditory and visual unimodal processing, one would expect a reduction of visually influenced responses in the visual condition, but an increase of these responses in the auditory displays (participants' misunderstanding of the auditory words would lead to an increase of their visual counterparts, owing to phonological similarity).

Note that determining whether the secondary tactile task has a larger effect on the audiovisual condition than on the control unimodal conditions is crucial for specifying the processing stage at which attention interacts with audiovisual speech integration. That is, rather than modulating the audiovisual integration mechanism per se, attention demands might have selectively reduced the processing of one of the two unimodal sources of information before audiovisual speech integration takes place (see Massaro 1998; Tiippana et al. 2004). Nevertheless, the effect of the secondary task on the percentage of visual/fusion responses observed in the audiovisual condition, combined with the lack of a task effect in the control unimodal conditions (auditory-only and visual-only), clearly suggests that tactile attention modulated speech processing at the level of audiovisual integration, beyond unimodal levels of processing. As expected, the percentage of visually influenced responses in the auditory-only condition was rather low and, in fact, possibly attributable to acoustic errors given the 40 dB white noise played throughout the experiment. Note that this condition did not show any effect of task even when considering the proportion of auditory-based responses (see Table 1). The visual-only condition led to a percentage of correct visual responses clearly above chance (10 and 12% correct recall in the single and dual task, respectively); therefore a floor effect is unlikely to account for the interaction. Yet, to further strengthen the reliability of the interaction, we performed additional analyses after re-scoring the responses to the unimodal visual condition according to a more liberal criterion that allowed for variations within the phonological equivalence classes (see Mattys et al. 2002) of the critical phoneme in the visual words. Under this new scoring criterion the proportion of correct responses was obviously higher and yet, reinforcing our conclusions, no differences in participants' visual/fusion responses were found between the dual and the single task conditions [57 vs. 60%; t(17) = 0.797, P = 0.436].
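The liberal re-scoring could be implemented along the lines of the sketch below, where a response counts as visually correct if its critical phoneme belongs to the same visual equivalence class as the target phoneme. The class groupings and the function names same_class and liberal_match are illustrative assumptions, not the classes actually used by the authors.

```python
# Illustrative equivalence classes for the critical phonemes: phonemes that
# look alike on the lips are grouped together (assumed groupings, for
# demonstration only).
EQUIVALENCE_CLASSES = [
    {"b", "p", "m"},   # bilabials
    {"g", "k", "n"},   # non-labial phonemes used in the visual tokens
]

def same_class(phoneme_a: str, phoneme_b: str) -> bool:
    """True if both phonemes fall in the same visual equivalence class."""
    return any(phoneme_a in cls and phoneme_b in cls
               for cls in EQUIVALENCE_CLASSES)

def liberal_match(response_phoneme: str, target_phoneme: str) -> bool:
    """Liberal scoring: exact match, or any phoneme from the same class."""
    return (response_phoneme == target_phoneme
            or same_class(response_phoneme, target_phoneme))

# Example: under the strict criterion "k" is wrong for a visual /g/ word,
# but under the liberal criterion it counts as visually correct.
print(liberal_match("k", "g"))   # True
print(liberal_match("b", "g"))   # False
```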

General discussion

The results of the present study clearly revealed that increasing the demands on tactile attention affects the integration of audiovisual speech. Critically, this modulation occurred even though no attention effect was found when testing performance in the respective unimodal control conditions. Therefore, a global deterioration in recall performance due to the costs of the dual task cannot account for the present results (i.e., interference at the level of output mechanisms should have produced a decline in the unimodal dual task blocks as well). Moreover, a general interference account would predict an increase in "other" responses to AV stimuli in the dual task, rather than the increase specific to auditory responses that was in fact observed (see Table 1).


The favoured interpretation is, instead, that the diversion of attentional resources to the tactile task had a significant impact on audiovisual integration processes. Although it might be argued that tactile attention exerted an effect on visual speech processing, which then carried forward to the integration stage, we did not find any evidence of task effects in the visual-alone control condition (Group AV_V). This was true even when analysing the data from the visual-only conditions with a liberal criterion, which effectively rules out floor effects. A possible explanation for the decline in AV integration (and the increase in auditory responses) under dual task conditions is that, when attention demands exceeded the available resources, the processing burden was reduced by filtering out the least informative modality overall: the visual information provided by the articulatory movements of the speaker. If this were the case, one would have to conclude that the attentional manipulation (in touch) modulated the weight given to visual information during audiovisual speech perception. Whether or not this is the particular way in which attention affects multisensory integration needs to be investigated further.

The present result clearly challenges strong pre-attentive accounts of audiovisual integration (McGurk and MacDonald 1976; Massaro 1987; Soto-Faraco et al. 2004; Colin et al. 2002; Bernstein et al. 2004) and is in line with several reports claiming that cognitive processes can modulate audiovisual integration, both for speech (Alsius et al. 2005; Massaro 1998; Soto-Faraco and Alsius 2006; Tiippana et al. 2004; Tuomainen et al. 2005) and for non-speech stimuli (Talsma and Woldorff 2005; Talsma et al. 2007; Fujisaki et al. 2006). Our results also support previous suggestions of links in endogenous attention between touch, vision and audition (Eimer et al. 2002). However, whereas previous demonstrations of attentional interdependence among modalities have been explored particularly well in the spatial domain (see Spence and Driver 2004, for a review), the current finding suggests that these links may also constrain the binding of crossmodal information. This suggests that interactions between the attentional system and crossmodal binding mechanisms may be much more extensive and dynamic than was advanced by some previous studies (Alais et al. 2006; Duncan et al. 1997; Rees et al. 2001; Wickens 1984).

The main result emerging from the present study is clear in that overloading tactile attention, a modality not involved in audiovisual speech integration, can modulate the crossmodal binding of visual and acoustic speech signals. This finding places the attentional processes involved in audiovisual speech integration at a general, as opposed to modality-specific, level. Top-down modulatory signals coming from higher-order brain structures involved in directing voluntary attention could be at the origin of these modulations (e.g., Kanwisher and Wojciulik 2000). This conceptualization fits well with neuroimaging studies investigating attentional influence on the brain correlates of sensory processing outside the domain of speech (Amedi et al. 2001; Macaluso et al. 2002). According to these studies, crossmodal links may not rely solely on feed-forward convergence from unisensory regions to multimodal brain areas, but may also implicate back-projections to multiple levels of (early) sensory processing based on current task demands (Calvert et al. 1999, 2000).

References

Alais D, Morrone C, Burr D (2006) Separate attentional resources for vision and audition. Proc Biol Sci 273(1592):1339–1345
Amedi A, Malach R, Hendler T, Peled S, Zohary E (2001) Visuo-haptic object-related activation in the ventral visual pathway. Nat Neurosci 4:324–330
Alsius A, Navarra J, Campbell R, Soto-Faraco S (2005) Audiovisual integration of speech falters under high attention demands. Curr Biol 15(9):839–843
Bernstein LE, Auer ET Jr, Moore JK (2004) Audiovisual speech binding: convergence or association? In: Calvert GA, Spence C, Stein BE (eds) The handbook of multisensory processes. The MIT Press, Cambridge, pp 203–224
Bertelson P, Radeau M (1981) Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys 29:578–584
Bertelson P, Vroomen J, de Gelder B, Driver J (2000) The ventriloquist effect does not depend on the direction of deliberate visual attention. Percept Psychophys 62(2):321–332
Burnham D, Dodd B (2004) Auditory-visual speech integration by prelinguistic infants: perception of an emergent consonant in the McGurk effect. Dev Psychobiol 45(4):204–220
Calvert GA, Brammer MJ, Bullmore ET, Campbell R, Iversen SD, David AS (1999) Response amplification in sensory-specific cortices during cross-modal binding. Neuroreport 10:2619–2623
Calvert GA, Campbell R, Brammer MJ (2000) Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Curr Biol 10(11):649–657
Colin C, Radeau M, Soquet A, Demolin D, Colin F, Deltenre P (2002) Mismatch negativity evoked by the McGurk–MacDonald effect: a phonetic representation within short-term memory. Clin Neurophysiol 113:495–506
de Gelder B, Bertelson P (2003) Multisensory integration, perception and ecological validity. Trends Cogn Sci 7(10):460–467
Degerman A, Rinne T, Pekkola J, Autti T, Jääskeläinen I, Sams M, Alho K (2007) Human brain activity associated with audiovisual perception and attention. Neuroimage 34(4):1683–1691
Duncan J, Martens S, Ward R (1997) Restricted attentional capacity within but not between sensory modalities. Nature 387:808–810
Eimer M, Van Velzen J (2002) Crossmodal links in spatial attention are mediated by supramodal control processes: evidence from event-related brain potentials. Psychophysiology 39:437–449
Eimer M, van Velzen J, Driver J (2002) Crossmodal interactions between audition, touch and vision in endogenous spatial attention: ERP evidence on preparatory states and sensory modulations. J Cogn Neurosci 14:254–271
Fujisaki W, Koene A, Arnold D, Johnston A, Nishida S (2006) Visual search for a target changing in synchrony with an auditory signal. Proc R Soc B 273:865–874
Ghazanfar AA, Logothetis NK (2003) Facial expressions linked to monkey calls. Nature 423:937–938
Green KP, Kuhl PK (1991) Integral processing of visual place and auditory voicing information during phonetic perception. J Exp Psychol Hum Percept Perform 17:278–288


Hillyard SA, Simpson GV, Woods DL, VanVoorhis S, Münte TF (1984) Event-related brain potentials and selective attention to different modalities. In: Reinoso-Suarez F, Aimone-Marsan C (eds) Cortical integration. Raven, New York, pp 395–413
Kaiser J, Hertrich L, Ackermann H, Mathiak K, Lutzenberger W (2004) Hearing lips: gamma-band activity during audiovisual speech perception. Cereb Cortex 15:646–653
Kanwisher N, Wojciulik E (2000) Visual attention: insights from brain imaging. Nat Rev Neurosci 1:91–100
Kuhl PK, Meltzoff AN (1982) The bimodal perception of speech in infancy. Science 218:1138–1141
Lavie N (1995) Perceptual load as a necessary condition for selective attention. J Exp Psychol Hum Percept Perform 21:451–468
Lewkowicz DJ, Ghazanfar AA (2006) The decline of cross-species intersensory perception in human infants. Proc Natl Acad Sci USA 103:6771–6774
Macaluso E, Frith CD, Driver J (2002) Supramodal effects of covert spatial orienting triggered by visual or tactile events. J Cogn Neurosci 14(3):389–401
McDonald JJ, Ward LM (2000) Involuntary listening aids seeing: evidence from human electrophysiology. Psychol Sci 11:167–171
Mattys S, Bernstein LE, Auer ET (2002) Stimulus-based lexical distinctiveness as a general word recognition mechanism. Percept Psychophys 64(4):667–679
Massaro DW (1987) Speech perception by ear and eye. LEA, Hillsdale
Massaro DW (1998) Perceiving talking faces: from speech perception to a behavioral principle. MIT Press, Cambridge
McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 265:746–748
Pick HL Jr, Warren DH, Hay JC (1969) Sensory conflict in judgements of spatial direction. Percept Psychophys 6:203–205
Rees G, Frith CD, Lavie N (2001) Perception of irrelevant visual motion during performance of an auditory task. Neuropsychologia 39:937–949
Soto-Faraco S, Alsius A (2006) Conscious access to the unisensory components of a cross-modal illusion. Neuroreport 18:347–350
Soto-Faraco S, Navarra J, Alsius A (2004) Assessing automaticity in audiovisual speech integration: evidence from the speeded classification task. Cognition 92:B13–B23
Spence C, Driver J (eds) (2004) Crossmodal space and crossmodal attention. Oxford University Press, Oxford
Talsma D, Woldorff MG (2005) Selective attention and multisensory integration: multiple phases of effects on the evoked brain activity. J Cogn Neurosci 17(7):1098–1114
Talsma D, Doty T, Woldorff MG (2007) Selective attention and audiovisual integration: is attending to both modalities a prerequisite for early integration? Cereb Cortex 17:679–690
Tiippana K, Andersen TS, Sams M (2004) Visual attention modulates audiovisual speech perception. Eur J Cogn Psychol 16:457–472
Tuomainen J, Andersen TS, Tiippana K, Sams M (2005) Audio-visual speech perception is special. Cognition 96(1):B13–B22
Van Atteveldt NM, Formisano E, Goebel R, Blomert L (2007) Top-down task effects overrule automatic multisensory responses to letter-sound pairs in auditory association cortex. Neuroimage 36(4):1345–1360
Vroomen J, Driver J, de Gelder B (2001a) Is cross-modal integration of emotional expressions independent of attentional resources? Cogn Affect Behav Neurosci 1:382–387
Vroomen J, Bertelson P, de Gelder B (2001b) The ventriloquist effect does not depend on the direction of automatic visual attention. Percept Psychophys 63:651–659
Wickens CD (1984) Processing resources in attention. In: Parasuraman R, Davies DR (eds) Varieties of attention. Academic Press, Orlando, pp 63–101