Vision Research 47 (2007) 1075–1093
www.elsevier.com/locate/visres

Feature-based processing of audio-visual synchrony perception revealed by random pulse trains

Waka Fujisaki 1, Shin'ya Nishida *

NTT Communication Science Laboratories, NTT Corporation, Atsugi, Kanagawa 243-0198, Japan

Received 22 June 2006; received in revised form 6 January 2007

* Corresponding author. Fax: +81 46 240 4716. E-mail address: [email protected] (S. Nishida).
1 Research Fellow of the Japan Society for the Promotion of Science.

© 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.visres.2007.01.021

Abstract

Computationally, audio-visual temporal synchrony detection is analogous to visual motion detection in the sense that both solve the correspondence problem. We examined whether audio-visual synchrony detection is mediated by a mechanism similar to low-level motion sensors, by one similar to a higher-level feature matching process, or by both types of mechanisms as in the case of visual motion detection. We found that audio-visual synchrony–asynchrony discrimination for temporally dense random pulse trains was difficult, whereas motion detection is known to be easy for spatially dense random dot patterns (random dot kinematograms) due to the operation of low-level motion sensors. Subsequent experiments further indicated that the temporal limiting factor of audio-visual synchrony discrimination is the temporal density of salient features, not the temporal frequency of the stimulus, nor the physical density of the stimulus. These results suggest that audio-visual synchrony perception is based solely on a salient feature matching mechanism similar to that proposed for high-level visual motion detection.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Audio-visual; Temporal synchrony; Correspondence problem; Temporal crowding; Saliency based matching

1. General introduction

In our daily lives, we encounter environments where visual signals are often accompanied by concomitant auditory signals arising from the same event. Human observers integrate such an audio-visual signal pair into a coherent percept of a single multi-modal event. Since it is unlikely that audio-visual signals of the same physical cause are far separated in time, it is not surprising that physical temporal proximity (approximate synchrony or simultaneity) is a critical condition for subjective audio-visual integration (Munhall, Gribble, Sacco, & Ward, 1996; Shams, Kamitani, & Shimojo, 2002; Watanabe & Shimojo, 2001). However, previous studies have not fully revealed how the human sensory system detects audio-visual synchrony. Specifically, it remains an open problem as to which level
of sensory processing is involved and what sort of representations and algorithms are used for temporal matching (Marr, 1982). Several lines of evidence argue against a simple view that sensory modalities are separate modules that interact with each other only at post-sensory processing levels (Shimojo & Shams, 2001; Spence & Driver, 2004). Neurophysiological studies have shown the existence of multisensory neurons in the superior colliculus and polysensory cortex, as well as the existence of cross-modal interactions even in primary sensory areas (Schroeder & Foxe, 2005; Stein & Meredith, 1993). It has also been shown that early components of event-related potentials could be influenced by redundant audio-visual information (Lebib, Papo, de Bode, & Baudonniere, 2003; Musacchia, Sams, Nicol, & Kraus, 2006; van Wassenhove, Grant, & Poeppel, 2005). Behaviorally, it has been suggested that the ventriloquist effect, an illusory visual capture of the spatial location of an auditory signal, occurs at early pre-attentive levels, since it does not depend on the direction of automatic or


deliberate visual attention (Bertelson, Vroomen, de Gelder, & Driver, 2000; Vroomen, Bertelson, & de Gelder, 2001), and can modulate the location of auditory attention (Spence & Driver, 2000). Other phenomena that could be interpreted as suggesting early binding of audio-visual signals include the enhanced audibility/visibility of coupled audio-visual signals (Odgaard, Arieh, & Marks, 2004; Sheth & Shimojo, 2004; Stein, London, Wilkinson, & Price, 1996),2 perceptual integration of visual and auditory motion signals (Meyer, Wuerger, Rohrbein, & Zetzsche, 2005; Soto-Faraco, Spence, & Kingstone, 2005),3 visual modulation of auditory perception (McGurk & MacDonald, 1976; Soto-Faraco, Navarra, & Alsius, 2004), and auditory modulation of visual perception (Gebhard & Mowbray, 1959; Recanzone, 2003; Sekuler, Sekuler, & Lau, 1997; Shimojo & Shams, 2001; Shipley, 1964). It is possible to interpret these findings (audio-visual interactions in anatomically peripheral brain areas, temporally fast responses, or preattentive sensory processes) to indicate that at least some audio-visual interactions reside at relatively early processing levels. However, any argument about the level of processing is likely to raise controversy unless there is a conceptual clarification of the potential mechanisms for each level. In examining the level of processing for audio-visual synchrony detection, our psychophysical study was intended to investigate functional levels, which may or may not correspond to anatomical hierarchies. As a conceptual framework, we conceived a concrete hypothesis about potential low- and high-level functional mechanisms for audio-visual synchrony detection by referring to the mechanisms of a similar, and more extensively studied problem — visual motion detection. Computationally, visual motion detection is analogous to audio-visual temporal synchrony detection in the sense that both solve the correspondence problem (Marr, 1982). That is, while the task of audiovisual synchrony detection is to find correspondence between signals from different modalities on the basis of temporal proximity, the task of visual motion detection is to find correspondence between visual signals on the basis of spatiotemporal proximity (Dawson, 1991; Ullman, 1979). Although the same problem is shared with other perceptual processes including binocular stereopsis and binaural sound localization (Banks, Burr, & Morrone, 2006), a merit of comparison with motion detection is that we have good models for low- and high-level motion processing.4 The extensive study of visual motion detection has so far revealed the existence of at least two types of detection 2

Footnote 2: Some effects, however, might be explained by a response bias change (Odgaard, Arieh, & Marks, 2003).
Footnote 3: Audio-visual perceptual integration is not always supported (Alais & Burr, 2004).
Footnote 4: Binocular stereopsis is also known to involve multiple mechanisms (Julesz, 1971; Liu, Stevenson, & Schor, 1994; Ramachandran, Rao, & Vidyasagar, 1973; Wilcox & Hess, 1997), but it is open as to whether it includes a high-level feature matching mechanism as proposed for motion processing (Cavanagh, 1991, 1992; Lu & Sperling, 1995a, 1995b, 2001).

mechanisms (e.g., Braddick, 1974; Cavanagh & Mather, 1989; Chubb & Sperling, 1988; Lu & Sperling, 2001; Nishida & Ashida, 2001; Nishida, Ledgeway, & Edwards, 1997; Nishida & Sato, 1992; Nishida & Sato, 1995). One exploits low-level specialized sensors that compute motion directly from raw sensory signals. Braddick (1974) introduced the notion of low-level motion sensors under the name of the short-range process to account for his finding that a random dot kinematogram is correctly perceived only with short displacements. Nowadays, this low level motion detecting mechanism is more often called the first-order motion sensor, since later studies showed that it is not characterized by the operating spatial range, but by the type of input signals (a first-order spatial property, luminance) (Cavanagh & Mather, 1989; Chang & Julesz, 1983; Chubb & Sperling, 1988). The computation of this mechanism is considered to be a cross-correlation of spatiotemporally separate luminance signals (Reichardt, 1961) with peripheral spatiotemporal bandpass filters (van Santen & Sperling, 1985), or nearly mathematically equivalent computation of spatially local motion energy within a given band of spatiotemporal frequency (Adelson & Bergen, 1985; Watson & Ahumada, 1985). The use of raw sensory signals by this mechanism is suggested by the finding that the motion sensors are most sensitive within the whole visual system when the stimuli are low-spatial-frequency and high-temporalfrequency luminance modulations (Watson & Ahumada, 1985; Watson & Robson, 1981). The visual system may also include low-level motion sensors specialized for detecting movements of second-order spatial or temporal properties, such as contrast modulation and flicker modulation (Cavanagh & Mather, 1989; Chubb & Sperling, 1988; Lu & Sperling, 1995b; Nishida et al., 1997). These secondorder motion sensors are suggested to have a structure similar to the first-order motion sensor except for non-linear preprocessing (Chubb & Sperling, 1988). In addition to these low-level motion sensors, the visual system has a high-level motion mechanism, which has been called the long-range motion process (Braddick, 1974), attentive tracking (Cavanagh, 1991, 1992), or the thirdorder motion mechanism (Lu & Sperling, 1995a, 1995b, 2001). The existence of this mechanism was inferred from motion perceptions that cannot be detected by first-order motion sensors, or by second-order motion sensors (Cavanagh, 1991; Lu & Sperling, 1995a). A representative stimulus is the inter-attribute apparent motion, in which the first element distinguished from the background in an arbitrary stimulus dimension (e.g., luminance, color, texture, depth, motion) is perceived to move to the second element defined by another dimension (Cavanagh, Arguin, & von Gru¨nau, 1989; Lu & Sperling, 1995a). Lu and Sperling (1995a, 1995b, 2001) propose that this high-level motion computation uses feature-independent, common representation, which they called stimulus ‘‘salience’’, as input of a motion detector (a spatiotemporal comparator similar to those for low-level motion sensors). They used the term salience to describe the assumed neural process that underlies the


perception of figure-ground. Whereas figure-ground is intrinsically a binary variable, salience is a continuous variable, with larger values more likely to be perceived as figure and smaller values more likely to be perceived as ground (Lu & Sperling, 2001). The magnitude of salience depends on the stimulus distinctiveness from the background and is increased by attention directed to the figure. By introducing common feature-independent representations, their theory can account for the inter-attribute motion without assuming numerous detectors specialized for various combinations of features. The theory is also consistent with other properties of the high-level motion perception, such as low temporal limits (3–6 Hz) and strong influence of attention. The way to compute a salience map from an input image remains an open question, but several promising suggestions have been made on this issue (Itti & Koch, 2001; Li, 2002). Introducing this dichotomy to the realm of audio-visual processing, we frame the question as whether audio-visual synchrony is detected by functionally similar low-level sensors, by a functionally similar higher-level feature matching process, or by both of them as in the case of visual motion. We assume that a low-level sensor would be a mechanism that computes the cross-correlation of audiovisual raw sensory signals. This is a fast process that properly operates even when stimulus changes are rapid, and preprocessing, if it exists at all, is a simple temporal filtering. Different sensors are prepared for different combinations of visual and auditory attributes. Selection of matching features is automatic, and affected little by attention. On the other hand, a high-level process would be a mechanism that computes the cross-correlation of salient feature sequences transformed from audio and visual sensory signals. The process to extract and compare salient features is relatively slow. A common mechanism could operate for different combinations of visual and auditory attributes. Selection of matching features is strongly affected by attention. Within this framework, our previous attempts to identify the mechanism of audio-visual synchrony detection (Fujisaki, Koene, Arnold, Johnston, & Nishida, 2006; Fujisaki & Nishida, 2005b) could be interpreted to support the salient-feature-matching hypothesis. We found that the temporal-frequency limit of audio-visual temporal synchrony perception for repetitive stimuli was as low as 4 Hz (Fujisaki & Nishida, 2005b). Beyond this limit, although observers could clearly perceive temporal modulations of audio and visual signals, they could not tell whether they were in synchrony or not (a temporal crowding effect). Another study showed that the search of a visual target changing in synchrony with an auditory signal was serial (Fujisaki et al., 2006). This suggests that attentional selection of the location of visual event should precede audio-visual synchrony detection. It should also be mentioned that audio-visual synchrony detection is similar to the inter-attribute apparent motion in that both have to bind heterogeneous signals.
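To make the contrast concrete, the class of computation attributed here to a hypothetical low-level audio-visual sensor — cross-correlation of the two raw signals after simple temporal filtering — can be sketched in a few lines of Python. This is only an illustration of the kind of mechanism under discussion, not a model proposed by the authors; the exponential smoothing kernel, its time constant, the sampling rate and the lag range are illustrative assumptions, not values taken from this paper.

import numpy as np

def lowlevel_sync_evidence(visual, auditory, fs=160.0, tau=0.03):
    # Hypothetical low-level audio-visual synchrony sensor:
    # temporally smooth the raw signals, then cross-correlate them over lags.
    # fs (Hz) and tau (s) are illustrative assumptions.
    t = np.arange(0.0, 5.0 * tau, 1.0 / fs)
    kernel = np.exp(-t / tau)
    kernel /= kernel.sum()
    v = np.convolve(visual, kernel, mode="same")
    a = np.convolve(auditory, kernel, mode="same")
    lags = np.arange(-40, 41)                         # lag range in samples
    xcorr = np.array([np.dot(v, np.roll(a, k)) for k in lags])
    return lags / fs, xcorr                           # lags in seconds, evidence per lag

The salient-feature matching alternative discussed in the text would instead operate on a sparse sequence of extracted salient events rather than on the raw (filtered) waveforms.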


The purpose of the present study was to obtain further support for the salient-feature matching hypothesis for audio-visual synchrony perception, using the experimental paradigms established in studies of visual motion. The ability to perceive clear motion in spatially dense random dot patterns, which contain no salient features, has been regarded as evidence of the existence of low-level motion sensors (Braddick, 1974; Braddick, 1980).5 The high-level feature-matching mechanism may contribute to random dot kinematogram, but only when the stimulus density is sparse (Sato, 1999). Here we tested the existence of low-level audio-visual synchrony detectors using a temporally dense random pulse train as a temporal analogue of a random dot pattern. If there are low-level sensors, audiovisual temporal synchrony judgments will be possible even when the stimulus density is high. Since the random pulse stimulus contains a broad range of temporal frequency modulations, low-level sensors will be activated regardless of the sensors’ temporal frequency tunings. Contrary to this prediction, we found a severe impairment of audiovisual synchrony judgment for high-density random pulse trains. As the pulse density was decreased, audio-visual synchrony perception gradually improved. These findings argue against the presence of low-level sensors. On the other hand, the observed effect of pulse density is consistent with a hypothesis that there is only a salient-feature matching mechanism, since some salient temporal features, such as a brief pause and a distinctive pulse chunk, are obviously less evident in dense stimuli than in sparse stimuli. Subsequent experiments further showed that the critical parameter is the temporal density of salient features, not the temporal frequency of the stimulus nor the physical density of the stimulus. The results of this study, together with the results of our prior reports (Fujisaki, Shimojo, Kashino, & Nishida, 2004; Fujisaki & Nishida, 2005b; Fujisaki et al., 2006), suggest that audio-visual synchrony detection is based on a salient-feature-matching mechanism that uses representations and algorithms similar to those used in high-level visual motion detection. To evaluate the perceptual accuracy of audio-visual synchrony detection, we measured the participants’ performance in discriminating an asynchronous audio-visual stimulus from a physically synchronous one. The results give an objective measure of the accuracy of audio-visual lag judgment, which characterizes the perceptual mechanism that underlies the perception of audio-visual synchrony. An alternative subjective method, i.e., measuring the probability of reporting apparent synchrony as a function of the audio-visual time lag (Dixon & Spitz, 1980), could severely suffer from a variation in the participants’ criterion of ‘‘simultaneity’’. When a generous criterion is applied, the participants would judge different time lags as belonging to a ‘‘synchrony’’ window even though they 5

Footnote 5: This argument followed an idea of considering the perception of a dense random dot stereogram as evidence of the existence of a low-level binocular stereo mechanism (Julesz, 1971).


could discriminate one lag from another. This is a serious problem for rapid stimuli, since the perception of audio-visual synchrony may be a result of an illusion known as auditory driving (Shipley, 1964), in which visual events always appear to occur in synchrony with the auditory events even when audio-visual signals have different temporal frequencies. To ascertain the relationship between the two measures, we compared the results obtained with the two methods in a control experiment.

A part of this study was presented at the 6th Annual Meeting of the International Multisensory Research Forum, Rovereto, Italy (Fujisaki & Nishida, 2005a), and the European Conference on Visual Perception 2005, A Coruña, Spain (Nishida & Fujisaki, 2005).

2. Experiment 1: Audio-visual synchrony–asynchrony discrimination for dense random pulse trains

2.1. Introduction

We measured audio-visual synchrony–asynchrony discrimination performance for dense random pulse trains to see whether audio-visual matching is possible for what we considered to be a temporal analogue of a random dot kinematogram.

2.2. Methods

2.2.1. Participants

Participants were the two authors and three paid volunteers who were unaware of the purpose of the experiments. All the participants, including those who took part in the subsequent experiments, had normal or corrected-to-normal vision and hearing. Informed consent was obtained before the experiment started.

2.2.2. Apparatus

The apparatus of this and the subsequent experiments was identical to that used in our previous reports (Fujisaki et al., 2006; Fujisaki & Nishida, 2005b). In brief, visual stimuli were presented with a VSG2/5 Visual Stimulus Generator (Cambridge Research Systems), and auditory stimuli were presented with a TDT Basic Psychoacoustic Workstation (Tucker–Davis Technologies). In a quiet dark room, the participant sat 57 cm from a monitor (SONY GDM-F500, frame rate: 160 Hz) while wearing headphones (Sennheiser HDA 200).

2.2.3. Stimulus

The visual stimulus was a luminance-modulated Gaussian blob (standard deviation: 2.0 deg) presented at the center of the monitor screen (see Fig. 1). A large Gaussian blob was used since the visual response is known to be rapid for low-spatial-frequency luminance modulations (Kelly, 1979). The background was a 21.5 cd/m2 uniform field subtending 38.7 deg in width and 29.5 deg in height. The luminance increment of the blob peak was temporally

modulated between 0 and 43 cd/m2. Nothing was visible during the off period. The fixation marker was a bullseye presented before stimulus presentation at the center of the monitor screen. Participants were instructed to view the visual stimulus while fixating this location. The auditory stimulus was a 100% amplitude-modulated white noise with a sound pressure level of about 54 dB SPL at the peak of modulation. It was presented diotically via headphones with a sampling frequency of 24420 Hz. The audio-visual stimuli were modulated by random pulse trains. At the highest pulse density (80/s), the pulse timing was prescribed by the refresh rate of the monitor (160 Hz): for each monitor refresh, a pulse appeared randomly with 50% probability. A visual pulse was a single flash, and an auditory pulse was a pip that lasted for 6.25 ms (equivalent to the period of 1 display frame). The stimulus density was fixed at 80/s in Experiment 1, but it was manipulated by changing the probability of stimulus appearance in subsequent experiments. This resulted in the stimulus having a flat temporal-frequency power spectrum below 80 Hz regardless of the pulse density. An audio-visual random pulse train was physically synchronous, or asynchronous with a given audio-visual delay. In either case, the stimulus lasted 2 s, with onset and offset being physically aligned between audio and visual stimuli. A new random sequence was generated for each stimulus. For synchronous stimuli, we used an identical random pulse sequence for the two modalities. For asynchronous stimuli, we applied time-shift and wrap-around operations to the sequence of one modality (Footnote 6: Pv(t) = Pa(t − d + 2000) when t − d ≤ 0; Pv(t) = Pa(t − d) when 0 < t − d ≤ 2000; and Pv(t) = Pa(t − d − 2000) when t − d > 2000, where Pv is the visual pulse, Pa the auditory pulse, t the time variable in ms, and d the delay constant in ms).

Fig. 1. Standard stimulus used in this study. (a) Temporal profiles of audio-visual stimuli. Both stimuli were modulated by the same random pulse trains, being in synchrony or in asynchrony. Dotted arrows indicate corresponding pulse pairs. (b) Spatial configuration of the visual display (Gaussian blob). (c) An example of the auditory stimulus (amplitude-modulated white noise). (d) Temporal cross-correlation of synchronous visual and auditory signals.
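As a minimal sketch (not the authors' actual stimulus code), the pulse trains and their time-shifted, wrapped-around counterparts described above could be generated as follows. The 160-Hz frame grid and 2-s duration follow the Methods; the random seed and the choice of which modality carries the shift are arbitrary here.

import numpy as np

REFRESH_HZ = 160      # monitor refresh rate (Methods)
DURATION_S = 2.0      # stimulus duration (Methods)

def random_pulse_train(density_per_s, rng):
    # A pulse occurs on each video frame with probability density / refresh rate
    # (0.5 at the highest density of 80/s).
    n_frames = int(DURATION_S * REFRESH_HZ)
    return (rng.random(n_frames) < density_per_s / REFRESH_HZ).astype(int)

def shifted_train(train, delay_ms):
    # Time shift with wrap-around over the 2-s sequence (cf. footnote 6).
    return np.roll(train, int(round(delay_ms * REFRESH_HZ / 1000.0)))

rng = np.random.default_rng(0)                 # arbitrary seed
visual = random_pulse_train(80, rng)           # dense train used in Experiment 1
auditory_sync = visual.copy()                  # synchronous pair: identical sequence
auditory_async = shifted_train(visual, 250)    # asynchronous pair: 250-ms wrapped shift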


2.2.4. Procedures

The percent correct for discriminating synchrony–asynchrony for a given stimulus condition was measured using a single-interval binary response task with feedback. Before the start of the experiment, the participants were given a detailed verbal description of the task. In a trial, about 2 s after the participant's last response, the fixation marker was removed, and either a physically synchronous audio-visual stimulus or an asynchronous one with a given audio-visual delay, which was fixed within a given block, was presented. The participant had to make a two-alternative forced response (synchronous/asynchronous) by pressing a VSG response box key. Feedback was provided after each response by changing the color of the fixation marker, where blue and red indicated that the stimulus of the trial was 'synchronous' and 'asynchronous', respectively. Within a block of trials, we fixed all the stimulus parameters other than the difference between synchronous and asynchronous trials described above. Each block consisted of 20 trials plus four initial practice trials. During the practice trials, synchronous and asynchronous stimuli were presented in turn. In the main trials, 10 trials for the synchronous stimulus and 10 trials for the asynchronous stimulus were randomly ordered. The pulse density was fixed at the highest value (80/s), and the absolute audio-visual lag of the asynchronous stimuli was varied between blocks. It ranged from −500 to +500 ms in 18 steps (nine for each sign), where a negative lag indicates that the auditory stream preceded the visual one. Each session consisted of four or five blocks for different lag values. Each participant ran at least 2 blocks (40 trials) for a given signed lag condition, based on which the proportion correct of synchrony discrimination was computed. The order of data collection for different stimulus conditions was counterbalanced in such a way as to minimize possible effects of order. The purpose of the feedback was to maximize the discrimination performance by excluding a type of error whereby the participants could discriminate the two lag conditions but not correctly label the physically 'synchronous' pair as 'synchronous'. Note also that a subsequent experiment (Experiment 5) suggests that the participants indeed had a subjective impression of synchrony when the stimulus was physically synchronous.

2.3. Results

Fig. 2. Results of Experiment 1. Proportion correct of discriminating synchrony and asynchrony of dense (80/s) random pulse trains is plotted as a function of the time lags between audio-visual signals. Each data point represents the average for five participants, with the error bar indicating the standard error. Also plotted is the synchrony–asynchrony discrimination performance measured for the same participants using a pair of a single audio pulse and a single visual pulse (Experiment 1 of Fujisaki & Nishida, 2005b).

Fig. 2 shows the proportion correct averaged over the five participants, plotted as a function of the time lags


between audio-visual signals. The data for audition-first and vision-first conditions were averaged, since there was no significant difference between them. Plotted also is the synchrony–asynchrony discrimination performance measured for the same participants using a pair comprising a single audio pulse and a single visual pulse (from Fujisaki & Nishida, 2005b, Experiment 1). The results indicate that audio-visual synchrony–asynchrony discrimination with dense random pulse trains was almost impossible, even when the audio-visual time lags were large enough for discrimination with single pulses. A one-way analysis of variance for normalized proportion correct scores, ln[p/(1 − p)] (Tukey, 1977), indicated that the effect of time lag was not significant for dense random pulse trains (F(8, 32) = .88, p = .55), but was significant for the single pulse (F(8, 32) = 33.96, p < .01) (Footnote 7: In this and the following statistical analyses, the results did not differ significantly when raw proportion correct scores were analyzed.).
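For reference, the normalization applied to proportion-correct scores before these ANOVAs is a simple logit transform; the finite-sample correction in the sketch below is our own assumption (added only so that proportions of 0 or 1 remain finite) and is not specified in the paper.

import numpy as np

def normalized_score(p, n_trials=40):
    # ln[p / (1 - p)] applied to a proportion-correct score before the ANOVA.
    # The +0.5 / +1 correction is an assumption to keep the transform finite.
    p_adj = (p * n_trials + 0.5) / (n_trials + 1.0)
    return np.log(p_adj / (1.0 - p_adj))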

3. Experiment 2: Audio-visual synchrony–asynchrony discrimination for random pulse trains of various densities

3.1. Introduction

To characterize the effect of pulse density on audio-visual synchrony–asynchrony discrimination, we systematically varied the pulse density.

3.2. Methods

The methods were identical to those in Experiment 1 except for the following aspects. Participants were the two authors and three original and two new paid volunteers (Footnote 8: The purpose of recruiting new participants was to compare the results of Experiment 2 with those of Experiments 5 and 6, in which only one original volunteer could participate.). The pulse densities were varied from 80/s to 5/s in nine steps. The audio-visual time lag was fixed at 0 ms for synchronous stimuli and at 250 ms (audition first) for asynchronous stimuli. The results of Experiment 1 (and those of Experiment 5, see below) suggest that with such a large lag, the lag and its direction would have little effect on the performance of synchrony–asynchrony discrimination.

3.3. Results

Fig. 3a shows the results. While the audio-visual synchrony–asynchrony discrimination performance was nearly at chance level for the density of 80/s, it gradually improved as the pulse density was lowered, reaching >90% for densities of 13.3/s or lower. A one-way analysis of variance for normalized scores indicated that the effect of stimulus density was significant (F(8, 48) = 31.83, p < .01).

Fig. 3. (a) Results of Experiment 2. Proportion correct of synchrony discrimination of audio-visual random pulse trains is plotted as a function of the pulse density. Each data point represents the average for seven participants, with the error bar indicating the standard error. (b) Results of Experiment 5, in which audio-visual stimuli were co-localized by speaker presentation. Each data point represents the average for five participants. Also plotted are the data for the same participants in Experiment 2, where audio stimuli were presented through headphones.

4. Experiment 3: Vision–vision or audio–audio synchrony–asynchrony discrimination for dense random pulse trains

4.1. Introduction

One might suspect that the observed difficulty in audio-visual synchrony discrimination for dense random pulse trains could be ascribed to temporal limits of peripheral sensory processing. To test this possibility, Experiment 3 measured discrimination performance using within-modal stimuli.


4.2. Methods

The methods were identical to those in Experiment 1 except for the following aspects. There were two conditions: vision–vision (VV) and audition–audition (AA). The pulse density was 80/s. The stimulus for the VV condition was a luminance-modulated Gaussian blob that was divided into left and right halves separated by a 0.15° gap. The two half-blobs were modulated by the same random pulse train either synchronously or asynchronously (250 ms delay). The task was to judge whether the left and right halves were synchronous or asynchronous. The stimulus for the AA condition was 100% amplitude-modulated white noises (flutters) presented dichotically via headphones. As in the VV condition, flutters presented in the left and right ears were modulated by the same random pulse train synchronously or with a delay of 250 ms. The carrier noises were uncorrelated between the ears. The task of the five participants (same as Experiment 1) was to judge whether the flutters presented to the left and right ears were synchronous or asynchronous.

4.3. Results

Synchrony–asynchrony discrimination was nearly perfect both for matching between two visual random pulse trains (VV condition, 100% for all participants) and for that between two auditory random pulse trains (AA condition, 99%, average of five participants). This implies that the temporal resolutions and phase accuracies of peripheral sensory signals are high enough for some within-modal mechanisms to make nearly perfect synchrony discrimination at the temporal density of 80/s. Under the conditions we used, within-modal synchrony–asynchrony detections were likely to be subserved by low-level sensors, such as first-order motion detectors for vision and interaural time/level difference detectors for audition. The temporal limit would be lowered if measured under conditions where those mechanisms do not work effectively. It is known, for instance, that visual temporal phase discrimination is worsened as the inter-element spatial separation is increased (Forte, Hogben, & Ross, 1999; Victor & Conte, 2002), and when the comparison is made between different attributes (Arnold, 2005; Holcombe & Cavanagh, 2001). In these cases, however, the observed temporal limit is likely to reflect the characteristics of the comparison mechanism involved. We conjecture that within-modal synchrony judgments could be as bad as cross-modal judgments when similar feature-matching mechanisms are used (Fujisaki & Nishida, 2005b). To minimize the contribution of the comparison processes to the estimation of the temporal resolution of sensory signals, we used stimulus conditions that were expected to activate the fastest comparison mechanisms. This is because the temporal resolution of the visual/auditory temporal responses feeding into the comparison mechanism should be equal to or better than the obtained temporal-frequency thresholds.


5. Experiment 4: Audio-visual synchrony–asynchrony discrimination between sparse and dense random pulse trains 5.1. Introduction The results of Experiment 3 suggest that poor audiovisual synchrony discrimination for high-density random pulse trains does not reflect the temporal properties of within-modal mechanisms. However, given that both visual and auditory systems consist of multiple processes, one could argue that the results of Experiment 3 might not reveal the temporal properties of within-modal signals directly used for audio-visual comparison. Experiment 4 addressed the same problem from a slightly different perspective. We measured the performance of audio-visual synchrony discrimination as a function of the pulse density of one modality, while keeping the other modality stimulus at a sparse density. Let us assume a significant difference in temporal resolution between two modalities (Shimojo & Shams, 2001; Welch & Warren, 1986). If the poor audiovisual synchrony discrimination for high-density random pulse trains reflects the temporal properties of withinmodal mechanisms, the critical parameter would be the density of the modality of lower temporal limitation (vision), and the synchrony discrimination would be affected more by the density change of that modality than by the density change of the other modality. We tested whether such an asymmetry was observed. 5.2. Methods The pulse density for one modality was changed from 80/s to 5/s in 5 steps with the density of the other modality fixed at the sparsest value (5/s). There were three conditions (Fig. 4a): audition fixed (A fixed), vision fixed (V fixed), and both modalities changed (A–V changed). In the V-fixed, or A-fixed condition, participants were instructed to judge whether the pulse components of the lower-density modality had synchronized pairs in the pulse train of the higher-density modality. The audio-visual time lag was 250 ms for asynchronous stimuli. The participants and other methods were identical to those used in Experiment 1. 5.3. Results Fig. 4 shows the results. A two-way analysis of variance for normalized scores indicated that the effect of target density (F(4, 16) = 110.32, p < .01), the effect of modality condition (A fixed, V fixed, A–V changed) (F(2, 8) = 22.30, p < .01), and their interaction (F(8, 32) = 5.48; p < .01) were all significant. For all the modality conditions, the main effect of target density was significant (A fixed: F(4, 16) = 66.42; p < .01; V fixed: F(4, 16) = 68.72; p < .01; A–V changed: F(4, 16) = 32.51; p < .01). For two density conditions (10 and 20/s), the main effect of modality condition was significant (10/s: F(2, 8) = 18.98; p < .01) as a

Fig. 4. Stimuli and results of Experiment 4. (a) Temporal profiles of asymmetric density stimuli. The pulse density for one modality was changed from 80/s to 5/s in five steps while the density of the other modality was fixed at the sparsest value (5/s). There were three conditions: audition fixed (A fixed), vision fixed (V fixed), and both modalities changed (A–V changed). (b) Proportion correct of synchrony discrimination of the asymmetric density stimuli is plotted as a function of the pulse density of one modality.

result of significant differences between the A–V changed condition and the other two (Tukey HSD: A-fixed vs. A– V changed: p < .01; V-fixed vs. A–V changed: p < .01; 20/ s: F(2, 8) = 8.98; p < .01; Tukey HSD: A fixed vs. A–V changed: p < .05; V-fixed vs. A–V changed: p < .01). The synchrony–asynchrony discrimination performance gradually decreased as the stimulus density of one modality was increased, with little difference between the A-fixed and V-fixed conditions. The lack of asymmetry in the effects of audio and visual densities suggests that the audio-visual density limit reflects the property of the cross-modal process, rather than that of a within-modal process of slower modality. It remains an open question however as to what sorts of signal transformation make the roles of audio and visual signals very similar in synchrony judgments. It is also worth mentioning that in Fig. 4a, the performance obtained with these asymmetric density stimuli was lower than that obtained with the symmetric density stimuli, even though the density of one modality was higher for the latter case. This may be because extra pulse trains were included only in the denser modality signals. Although these trains did not alter the peak location of a cross-correlation function, they reduced the signal to noise

ratio, by increasing false matching in asynchronous stimuli, and by masking signal pulse in synchronous stimuli. The results suggest that the audio-visual synchrony detection is not robust against such noises. It was also shown that presenting different number of stimuli in each modality reduces the magnitude of a cross-modal interaction (Sanabria, Soto-Faraco, Chan, & Spence, 2004). 6. Experiment 5: The effect of spatial co-localization 6.1. Introduction In most of the experiments reported in this paper, we presented auditory stimuli through headphones in order to present intended waveforms to the participants’ ears with timings that were as precise as possible. With this method, however, spatial positions were not physically co-localized for the two modalities. This could be a potential problem, since spatial co-localization may play an important role in audio-visual integration (Corneil, Van Wanrooij, Munoz, & Van Opstal, 2002; Meredith & Stein, 1986; Meyer et al., 2005), and in synchrony judgments (Spence & Squire, 2003; Zampini, Shore, & Spence, 2003, 2005).


6.2. Methods We replicated Experiment 2 (symmetrical density change), with auditory stimuli presented from a speaker placed immediately below the visual stimuli. Participants were the two authors and three (one original and two new) paid volunteers, all of whom participated in Experiment 2. 6.3. Results As shown by Fig. 3b, the results obtained with the speaker presentation were very similar to those obtained for the same participants with headphone presentation in Experiment 2. A two-way analysis of variance indicated that the effect of stimulus density was significant (F(8, 32) = 79.15, p < .01), but that the effect of the auditory presentation method (F(1, 4) = .44, p = .54) and the interaction between the density and the presentation method (F(8, 32) = .37, p = .93) were not significant. This indicates that audio-visual mismatch in spatial location, at least that caused by headphone presentation, did not significantly modulate the effect of pulse density on audio-visual synchrony discrimination. We also found no significant difference between headphone and speaker presentations for synchrony discrimination of repetitive stimuli (Fujisaki & Nishida, 2005b). It should also be noted that audio-visual interactions, such as illusory flashes or auditory driving, occur even when stimuli are not spatially co-localized (e.g., Recanzone, 2003; Shams, Kamitani, & Shimojo, 2000; Shipley, 1964). 7. Experiment 6: Audio-visual subjective synchrony judgments

7.1. Introduction

As noted at the end of Section 1, we measured the objective performance of synchrony–asynchrony discrimination to reveal the basic perceptual mechanism underlying the subjective perception of audio-visual synchrony. However, our participants could perform the discrimination task even if they had no subjective impression of synchrony for physically synchronous stimuli, since they knew the direction of the asynchrony of that block beforehand and were given feedback for each trial. To address the concern that the discrimination performance we measured might have little to do with subjective synchrony perception, we examined audio-visual subjective simultaneity for random pulse trains.

7.2. Methods

In each trial, an audio-visual random pulse train, the same as that used in the other discrimination experiments, was presented with a given audio-visual delay. Five participants (same as Experiment 5) made a yes–no judgment about whether the stimulus appeared to be in synchrony. No feedback was given. Within the same block, 11 delay conditions, ranging from −250 to +250 ms (−250, −200, −150, −100, −50, 0, 50, 100, 150, 200, 250 ms), appeared 15 times each in random order. Each participant ran two blocks each for three density conditions: 5, 20 and 80/s.

7.3. Results

In Fig. 5, the proportion of synchrony responses is plotted as a function of audio-visual time lag. The basic pattern of the results was common to all participants. When there was no audio-visual lag, the proportion of synchrony responses was high. This suggests that our participants were able to perceive physically synchronous stimuli as synchronous. As the audio-visual lag was increased in either direction, the proportion of synchrony responses declined. The temporal tuning was relatively sharp for the lowest density (5/s), but broadened as the density was increased, until being nearly flat for the highest density (80/s). This change in tuning width is consistent with the effect of stimulus density on synchrony–asynchrony discrimination. For the density of 80/s, synchrony responses were frequently observed even when there were large lags. This is reminiscent of the auditory driving effect (Shipley, 1964). Individual variation in the overall level of the proportion of synchrony responses may be due to differences in the response criterion, which we did not control in this experiment. To conclude, the results of Experiment 6 support our assumption that the perceptual mechanism underlying the subjective perception of audio-visual synchrony can be analyzed from the objective performance of synchrony–asynchrony discrimination.

8. Experiment 7: Audio-visual synchrony–asynchrony discrimination for high-cut random pulse trains

8.1. Introduction

In the case of visual motion perception, it is known that motion detection is not difficult for spatially dense random dot kinematograms even when the stimuli do not contain trackable salient features (Braddick, 1974; Julesz, 1971). Motion studies ascribe this perceptual ability to the operation of low-level motion sensors (Braddick, 1974, 1980; Nakayama, 1985). Motion studies also consider that when the dot density of a random dot kinematogram is reduced and local salient features are made trackable, a higher-level feature-matching mechanism, in addition to low-level sensors, contributes to motion detection (Sato, 1999). In this framework, the difficulty of detecting audio-visual synchrony with the analogous dense random pulse trains can be interpreted as evidence against the existence of low-level sensors for audio-visual synchrony detection. Given that individuation of single pulses, pulse chunks, or brief pauses becomes easier as the pulse density is reduced, the effect of pulse density is consistent with the idea that audio-visual subjective temporal synchrony is established by a higher-level salient feature-matching mechanism.

Fig. 5. Results of Experiment 6. Proportion of synchrony responses is plotted as a function of audio-visual time lag separately for five participants and for their mean. Different symbols represent the results obtained with different pulse densities. Open symbols plotted at 250-ms lag represent the proportion correct of synchrony–asynchrony discrimination for each pulse density.

Our prior study using repetitive pulse trains showed that audio-visual synchrony–asynchrony discrimination deteriorated at temporal frequencies beyond 4 Hz (Fujisaki & Nishida, 2005b). This is consistent with the present finding of a failure of audio-visual synchrony–asynchrony discrimination at high temporal densities. However, it should be noted that the implications of the two findings are significantly different.

That is, on the basis of only the low temporal-frequency limit (Fujisaki & Nishida, 2005b), one cannot reject the hypothesis that the audio-visual subjective temporal synchrony is detected by a low-level sensor tuned to low temporal frequencies, a mechanism analogous to low-level motion sensors tuned to low spatial frequencies. On the other hand, the random pulse stimuli used in this study


had a broadband frequency spectrum. The temporal density affects the total stimulus power, but it does not affect the flat shape of the power spectrum. Therefore, the present finding indicates that it is the stimulus temporal density, not the temporal frequency, that limits the performance of audio-visual synchrony perception, and thus provides clear evidence against the low-pass low-level sensor hypothesis. This conclusion is also supported by our previous findings that audio-visual synchrony judgment is harder for sine waves than for repetitive pulse trains, and that low-frequency-cut filtering does not impair the audio-visual temporal synchrony judgment for single pulse stimuli (Fujisaki & Nishida, 2005b). However, there remains a possible counter argument from the low-pass low-level sensor hypothesis. That is, although dense random pulse trains contained broadband frequency components, the power of the low-frequency components might not have been strong enough to support synchrony detection. To check this possibility, we applied high-cut (low-pass) filters to the dense (80/s) random pulse trains. If synchrony–asynchrony discrimination is possible for high-cut stimuli, we can infer that our stimuli contained sufficient power in low-frequency modulations.

8.2. Methods

High-temporal-frequency components of the audio-visual signals were filtered out. The cut-off frequencies were varied from 1 to 16 Hz in nine steps (1, 1.5, 2, 3, 4, 6, 8, 11.5, and 16 Hz). The pulse modulation was transformed into the frequency domain by the fast Fourier transform (FFT), filtered by a hatbox high-cut filter, then transformed back to the time domain by the inverse FFT. While this procedure allows a non-causal influence of the filtering, it introduces no relative phase shift. No compensation was made for the signal amplitudes across different cut-off-frequency conditions. This resulted in a reduction in total power as the cut-off temporal frequency was lowered. The audio-visual time lag was 250 ms for asynchronous stimuli. The participants and other methods were identical to those used in Experiment 1.
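A minimal sketch of the brick-wall (''hatbox'') high-cut filtering described above is given below. The function name and the choice of sampling grid (here the 160-Hz frame rate of the pulse modulation) are assumptions made for illustration, not details taken from the paper.

import numpy as np

def highcut(signal, cutoff_hz, fs=160.0):
    # FFT the modulation waveform, zero every component above the cut-off,
    # and inverse-FFT back. No amplitude compensation is applied, matching
    # the Methods; using the real FFT keeps the filtered waveform real-valued.
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))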

Fig. 6. Results of Experiment 7. Proportion correct of synchrony discrimination of low-pass filtered audio-visual random pulse trains is plotted as a function of the upper cut-off frequency.

8.3. Results

When the cut-off frequency was high (11.5 and 16 Hz), the stimulus was not very different from the original random pulse train, and synchrony–asynchrony discrimination was hard. The discrimination was also difficult when the cut-off frequency was very low (2 Hz or below). However, when the cut-off frequency was intermediate (3–6 Hz), the discrimination performance was well above the chance level (Fig. 6). A one-way analysis of variance for normalized scores indicated the effect of cut-off frequency was significant (F(8, 32) = 5.86, p < .01). Tukey's HSD test indicated significant differences between 1 and 3 Hz (p < .01), 1 and 4 Hz (p < .01), 1 and 6 Hz (p < .01), 2 and 3 Hz (p < .05), 2 and 4 Hz (p < .05), and 2 and 6 Hz

(p < .05). A paired t-test further indicated that the normalized score was significantly different from zero (i.e., 50%) at cut-off frequencies of 3, 4, 6 and 8 Hz. This suggests that even though the power of the low-frequency components in the dense random pulse trains was strong enough to support synchrony perception (Footnote 9: Our argument would not be affected even if non-linearity in perceptual processing adds high-frequency components to the neural response to our high-cut stimuli, since this would occur only when the power of the low-frequency components is strong.), the original unfiltered random pulse stimuli do not afford audio-visual synchrony discrimination. If there are low-pass synchrony sensors, and they support synchrony discrimination for moderately high-cut random pulse trains, they should also work for unfiltered stimuli. Therefore, the results of Experiment 7 do not support the existence of low-pass sensors. We conjecture that the synchrony–asynchrony discrimination for moderately high-cut random pulse trains is also mediated by a feature matching mechanism. One reason the percentages correct were not very high may be that temporal matching features were not well defined in gradually changing high-cut stimuli. In agreement with this idea, our previous study (Fujisaki & Nishida, 2005b) showed that the synchrony–asynchrony discrimination was significantly harder for sine wave modulations than for repetitive pulse trains.

9. Experiment 8: Audio-visual synchrony–asynchrony discrimination between sparse figure pulses embedded in dense background random pulse trains

9.1. Introduction

The next experiment tested the notion of ''salient'' feature matching by manipulating stimulus saliency by means of color and pitch differences (Bregman & Achim, 1973; D'Zmura, 1991; Kooi, Toet, Tripathy, & Levi, 1994). Distinctive audio-visual stimuli (figure pulse trains) made of red flashes and high-pitch pips were embedded in random pulse trains (background pulses) made of white flashes and low-pitch pips (Fig. 7a). We varied the density of the figure pulses while keeping the total stimulus density constant. We expected that when the figure pulse density was low, the participants would be able to perceive the figure pulse train (or at least a portion of it) as a sequence of salient features by means of bottom-up segmentation and/or top-down attentive selection. If the matching process could selectively use a sparse figure pulse train added to a dense pulse train, synchrony perception would be significantly facilitated even when the total stimulus density was too high to reliably judge audio-visual synchrony.

Fig. 7. Stimuli and results of Experiment 8. (a) Schematic illustration of the stimuli in which sparse distinctive audio-visual pulses (figure) were embedded in dense random pulse trains (background). The visual stimulus was a Gaussian blob whose color was red for target and white for background, respectively. The auditory stimulus was 100% amplitude-modulated pure tones whose pitch was high (2349 Hz, ''D7'' in musical note terminology) for figure and low (523 Hz, ''C5'' in musical note terminology) for background, respectively. The density of the figure pulse trains was varied from 5/s to 80/s in five steps, while the density of the total pulse train (figure + background) was always 80/s. In the control condition, a stimulus containing only figure pulses (consisting of red flashes and high-pitch pips) was presented. (b) Proportion correct of synchrony–asynchrony discrimination for the figure–background mixed stimulus, together with that for the control (figure-only) stimulus, is plotted as a function of the figure pulse density.

9.2. Methods

In synchronous pairs, both figure and background pulse trains were synchronized. The visual stimulus was a Gaussian blob (20 cd/m2 at peak) presented against a dark field (0 cd/m2), whose color was red for target and white for background, respectively. The auditory stimulus was 100% amplitude-modulated pure tones (about 60 dB SPL at the peak of modulation) whose pitch was high (2349 Hz, ''D7'' in musical note terminology) for figure and low (523 Hz, ''C5'' in musical note terminology) for background, respectively. Sound levels were compensated across the different carrier frequencies. The participants knew beforehand which colors should correspond to which pitches in synchronous stimuli. The density of figure pulse trains was varied from 5/s to 80/s in five steps, while the density of



the total pulse train (figure + background) was always 80/s. In the control condition, the stimulus containing only figure pulse trains (consisting of red flashes and high pitch pips) was presented. The audio-visual time lag was 250 ms for asynchronous stimuli. The participants and other methods were identical to those used in Experiment 1. 9.3. Results Fig. 7b shows the percentage of synchrony–asynchrony discrimination for the figure-background mixed stimulus, together with that for the control-figure-only stimulus, as a function of the figure pulse density. A two-way analysis of variance indicated that the effect of stimulus condition (F(1, 4) = 81.79; p < .01), the effect of target pulse density (F(4, 16) = 33.37; p < .01) and their interaction (F(4, 16) = 16.64; p < .01) were significant. The main effect of target pulse density was significant for both stimulus conditions (figure-background mixed: F(4, 16) = 9.83; p < .01; control: F(4, 16) = 34.61; p < .01). The main effect of stimulus condition was significant for lower three densities (5/s: F(1, 4) = 128.16, p < .01; 10/s: F(1, 4) = 21.87, p < .01; 20/ s: F(1, 4) = 18.27, p < .05). Although the performance for the mixed stimulus was at the chance level when the figure pulse density was the highest (the stimulus that consisted only of a dense figure pulse train), it was gradually improved as the figure density was reduced. This is consistent with our hypothesis that audio-visual synchrony detection is not limited by the raw stimulus density but by the density of salient temporal features. However, the performance for the mixed stimulus was not as good as that for the figure-only stimulus. This is probably because some red flashes were sensory-masked by white flashes (Breitmeyer, 1984), because color/pitch difference could not perfectly break the temporal crowding effect under our stimulus condition, and/or because attentional selection based on differences in color and/or in pitch was not perfect for rapid stimulus sequences under our stimulus condition. One might suspect that the same results would be obtained if there are low-level sensors that selectively respond to the combination of a red flash and a high-pitch pip. However, if we assume sensors tuned to the specific combination audio-visual stimuli that we happened to choose, we may also have to assume infinite numbers of synchrony detectors for various combinations of audiovisual stimuli. We rather think it more parsimonious to assume that a small number of mechanisms compare some abstract representations, such as chromatic or pitch singletons, or more feature-invariant ‘‘salience’’, as in the case of inter-attribute apparent motion (Cavanagh et al., 1989; Lu & Sperling, 1995a, 1995b; Lu & Sperling, 2001).10

Footnote 10: It should also be noted that while the detectors underlying chromatic motion perception are still a matter of debate (e.g., Dobkins & Albright, 2004), salient-feature tracking is considered to be one of the main mechanisms (Lu, Lesmes, & Sperling, 1999; Seiffert & Cavanagh, 1999).


10. Experiment 9: Audio-visual synchrony–asynchrony discrimination between repetitive stimuli defined by various features

10.1. Introduction

In the argument of the last paragraph, we assume that audio-visual synchrony is detectable regardless of the features that define the audio and visual stimuli. This feature-independence of audio-visual synchrony perception is suggested not only by our daily experiences, but also by a follow-up experiment of our previous study that measured the upper temporal frequency limit of discriminating synchrony–asynchrony for repetitive audio-visual stimuli (Fujisaki & Nishida, 2005b). Considering the importance of the notion of feature independence for our hypothesis, here we report the results of this follow-up experiment as supplementary data.

10.2. Methods

Participants were the two authors. The apparatus was the same as in the other experiments except that the sound was presented by speakers attached to both sides of the CRT monitor. The waveform of the stimuli was a triangle wave, not a random pulse train (Fig. 8a). A stimulus presentation of a trial lasted for 6 s, including 2-s cosine ramps at the onset and the offset. The temporal frequency was variable between blocks, and the temporal relationship of the audio-visual stimuli for each trial was either in-phase (synchronous) or 180° out-of-phase (asynchronous). As in the other experiments, the participants made a binary response about stimulus synchrony. To prevent judgments based on stimulus onset, we changed the stimulus phase from a random value to the intended angle (0° or 180°) during the 2-s onset ramp. There were three visual stimuli and three auditory stimuli (Fig. 8b).

Visual stimulus 1 (V1): Expansion/contraction of a concentric sinusoidal luminance-modulated grating. The spatial frequency was 1 c/radius, and the contrast was 50%. The radius was changed between 2.26° and 6.78° by a linear-scale triangle wave.

Visual stimulus 2 (V2): Clockwise/counterclockwise rotation of a radial sinusoidal grating. The size was 14.99° in diameter. The spatial frequency was 6 c/360°, and the contrast was 50%. The rotation angle was changed by a linear-scale triangle wave while keeping the maximum angle change speed at 60°/s.

Visual stimulus 3 (V3): Horizontal translation of a rectangle subtending 5.02° in height and 12.56° in width. We gave the rectangle a blurred horizontal luminance profile (a period of a cosine function, peak increment contrast: 100%) to reduce motion aliasing. The horizontal position of the rectangle was changed by a linear-scale triangle function between 19.33° left and right from the center.


Fig. 8. Stimuli of Experiment 9. (a) The waveform of the stimulus was a triangular wave lasting for 6 s, including 2-s cosine ramps at the onset and the offset. (b) An audio-visual stimulus was a combination of one of three visual stimuli and one of three auditory stimuli. The visual stimuli were expansion/contraction of a concentric grating (V1), clockwise/counterclockwise rotation of a radial grating (V2) and leftward/rightward translation of a rectangle with blurred side edges (V3). Auditory stimuli were amplitude modulation (AM) of a triangle wave (A1), frequency modulation (FM) of a pure tone (A2), and leftward/rightward translation of a white noise (A3).
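For concreteness, the following Python sketch constructs the modulation signals of one Experiment 9 trial as just described: triangle-wave modulations lasting 6 s with 2-s raised-cosine onset/offset ramps, in which the relative audio-visual phase drifts from a random value to the intended angle (0° or 180°) during the onset ramp. The function names, sampling rate, and the linear phase blending are our own illustrative assumptions rather than the implementation actually used.

```python
import numpy as np

def triangle(phase):
    """Unit-amplitude triangle wave as a function of phase (radians)."""
    return 2.0 / np.pi * np.arcsin(np.sin(phase))

def av_modulation_pair(freq_hz, out_of_phase=False, duration_s=6.0,
                       ramp_s=2.0, fs=1000.0, rng=None):
    """Sketch of an Experiment-9 trial: visual and auditory modulation signals,
    both triangle waves with 2-s raised-cosine onset/offset ramps. The relative
    audio-visual phase drifts from a random value to the intended angle
    (0 deg = in-phase, 180 deg = out-of-phase) over the onset ramp, so the
    stimulus onset itself is uninformative. The linear phase blending is an
    assumption of this sketch, not the authors' exact implementation.
    """
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(0.0, duration_s, 1.0 / fs)

    # Raised-cosine envelope: rises over the first ramp_s, falls over the last.
    env = np.ones_like(t)
    on, off = t < ramp_s, t > duration_s - ramp_s
    env[on] = 0.5 * (1.0 - np.cos(np.pi * t[on] / ramp_s))
    env[off] = 0.5 * (1.0 - np.cos(np.pi * (duration_s - t[off]) / ramp_s))

    target = np.pi if out_of_phase else 0.0       # intended audio-visual phase
    start = rng.uniform(0.0, 2.0 * np.pi)         # random relative phase at onset
    w = np.clip(t / ramp_s, 0.0, 1.0)             # blend weight over the onset ramp
    rel_phase = (1.0 - w) * start + w * target

    base = 2.0 * np.pi * freq_hz * t
    return env * triangle(base), env * triangle(base + rel_phase)

# Example: a 4-Hz out-of-phase (asynchronous) audio-visual pair.
vis_mod, aud_mod = av_modulation_pair(4.0, out_of_phase=True)
```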

Auditory stimulus 1 (A1): Amplitude modulation (AM) of a 440-Hz triangle wave between 54 dB SPL and 63.5 dB SPL. A linear change in amplitude produced a squared change in power.

Auditory stimulus 2 (A2): Frequency modulation (FM) of a pure tone between 440 and 880 Hz by a log-scale triangle wave. The sound pressure level was 60 dB (A).

Auditory stimulus 3 (A3): Anti-phased AM of white noises from the left and right speakers, simulating a lateral position shift of the sound. The amplitude of the sound from each speaker was modulated by a linear-scale triangle wave, with a 180° phase difference between the two speakers.

The sum of the amplitudes of the left and right sounds was kept constant. The overall level of the signals from the two speakers was 60 dB SPL.

We measured the modulation temporal frequency tuning of synchrony–asynchrony discrimination for all nine combinations of these audio-visual stimuli.

10.3. Results

The limit of discriminating in-phase from out-of-phase stimuli was about 3–5 Hz regardless of the audio-visual stimulus combination, including the V1–A1 pair and the


V3–A3 pair, which were designed to simulate situations where synchronous audio-visual stimuli come from the same physical event (Fig. 8). Note also that these temporal limits were close to those obtained with visual and auditory pulse trains (Fujisaki & Nishida, 2005b). Although the number of participants was too small to assess the significance of small differences between stimulus conditions, these results suggest that the temporal property of audio-visual synchrony–asynchrony discrimination is relatively independent of the types of stimuli and their combinations. This feature-independence is consistent with the idea that audio-visual synchrony perception is mediated by a salient-feature matching mechanism.

Fig. 9. Results of Experiment 9. (a) Proportion correct of synchrony–asynchrony (in-phase vs. out-of-phase) discrimination plotted as a function of the temporal frequency of the stimulus modulation, for nine combinations of audio-visual stimuli. Continuous lines are the best-fit logistic functions (Fujisaki & Nishida, 2005b). Participant: WF. (b) Threshold temporal frequency (75% correct point estimated from the fitted function) of synchrony–asynchrony discrimination by the two participants for nine stimulus conditions.
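The threshold estimates in Fig. 9 come from fitting logistic functions to the proportion-correct data and reading off the 75% correct point. A minimal Python sketch of such an analysis is given below; the particular parameterization of the logistic, the fit in log frequency, and the example data are our own illustrative assumptions, not the fitting procedure or the measurements of Fujisaki and Nishida (2005b).

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_pc(log_freq, log_thresh, slope):
    """Proportion correct falling from 1.0 to chance (0.5) with log temporal
    frequency; by construction the value is exactly 0.75 at log_thresh."""
    return 0.5 + 0.5 / (1.0 + np.exp(slope * (log_freq - log_thresh)))

def threshold_75(freqs_hz, prop_correct):
    """Fit the logistic in log frequency and return the 75%-correct frequency.
    This parameterization is an assumption; the original study's exact fitting
    procedure may differ."""
    p0 = [np.log(4.0), 2.0]                      # initial guess: ~4 Hz, moderate slope
    params, _ = curve_fit(logistic_pc, np.log(freqs_hz), prop_correct, p0=p0)
    return float(np.exp(params[0]))

# Hypothetical illustration values for one audio-visual pairing
# (not the authors' measurements).
freqs = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0])
pc = np.array([1.00, 0.95, 0.85, 0.70, 0.55, 0.50])
print(f"75% correct limit ~ {threshold_75(freqs, pc):.1f} Hz")
```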

11. Discussion

This study tried to clarify the mechanisms underlying subjective audio-visual temporal synchrony perception. Specifically, we attempted to determine whether subjective audio-visual temporal synchrony is detected by low-level specialized sensors or by higher-level feature-matching mechanisms, assuming that these hypothetical low- and high-level mechanisms are analogous to those proposed for visual motion detection (Braddick, 1974; Lu & Sperling, 2001). We expected that if low-level sensors exist for audio-visual temporal synchrony detection, audio-visual temporal synchrony–asynchrony discrimination would be possible even for dense random pulse trains that do not contain trackable salient features. Our results, however, did not support this prediction (Experiments 1 and 2). The difficulty in discriminating the synchrony of dense random pulse trains could not be ascribed to a temporal limit of peripheral visual or auditory mechanisms (Experiments 3 and 4), nor to the lack of spatial co-localization of the audio-visual stimuli (Experiment 5). No dissociation was found between objective synchrony–asynchrony discrimination performance and subjective synchrony perception (Experiment 6). The temporal limiting factor of audio-visual synchrony discrimination is neither the temporal frequency of the stimulus nor the physical density of the stimulus, but the temporal density of salient features (Experiments 7 and 8). Audio-visual synchrony discrimination is relatively independent of the types of stimuli and their combinations (Experiment 9). These results support the hypothesis that audio-visual temporal synchrony perception is established solely by a salient-feature matching process. The feature-matching mechanism seems to be an effective solution to the combinatorial explosion and false matching problems between audio-visual signals: it allows matching only for events that are likely to have causal relationships and ignores the enormous number of other, unrelated audio-visual events.

The sparse temporal density limit for extracting salient temporal features is probably not restricted to cross-modal (audio-visual) comparisons. Similar temporal limits have also been observed for within-modal (within-visual), but cross-attribute, comparisons to which local specialized sensors are unlikely to contribute. Holcombe and Cavanagh (2001) reported that the upper temporal limit for pairing spatially separate color and orientation is slower than 3 Hz. It has also been reported that the temporal frequency limit for establishing correspondence between attributes is of the order of several hertz for color and motion (Arnold, 2005) and for spatially separated luminance modulations (Victor & Conte, 2002). We conjecture that salient-feature matching may be a common principle of mid-level temporal binding.

Another line of evidence that audio-visual temporal synchrony detection is established at a higher level was found in an experiment involving a visual search


of a target changing in synchrony with an auditory signal (Fujisaki et al., 2006). The results indicate that audio-visual perceptual synchrony is judged by means of a serial process. This is also in line with the recent finding by Alsius, Navarra, Campbell, and Soto-Faraco (2005) that attention is also required for the occurrence of the classic McGurk illusion (McGurk & MacDonald, 1976), which contradicts the widely held belief that cross-modal speech integration is automatic. However, according to our hypothesis, audio-visual binding is not always attention-demanding. In environments where the number of events is small, early 'bottom-up' segmentation processes for each modality can unambiguously extract corresponding audio and visual signals as salient features. Hypothetically, in these circumstances, the perceptual matching process can detect audio-visual synchrony without consuming attentional resources.

Vatakis, Bayliss, Zampini, and Spence (in press) found that temporal discrimination performance for a target audio-visual pair, evaluated in terms of the JND (just noticeable difference) of temporal order judgments, is significantly impaired when the target is embedded in the middle of synchronous audio-visual distractors. They also found that performance is somewhat improved by letting the participants know the position of the target, or by making the target distinctive through changes in color and pitch, as we did in Experiment 8. In agreement with our salient-feature matching hypothesis, Vatakis et al. interpreted these findings to imply that the distractors impair synchrony discrimination by reducing the saliency of the target stimuli through a temporal crowding effect, which can be reduced to some extent by top-down or bottom-up attention to the target. Vatakis and Spence (2006c) also reported that the crowding effect induced by bimodal audio-visual distractors was much larger than that induced by unimodal visual or auditory distractors. Although this finding suggests that the crowding effect may have an origin in cross-modal processing, Experiment 4 of this study suggests that the addition of task-irrelevant unimodal pulse trains could also significantly impair audio-visual synchrony discrimination, at least when the temporal density of the additional signal is high.

Even though our data do not support the existence of specialized low-level audio-visual synchrony detectors that effectively detect audio-visual temporal synchrony, one may suspect that there are some low-level detectors for the synchrony of audio-visual attribute pairs that are bound in the natural environment, such as audio-visual looming (Kitagawa & Ichihara, 2002; Maier, Neuhoff, Logothetis, & Ghazanfar, 2004), co-localized moving objects (Meyer et al., 2005), face movements and voice (Ghazanfar, Maier, Hoffman, & Logothetis, 2005; Munhall & Vatikiotis-Bateson, 2004), and pitch and visual height (Evans & Treisman, 2005; Maeda, Kanai, & Shimojo, 2004). Although we cannot completely exclude this possibility, Experiment 9 of this study did not indicate significant improvements in audio-visual synchrony–asynchrony discrimination even under

the conditions where we simulated looming (V1–A1 condition) and a position shift (V3–A3 condition). In relation to this issue, Vatakis and Spence (2006a, 2006b) examined the JND of audio-visual temporal order judgments for music, speech and object actions. They concluded that performance depends more on stimulus complexity than on the event type. That is, cross-modal temporal discrimination performance is better for audio-visual stimuli of lower complexity (e.g., syllables) than for stimuli having continuously varying properties (words and/or sentences) (Vatakis & Spence, 2006a). This conclusion agrees well with our hypothesis, given that the temporal density of salient features is a good index of stimulus complexity. Arrighi, Alais, and Burr (2006) also found no improvement in audio-visual synchrony perception when the stimulus was real and biological as compared to when the stimulus was artificial.

Whereas Fujisaki and Nishida (2005b) found that audio-visual synchrony discrimination for repetitive pulse trains is greatly impaired beyond 4 Hz, Arrighi et al. (2006) found that the audio-visual synchrony judgment for drumming stimuli is not significantly impaired even when the drum rhythm varies randomly from 4 to 11.2 Hz (their Experiment 3, fast series). The present study suggests a clue to resolving this apparent discrepancy. Given that the temporal limit of audio-visual synchrony perception is determined by the density of salient features, not by the physical temporal frequency of the stimuli, it is possible that the random frequency modulation provided sparse, salient higher-order features, such as those contained in the amplitude or frequency modulation of drumming, for audio-visual matching.

In summary, we consider that our model, drawn mainly from audio-visual synchrony perception for artificial stimuli, is generally applicable to synchrony perception of natural and/or complex events. Our model, however, is still descriptive, and quantitative prediction of the present data, including the effect of pulse density on audio-visual synchrony perception, awaits further investigation.

One remaining problem is to specify the matching algorithm for audio-visual synchrony perception. We consider it likely to be a neural cross-correlator similar to those involved in the detection of visual motion (van Santen & Sperling, 1985), binocular stereo (Banks, Gepshtein, & Landy, 2004) and binaural sound matching (Jeffress, 1948), at least as a first approximation. This view is consistent with findings that lag adaptation induces a recalibration of audio-visual synchrony perception that looks functionally analogous to, e.g., the motion aftereffect (Fujisaki et al., 2004; Vroomen, Keetels, de Gelder, & Bertelson, 2004), and that increment discrimination of audio-visual asynchrony shows a dipper function qualitatively similar to those obtained with within-modal stimuli (Banks et al., 2006).

Another and more serious limitation of the present model is that it does not specify how salient temporal features are computed from sensory inputs. It would be


interesting in the future to test models of temporal feature extraction analogous to the models proposed for the extraction of higher-order features in spatial vision (Chubb & Sperling, 1988; Itti & Koch, 2001; Landy & Bergen, 1991; Li, 2002; Morgan & Watt, 1997). It should also be noted that the notion of ''feature'' is not well defined even in spatial vision, and that there is as yet no satisfactory model of figure-ground segregation and stream segregation, on which the notion of salience is based (Lu & Sperling, 2001). Correct modeling of salient feature extraction would require further clarification of mid- and high-level sensory processing, where the representations may be more symbolic than those in lower-level sensory processing (Marr, 1982). It should also be noted that analyzing higher-level representations is not easy in pure vision and hearing research, since various mechanisms at multiple processing stages can contribute to psychophysical performance. However, if we are correct in thinking that audio-visual synchrony judgments are based purely on a high-level feature-matching mechanism, cross-modal study could be a useful paradigm for selectively analyzing high-level sensory representations. This is why cross-modal research could be an important topic even for the pure-vision research community.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2(2), 284–299.
Alais, D., & Burr, D. (2004). No direction-specific bimodal facilitation for audiovisual motion detection. Cognitive Brain Research, 19(2), 185–194.
Alsius, A., Navarra, J., Campbell, R., & Soto-Faraco, S. (2005). Audiovisual integration of speech falters under high attention demands. Current Biology, 15(9), 839–843.
Arnold, D. H. (2005). Perceptual pairing of colour and motion. Vision Research, 45(24), 3015–3026.
Arrighi, R., Alais, D., & Burr, D. (2006). Perceptual synchrony of audiovisual streams for natural and artificial motion sequences. Journal of Vision, 6(3), 260–268.
Banks, M., Burr, D., & Morrone, C. (2006). Auditory-visual temporal discrimination: Evidence for usage of a temporal cross-correlator? International Multisensory Research Forum, Dublin, Ireland.
Banks, M. S., Gepshtein, S., & Landy, M. S. (2004). Why is spatial stereoresolution so low? Journal of Neuroscience, 24(9), 2077–2089.
Bertelson, P., Vroomen, J., de Gelder, B., & Driver, J. (2000). The ventriloquist effect does not depend on the direction of deliberate visual attention. Perception & Psychophysics, 62(2), 321–332.
Braddick, O. (1974). A short-range process in apparent motion. Vision Research, 14(7), 519–527.
Braddick, O. J. (1980). Low-level and high-level processes in apparent motion. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 290(1038), 137–151.
Bregman, A. S., & Achim, A. (1973). Visual stream segregation. Perception & Psychophysics, 13(3), 451–454.
Breitmeyer, B. G. (1984). Visual masking: An integrative approach. London: Oxford University Press.
Cavanagh, P. (1991). Short-range vs. long-range motion: not a valid distinction. Spatial Vision, 5(4), 303–309.
Cavanagh, P. (1992). Attention-based motion perception. Science, 257(5076), 1563–1565.


Cavanagh, P., Arguin, M., & von Grünau, M. (1989). Interattribute apparent motion. Vision Research, 29(9), 1197–1204.
Cavanagh, P., & Mather, G. (1989). Motion: the long and short of it. Spatial Vision, 4(2–3), 103–129.
Chang, J. J., & Julesz, B. (1983). Displacement limits for spatial frequency filtered random-dot kinematograms in apparent motion. Vision Research, 23(12), 1379–1385.
Chubb, C., & Sperling, G. (1988). Drift-balanced random stimuli: a general basis for studying non-Fourier motion perception. Journal of the Optical Society of America A, 5(11), 1986–2007.
Corneil, B. D., Van Wanrooij, M., Munoz, D. P., & Van Opstal, A. J. (2002). Auditory-visual interactions subserving goal-directed saccades in a complex scene. Journal of Neurophysiology, 88(1), 438–454.
D'Zmura, M. (1991). Color in visual search. Vision Research, 31(6), 951–966.
Dawson, M. R. (1991). The how and why of what went where in apparent motion: Modeling solutions to the motion correspondence problem. Psychological Review, 98(4), 569–603.
Dixon, N. F., & Spitz, L. (1980). The detection of auditory visual desynchrony. Perception, 9(6), 719–721.
Dobkins, K. R., & Albright, T. D. (2004). Merging processing streams: Color cues for motion detection and interpretation. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences. Cambridge, MA: The MIT Press.
Evans, K. K., & Treisman, A. (2005). Crossmodal binding of audio-visual correspondent features [Abstract]. Journal of Vision, 5(8), 874a.
Forte, J., Hogben, J. H., & Ross, J. (1999). Spatial limitations of temporal segmentation. Vision Research, 39(24), 4052–4061.
Fujisaki, W., Koene, A., Arnold, D., Johnston, A., & Nishida, S. (2006). Visual search for a target changing in synchrony with an auditory signal. Proceedings of the Royal Society B: Biological Sciences, 273(1588), 865–874.
Fujisaki, W., & Nishida, S. (2005a). Feature-based, post-attentive processing for temporal synchrony perception of audiovisual signals revealed by random pulse trains. In 6th Annual Meeting of the International Multisensory Research Forum, Rovereto, Italy.
Fujisaki, W., & Nishida, S. (2005b). Temporal frequency characteristics of synchrony–asynchrony discrimination of audio-visual signals. Experimental Brain Research, 166(3–4), 455–464.
Fujisaki, W., Shimojo, S., Kashino, M., & Nishida, S. (2004). Recalibration of audiovisual simultaneity. Nature Neuroscience, 7(7), 773–778.
Gebhard, J. W., & Mowbray, G. H. (1959). On discriminating the rate of visual flicker and auditory flutter. American Journal of Psychology, 72, 521–529.
Ghazanfar, A. A., Maier, J. X., Hoffman, K. L., & Logothetis, N. K. (2005). Multisensory integration of dynamic faces and voices in rhesus monkey auditory cortex. Journal of Neuroscience, 25(20), 5004–5012.
Holcombe, A. O., & Cavanagh, P. (2001). Early binding of feature pairs for visual perception. Nature Neuroscience, 4(2), 127–128.
Itti, L., & Koch, C. (2001). Computational modelling of visual attention. Nature Reviews Neuroscience, 2(3), 194–203.
Jeffress, L. A. (1948). A place theory of sound localization. Journal of Comparative and Physiological Psychology, 41, 35–39.
Julesz, B. (1971). Foundations of cyclopean perception. Chicago, IL: University of Chicago Press.
Kelly, D. H. (1979). Motion and vision. I. Stabilized images of stationary gratings. Journal of the Optical Society of America, 69(9), 1266–1274.
Kitagawa, N., & Ichihara, S. (2002). Hearing visual motion in depth. Nature, 416(6877), 172–174.
Kooi, F. L., Toet, A., Tripathy, S. P., & Levi, D. M. (1994). The effect of similarity and duration on spatial interaction in peripheral vision. Spatial Vision, 8(2), 255–279.
Landy, M. S., & Bergen, J. R. (1991). Texture segregation and orientation gradient. Vision Research, 31(4), 679–691.
Lebib, R., Papo, D., de Bode, S., & Baudonniere, P. M. (2003). Evidence of a visual-to-auditory cross-modal sensory gating phenomenon as reflected by the human P50 event-related brain potential modulation. Neuroscience Letters, 341(3), 185–188.


Li, Z. (2002). A saliency map in primary visual cortex. Trends in Cognitive Sciences, 6(1), 9–16.
Liu, L., Stevenson, S. B., & Schor, C. M. (1994). Quantitative stereoscopic depth without binocular correspondence. Nature, 367(6458), 66–69.
Lu, Z. L., Lesmes, L. A., & Sperling, G. (1999). The mechanism of isoluminant chromatic motion perception. Proceedings of the National Academy of Sciences of the United States of America, 96(14), 8289–8294.
Lu, Z. L., & Sperling, G. (1995a). Attention-generated apparent motion. Nature, 377(6546), 237–239.
Lu, Z. L., & Sperling, G. (1995b). The functional architecture of human visual motion perception. Vision Research, 35(19), 2697–2722.
Lu, Z. L., & Sperling, G. (2001). Three-systems theory of human visual motion perception: review and update. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 18(9), 2331–2370.
Maeda, F., Kanai, R., & Shimojo, S. (2004). Changing pitch induced visual motion illusion. Current Biology, 14(23), R990–R991.
Maier, J. X., Neuhoff, J. G., Logothetis, N. K., & Ghazanfar, A. A. (2004). Multisensory integration of looming signals by rhesus monkeys. Neuron, 43(2), 177–181.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York: W.H. Freeman and Company.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746–748.
Meredith, M. A., & Stein, B. E. (1986). Spatial factors determine the activity of multisensory neurons in cat superior colliculus. Brain Research, 365(2), 350–354.
Meyer, G. F., Wuerger, S. M., Rohrbein, F., & Zetzsche, C. (2005). Low-level integration of auditory and visual motion signals requires spatial co-localisation. Experimental Brain Research, 166(3–4), 538–547.
Morgan, M. J., & Watt, R. J. (1997). The combination of filters in early spatial vision: a retrospective analysis of the MIRAGE model. Perception, 26(9), 1073–1088.
Munhall, K., & Vatikiotis-Bateson, E. (2004). Spatial and temporal constraints on audiovisual speech perception. In G. A. Calvert, C. Spence, & B. E. Stein (Eds.), The handbook of multisensory processes. Cambridge, MA: MIT Press.
Munhall, K. G., Gribble, P., Sacco, L., & Ward, M. (1996). Temporal constraints on the McGurk effect. Perception & Psychophysics, 58(3), 351–362.
Musacchia, G., Sams, M., Nicol, T., & Kraus, N. (2006). Seeing speech affects acoustic information processing in the human brainstem. Experimental Brain Research, 168(1–2), 1–10.
Nakayama, K. (1985). Biological image motion processing: A review. Vision Research, 25(5), 625–660.
Nishida, S., & Ashida, H. (2001). A motion aftereffect seen more strongly by the non-adapted eye: evidence of multistage adaptation in visual motion processing. Vision Research, 41(5), 561–570.
Nishida, S., & Fujisaki, W. (2005). Comparison of audiovisual temporal synchrony perception with visual motion perception suggests a general feature-matching model for cross-attribute binding. In European Conference on Visual Perception, A Coruña, Spain.
Nishida, S., Ledgeway, T., & Edwards, M. (1997). Dual multiple-scale processing for motion in the human visual system. Vision Research, 37(19), 2685–2698.
Nishida, S., & Sato, T. (1992). Positive motion after-effect induced by bandpass-filtered random-dot kinematograms. Vision Research, 32(9), 1635–1646.
Nishida, S., & Sato, T. (1995). Motion aftereffect with flickering test patterns reveals higher stages of motion processing. Vision Research, 35(4), 477–490.
Odgaard, E. C., Arieh, Y., & Marks, L. E. (2003). Cross-modal enhancement of perceived brightness: sensory interaction versus response bias. Perception & Psychophysics, 65(1), 123–132.
Odgaard, E. C., Arieh, Y., & Marks, L. E. (2004). Brighter noise: sensory enhancement of perceived loudness by concurrent visual stimulation. Cognitive, Affective, & Behavioral Neuroscience, 4(2), 127–132.

Ramachandran, V. S., Rao, V. M., & Vidyasagar, T. R. (1973). The role of contours in stereopsis. Nature, 242(5397), 412–414.
Recanzone, G. H. (2003). Auditory influences on visual temporal rate perception. Journal of Neurophysiology, 89(2), 1078–1093.
Reichardt, W. (1961). Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. In W. A. Rosenblith (Ed.), Sensory communication. Cambridge, MA: MIT Press.
Sanabria, D., Soto-Faraco, S., Chan, J. S., & Spence, C. (2004). When does visual perceptual grouping affect multisensory integration? Cognitive, Affective, & Behavioral Neuroscience, 4(2), 218–229.
Sato, T. (1999). Dmax: Relations to low- and high-level motion processes. In T. Watanabe (Ed.), High-level motion processing: Computational, neurobiological, and psychophysical perspectives (pp. 115–151). Cambridge, MA: MIT Press.
Schroeder, C. E., & Foxe, J. (2005). Multisensory contributions to low-level, 'unisensory' processing. Current Opinion in Neurobiology, 15(4), 454–458.
Seiffert, A. E., & Cavanagh, P. (1999). Position-based motion perception for color and texture stimuli: effects of contrast and speed. Vision Research, 39(25), 4172–4185.
Sekuler, R., Sekuler, A. B., & Lau, R. (1997). Sound alters visual motion perception. Nature, 385(6614), 308.
Shams, L., Kamitani, Y., & Shimojo, S. (2000). Illusions. What you see is what you hear. Nature, 408(6814), 788.
Shams, L., Kamitani, Y., & Shimojo, S. (2002). Visual illusion induced by sound. Cognitive Brain Research, 14(1), 147–152.
Sheth, B. R., & Shimojo, S. (2004). Sound-aided recovery from and persistence against visual filling-in. Vision Research, 44(16), 1907–1917.
Shimojo, S., & Shams, L. (2001). Sensory modalities are not separate modalities: plasticity and interactions. Current Opinion in Neurobiology, 11(4), 505–509.
Shipley, T. (1964). Auditory flutter-driving of visual flicker. Science, 145, 1328–1330.
Soto-Faraco, S., Navarra, J., & Alsius, A. (2004). Assessing automaticity in audiovisual speech integration: evidence from the speeded classification task. Cognition, 92(3), B13–B23.
Soto-Faraco, S., Spence, C., & Kingstone, A. (2005). Assessing automaticity in the audiovisual integration of motion. Acta Psychologica (Amsterdam), 118(1–2), 71–92.
Spence, C., & Driver, J. (2000). Attracting attention to the illusory location of a sound: reflexive crossmodal orienting and ventriloquism. Neuroreport, 11(9), 2057–2061.
Spence, C., & Driver, J. (2004). Crossmodal space and crossmodal attention. Oxford: Oxford University Press.
Spence, C., & Squire, S. (2003). Multisensory integration: maintaining the perception of synchrony. Current Biology, 13(13), R519–R521.
Stein, B. E., London, N., Wilkinson, L. K., & Price, D. D. (1996). Enhancement of perceived visual intensity by auditory stimuli: a psychophysical analysis. Journal of Cognitive Neuroscience, 8, 497–506.
Stein, B. E., & Meredith, M. A. (1993). The merging of the senses. Boston, MA: MIT Press.
Tukey, J. (1977). Exploratory data analysis. London: Addison-Wesley.
Ullman, S. (1979). The interpretation of visual motion. Cambridge, MA: MIT Press.
van Santen, J. P. H., & Sperling, G. (1985). Elaborated Reichardt detectors. Journal of the Optical Society of America A: Optics and Image Science, 2(2), 300–321.
van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences of the United States of America, 102(4), 1181–1186.
Vatakis, A., Bayliss, L., Zampini, M., & Spence, C. (in press). The influence of synchronous audiovisual distractors on audiovisual temporal order judgments. Perception & Psychophysics.
Vatakis, A., & Spence, C. (2006a). Audiovisual synchrony perception for music, speech, and object actions. Brain Research, 1111(1), 134–142.

Vatakis, A., & Spence, C. (2006b). Audiovisual synchrony perception for speech and music assessed using a temporal order judgment task. Neuroscience Letters, 393(1), 40–44.
Vatakis, A., & Spence, C. (2006c). Temporal order judgments for audiovisual targets embedded in unimodal and bimodal distractor streams. Neuroscience Letters, 408(1), 5–9.
Victor, J. D., & Conte, M. M. (2002). Temporal phase discrimination depends critically on separation. Vision Research, 42(17), 2063–2071.
Vroomen, J., Bertelson, P., & de Gelder, B. (2001). The ventriloquist effect does not depend on the direction of automatic visual attention. Perception & Psychophysics, 63(4), 651–659.
Vroomen, J., Keetels, M., de Gelder, B., & Bertelson, P. (2004). Recalibration of temporal order perception by exposure to audio-visual asynchrony. Cognitive Brain Research, 22(1), 32–35.
Watanabe, K., & Shimojo, S. (2001). When sound affects vision: effects of auditory grouping on visual motion perception. Psychological Science, 12(2), 109–116.


Watson, A. B., & Ahumada, A. J., Jr. (1985). Model of human visual-motion sensing. Journal of the Optical Society of America A: Optics and Image Science, 2(2), 322–341.
Watson, A. B., & Robson, J. G. (1981). Discrimination at threshold: Labeled detectors in human vision. Vision Research, 21(7), 1115–1122.
Welch, R. B., & Warren, D. H. (1986). Intersensory interactions. In K. R. Boff, L. Kaufman, & J. P. Thomas (Eds.), Handbook of perception and human performance: Vol. 2. Cognitive processes and performance. New York: Wiley.
Wilcox, L. M., & Hess, R. F. (1997). Scale selection for second-order (nonlinear) stereopsis. Vision Research, 37(21), 2981–2992.
Zampini, M., Guest, S., Shore, D. I., & Spence, C. (2005). Audiovisual simultaneity judgments. Perception & Psychophysics, 67(3), 531–544.
Zampini, M., Shore, D. I., & Spence, C. (2003). Multisensory temporal order judgments: the role of hemispheric redundancy. International Journal of Psychophysiology, 50(1–2), 165–180.
