Exp Brain Res (2005) 167: 66–75 DOI 10.1007/s00221-005-0008-z
RESEARCH ARTICLE
Maurizio Gentilucci · Luigi Cattaneo
Automatic audiovisual integration in speech perception
Received: 13 September 2004 / Accepted: 30 March 2005 / Published online: 21 July 2005
© Springer-Verlag 2005
Abstract Two experiments aimed to determine whether features of both the visual and the acoustical inputs are always merged into the perceived representation of speech, and whether this audiovisual integration is based on cross-modal binding functions or on imitation. In a McGurk paradigm, observers were required to repeat aloud a string of phonemes uttered by an actor (acoustical presentation of a phonemic string) whose mouth, in contrast, mimicked pronunciation of a different string (visual presentation). In a control experiment, participants read the same printed strings of letters. This condition aimed to analyze the voice pattern and the lip kinematics while controlling for imitation. In the control experiment and in the congruent audiovisual presentation, i.e. when the mouth articulation gestures were congruent with the emission of the string of phonemes, the voice spectrum and the lip kinematics varied according to the pronounced string of phonemes. In the McGurk paradigm the participants were unaware of the incongruence between the visual and the acoustical stimuli. The acoustical analysis of the participants' spoken responses showed three distinct patterns: fusion of the two stimuli (the McGurk effect), repetition of the acoustically presented string of phonemes, and, less frequently, repetition of the string of phonemes corresponding to the mouth gestures mimicked by the actor. However, the analysis of the latter two response types showed that formant 2 of the participants' voice spectra always differed from the value recorded in the congruent audiovisual presentation: it approached the value of formant 2 of the string of phonemes presented in the other, apparently ignored, modality. The lip kinematics of participants repeating the acoustically presented string of phonemes were influenced by observation of the lip movements mimicked by the actor, but only when pronouncing a labial consonant. The data are discussed in favor of the hypothesis that features of both the visual and the acoustical inputs always contribute to the representation of a string of phonemes, and that cross-modal integration occurs by extracting the mouth articulation features peculiar to the pronunciation of that string of phonemes.

Keywords McGurk effect · Audiovisual integration · Voice spectrum analysis · Lip kinematics · Imitation

M. Gentilucci (✉) · L. Cattaneo
Dipartimento di Neuroscienze, Università di Parma, Via Volturno 39, 43100 Parma, Italy
E-mail: [email protected]
Tel.: +39-0521-903899
Fax: +39-0521-903900
Introduction

Most linguistic interactions occur within a face-to-face context, in which both acoustic (speech) and visual information (mouth movements) are involved in message comprehension. Although humans are able to understand words without any visual input, audiovisual perception has been shown to improve language comprehension (Sumby and Pollack 1954), even when the acoustic information is perfectly clear (Reisberg et al. 1987). In support of this behavioral observation, brain-imaging studies have shown that, when the speaker is also seen by the interlocutor, the activation of the acoustical A1/A2 and visual V5/MT cortical areas is greater than when the information is presented in either the acoustical or the visual modality alone (Calvert et al. 2000). In addition, speech-reading activates acoustical areas even in the absence of any acoustical input (Calvert et al. 1997).

Two hypotheses, though not mutually exclusive, can explain the integration of information about verbal messages provided by the two sensory (acoustical and visual) modalities. The first hypothesis is based on specific cross-modal binding functions, and it postulates supra-modal integration (Calvert et al. 1999, 2000; Calvert and Campbell 2003). This integration could be based on similar patterns of time-varying features common to both the acoustical and the visual input. More specifically, the timing of changes in vocalization is visible as well as audible in terms of their time-varying
patterns (Munhall and Vatikiotis-Bateson 1998). For example, variations in speech sound amplitude can be accompanied by visible changes in the movement pattern of the mouth articulators. Another cross-modal function is based on features of stilled (configurational) as well as moving face images (Calvert and Campbell 2003). Anatomically, cortical regions along the superior temporal sulcus (STS) may be involved in specific cross-modal functions. STS is activated by observation of biological motion, including mouth movements during speech (Bonda et al. 1996; Buccino et al. 2004; Calvert et al. 2000; Campbell et al. 2001), and shows consistent and extensive activation also when hearing speech (Calvert et al. 1999, 2000). Calvert et al. (2000) observed that, for appropriately synchronized audiovisual speech, the profile of STS activation correlated with enhanced neuronal activity in the sensory-specific visual (V5/MT) and auditory (A1/A2) cortices. This cross-modal gain may be mediated by back projections from STS to the sensory cortices (Calvert et al. 1999).

The second hypothesis is based on the possibility that presentation of either a human voice pronouncing a string of phones or a face mimicking pronunciation of a string of phonemes activates automatic imitation of the two stimuli. It is possible that the information provided by the two different modalities is integrated by superimposing an imitation mouth program automatically elicited by the visual stimulus on another one automatically elicited by the acoustical stimulus, in accordance with the motor theory of speech perception (Liberman and Mattingly 1985). In this respect, cortical regions within Broca's area may be involved in audiovisual integration by imitation, since this area is activated by observation/imitation of moving and speaking faces (Buccino et al. 2004; Calvert and Campbell 2003; Campbell et al. 2001; Carr et al. 2003; Leslie et al. 2004; for a review see Bookheimer 2002). The activity of Broca's area is significantly correlated with the increased excitability of the motor system underlying speech production when perceiving auditory speech (Watkins and Paus 2004). This area is also involved in observation/imitation of hand movements (Iacoboni et al. 1999; Buccino et al. 2001, 2004; Heiser et al. 2003), in accordance with the hypothesis that it represents one of the putative sites of the human "mirror system", which is thought to have evolved from the monkey premotor cortex and to have acquired new cognitive functions such as speech processing (Rizzolatti and Arbib 1998).

The McGurk effect (McGurk and MacDonald 1976) represents a particular kind of audiovisual integration in which the acoustical information on a string of phonemes conflicts with the visually presented mouth articulation gesture. When people process two different syllables, one presented in the visual modality and the other in the acoustical modality, they tend either to fuse or to combine the two elements. For example, when the voice of the talker pronounces the syllable /ba/ and her/his lips mimic the syllable /ga/, the observer tends to fuse the two syllables and to perceive the syllable /da/. Conversely, when the talker's voice pronounces /ga/ and her/his lips mimic /ba/, the observer tends to combine the two elements and to perceive either /bga/ or /gba/.
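The two pairings just described can be summarized in a small lookup table. The sketch below is purely illustrative: it merely restates the example outcomes reported above (fusion for auditory /ba/ with visual /ga/, combination for auditory /ga/ with visual /ba/), and the data structure and function are not taken from the original study.

```python
# Illustrative only: the audiovisual syllable pairings and the typical percepts
# described in the text (McGurk and MacDonald 1976).
# Keys are (acoustically presented syllable, visually mimicked syllable).
MCGURK_EXAMPLES = {
    ("ba", "ga"): {"percepts": ["da"], "type": "fusion"},
    ("ga", "ba"): {"percepts": ["bga", "gba"], "type": "combination"},
}

def expected_percepts(acoustic, visual):
    """Return the textbook outcome for a pairing, if it is listed above."""
    if acoustic == visual:
        return {"percepts": [acoustic], "type": "congruent"}
    return MCGURK_EXAMPLES.get((acoustic, visual))

print(expected_percepts("ba", "ga"))  # fusion: perceived /da/
print(expected_percepts("ga", "ba"))  # combination: /bga/ or /gba/
```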
The finding that combination rather than fusion of the two strings of phonemes occurs when the visual information on the syllable is unambiguous (/ba/ versus /ga/) suggests that merging the visual information with the acoustical information, as observed in the fusion effect, occurs only in particular circumstances, i.e. when the visual stimulus allows multiple interpretations of the string of phonemes (note that the external mouth pattern of /ga/ is not very different from that of /da/). The tendency to fuse auditory and visual speech seems to show some specificity for the language used. Indeed, although the effect has been well documented for English speakers (for reviews see Chen and Massaro 2004; Summerfield 1992; Massaro 1998), some Asian populations, such as Japanese and Chinese speakers, are less susceptible to the McGurk effect (Chen and Massaro 2004; Sekiyama and Tohkura 1993).

These data pose the following problem: does the process of audiovisual matching code representations that lack features of either the visual or the acoustical stimulus or, in contrast, does it code representations that always contain features of both sources of information? In the present study we tested the two hypotheses by examining, in the McGurk paradigm, the responses in which the participants repeated either the visually or the acoustically presented string of phonemes. Using kinematic and voice-spectrum analysis techniques, we verified whether the two presentations always influenced the responses. In particular, we verified whether the voice spectra of the repeated string of phonemes changed as compared with the voice spectra of the same string of phonemes repeated in the condition of congruent visual and acoustical stimuli. Moreover, we verified whether they approached the voice spectra of the string of phonemes presented in the other sensory modality.

A second problem is whether audiovisual integration is based on the superimposition of two automatic imitation motor programs or on cross-modal elaboration. The imitation hypothesis postulates that speech perception occurs by automatically integrating the mouth articulation pattern elicited by the acoustical stimulus with that elicited by the visual stimulus (Liberman and Mattingly 1985). The cross-modal hypothesis postulates that perception occurs by supra-modal integration of time-varying characteristics of speech extracted from both the visual and the acoustical stimulus (Calvert et al. 1999, 2000; Calvert and Campbell 2003). To test the two hypotheses we analyzed the responses in which the acoustically presented string of phonemes was repeated, and verified whether its external mouth pattern was influenced by the visual stimulus, i.e. by the external mouth pattern mimicked by the actor. If two automatic imitation motor programs are superimposed, an effect of the visual stimulus on the observer's external mouth pattern should always be seen. This should occur even when the string of phonemes mimicked by the actor requires
a peculiar modification of the internal mouth, so that the external mouth movements are only a consequence of, and indirectly related to, pronunciation of the string of phonemes (in the present study, /aga/). On the other hand, if time-varying features specific to the string of phonemes are extracted from the visual stimulus (cross-modal integration hypothesis), we should observe an effect only for the visually presented string of phonemes containing labial consonants, i.e. the one whose external mouth modification is peculiar to the pronunciation of that string of phonemes (in the present study, /aba/).
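The prediction about formant 2 can be made concrete with a simple normalized shift index. The sketch below is an editorial illustration of this logic, not an analysis taken from the paper; the function, the index, and the numbers in the usage example are hypothetical.

```python
# Illustrative sketch (not from the paper): a normalized index of how far the
# second formant (F2) of a string repeated in the incongruent audiovisual
# condition moves away from its congruent baseline and toward the F2 of the
# string presented in the other (apparently ignored) modality.

def f2_shift_index(f2_incongruent, f2_congruent_same, f2_congruent_other):
    """0 = no shift from the congruent baseline;
    1 = F2 fully reaches the value of the other-modality string."""
    denom = f2_congruent_other - f2_congruent_same
    if denom == 0:
        return 0.0
    return (f2_incongruent - f2_congruent_same) / denom

# Hypothetical numbers, for illustration only: a speaker's congruent /aba/ has
# F2 = 1300 Hz, the congruent /aga/ has F2 = 1430 Hz, and the /aba/ repeated in
# the McGurk condition (voice ABA, mouth AGA) is produced with F2 = 1365 Hz.
print(f2_shift_index(1365.0, 1300.0, 1430.0))  # 0.5: halfway toward /aga/
```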
Methods

Sixty-five right-handed (according to the Edinburgh inventory, Oldfield 1971) Italian speakers (51 females and 14 males, aged 22–27 years) participated in the present study. The study, to which the participants gave written informed consent, was approved by the Ethics Committee of the Medical Faculty of the University of Parma. All participants were naïve as to the McGurk paradigm and, consequently, to the purpose of the study. They were divided into three groups of eight, 31, and 26 individuals. Each group took part in one of three experiments (see below).

Participants sat in front of a table in a soundproof room, placing their forearms on the table plane. They were required not to move their head and trunk throughout the experimental session. A PC screen placed on the table plane was 40 cm from the participant's chest. Two loudspeakers were placed at the two sides of the display. The stimuli presented on the PC screen were the following three strings of letters or phonemes: ABA (/aba/), ADA (/ada/) and AGA (/aga/). Note that in Italian the vowel A is always pronounced /a/. In experiment 1 (string-of-letters reading) they were printed in white at the centre of the black PC display. Each letter was 3.9 cm high and 2.5 cm wide. It was presented 1,360 ms after the beginning of the trial and lasted 1,040 ms. In experiments 2 and 3 (audiovisual presentation of strings of phonemes) an actor (face: 6.9 × 10.4 cm) pronounced the three strings of phonemes. His half-body was presented 2,360 ms after the beginning of the trial and the presentation lasted 2,000 ms. In all the experiments a ready signal, i.e. a red circle and a beep (duration 360 ms), was presented at the beginning of the trial. The following three experiments were carried out:

Experiment 1. Eight subjects participated in the experiment. The participants were presented with the printed strings of letters. The task was to read silently and then to repeat aloud the string of letters (string-of-letters reading paradigm).

Experiment 2. Thirty-one subjects participated in the experiment. The actor pronounced one of the three strings of phonemes. In the congruent audiovisual presentation, his visible mouth (visual stimulus) mimicked and his voice (acoustic stimulus) pronounced the same string of phonemes. In the incongruent audiovisual
presentation, the visible actor's mouth mimicked pronunciation of AGA, whereas his voice concurrently pronounced ABA (McGurk paradigm).

Experiment 3. Twenty-six subjects participated in the experiment. The experiment differed from experiment 2 only in the incongruent audiovisual presentation, in which the visible actor's mouth mimicked pronunciation of ABA, whereas his voice simultaneously pronounced AGA (inverse McGurk paradigm).

In all the experiments the participants were required to repeat aloud, at the end of the audio and/or visual stimulus presentation, the perceived string, using a neutral intonation and a voice volume similar to that used during normal conversation. They were not informed that in some trials the visual and acoustical stimuli were incongruent. No constraint on response time was given. At the end of the experimental session, all participants filled in a questionnaire in which they indicated (1) whether during the experimental session the sound of each string of phonemes (i.e. ABA, ADA, and AGA) varied and (2) whether they noticed that in some trials there was incongruence between the acoustical and the visual stimulus. Each string of letters or phonemes was randomly presented 5 times. Consequently, experiment 1 consisted of 15 trials, whereas experiments 2 and 3, which included both congruent and incongruent conditions, consisted of 20 trials each.

Participants' lip movements were recorded using the 3D optoelectronic ELITE system (B.T.S., Milan, Italy). It consists of two TV cameras that detect infrared reflecting markers at a sampling rate of 50 Hz. Movement reconstruction in 3D coordinates and computation of the kinematic parameters are described in a previous study (Gentilucci et al. 1992). Two markers were placed at the centre of the participant's upper and lower lip. The participant's two aperture–closure movements of the lips during pronunciation of the string of phonemes were measured by analyzing the time course of the distance between the upper and lower lip. The maximal lip aperture and the final lip closure (i.e. the minimal distance between upper and lower lip) at the end of the first lip closing, and the peak velocity of lip opening and the maximal lip aperture during the second lip opening, were measured. These parameters characterize the kinematics of the lips during consonant pronunciation. The procedures used to calculate the beginning and the end of lip movements were identical to those previously described (Gentilucci et al. 2004).

The time course of the actor's lip movements was recorded in 2D at a sampling rate of 30 Hz. Lip displacements were measured using the PREMIERE 6.0 software (ADOBE, http://www.adobe.com). We did not use the ELITE system to record the actor's lip movements in order to avoid the possibility that, during the visual presentation, the markers on the lips prevented the participants from recognizing the string of phonemes. Figure 1 shows the time course of the distance between the actor's upper and lower lip (squares) and of the distance between the right and left
corners of the actor's lips (diamonds). Note that the final lip closure decreased going from AGA to ABA (squares in Fig. 1), whereas little variation in the distance between the left and right corners of the lips was observed among the three strings of phonemes (diamonds in Fig. 1).

Fig. 1 Time course of the distance between the upper and lower lip (squares) and between the left and right lip corners (diamonds) of the actor pronouncing the ABA, ADA, and AGA strings of phonemes
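As a rough illustration of how the lip-kinematics parameters described in this section (maximal lip aperture, final lip closure, and the peak opening velocity and maximal aperture of the second opening) could be extracted from the sampled upper–lower lip distance, the following is a minimal sketch, not the authors' code. It assumes a 50 Hz signal covering a single trial, and the phase segmentation (first aperture peak, following minimum, and everything after it) is a simplification introduced here.

```python
# Minimal sketch (not the authors' code): lip-kinematics parameters computed
# from the time course of the distance between the upper- and lower-lip
# markers for one trial. Assumes the signal is already cropped to the
# utterance and that the first aperture peak falls within the first half.
import numpy as np

FS = 50.0  # ELITE sampling rate in Hz, as reported in the Methods

def lip_parameters(aperture, fs=FS):
    """aperture: 1-D sequence of upper-lower lip distances (mm)."""
    aperture = np.asarray(aperture, dtype=float)
    velocity = np.gradient(aperture) * fs                  # opening velocity, mm/s
    first_peak = int(np.argmax(aperture[: len(aperture) // 2]))
    closure = first_peak + int(np.argmin(aperture[first_peak:]))
    return {
        "maximal_lip_aperture": float(aperture[first_peak]),
        "final_lip_closure": float(aperture[closure]),      # minimal distance
        "peak_opening_velocity_2nd": float(velocity[closure:].max()),
        "maximal_aperture_2nd_opening": float(aperture[closure:].max()),
    }
```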
The voice emitted by the participants and by the actor was recorded by means of a microphone (Studio Electret Microphone, 20–20,000 Hz, 500 Ω, 5 mV/Pa at 1 kHz) placed on a table support. The centre of the support was 8.0 cm from the participant's chest, to the right of the participant and 8.0 cm from the participant's sagittal axis. The microphone was connected to a PC by a sound card (16 PCI Sound Blaster, CREATIVE Technology Ltd., Singapore). The spectrogram of each string of phonemes was computed using the PRAAT software (University of Amsterdam, The Netherlands). The time courses of formants (F) 1 and 2 of the participants and of the actor were analyzed. The time course of the string-of-phonemes pronunciation was divided into three parts. The first part (T1-phase) included pronunciation of the first /a/ vowel and the formant transition before mouth occlusion; the latter approximately corresponded to the first mouth closing movement. The second part (T0-phase) included the mouth occlusion. Only the mouth occlusion of ABA pronunciation corresponded to the final lip closure; the mouth occlusion of the other strings corresponded to the final closure of internal mouth parts not recorded by the kinematic technique. The third part (T2-phase) included the formant transition during release of the mouth occlusion, approximately corresponding to the second mouth opening movement, and pronunciation of the second /a/ vowel. The durations of the participants' T1-, T0-, and T2-phases were measured. The mean values of F1 and F2 of the participants and of the actor during the T1-phase and the T2-phase were calculated. Finally, the participants' and the actor's mean voice intensity during pronunciation of the string of phonemes was measured. The mean F1 of the actor's voice was 820, 721, and 746 Hz and the mean F2 was 1,330, 1,393, and 1,429 Hz when pronouncing ABA, ADA, and AGA, respectively. Intensity was on average 54.9 dB.
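The formant and intensity measurements just described were made with the PRAAT program; as a sketch of how the same quantities could be obtained programmatically, the snippet below uses parselmouth, a Python interface to Praat, as a stand-in for the Praat GUI. The phase boundaries t1_end and t0_end, the 5 ms sampling grid, and the assumption that the recording is trimmed to a single utterance are assumptions introduced here, not details from the paper.

```python
# Sketch only: mean F1/F2 during the T1- and T2-phases and mean intensity for
# one recorded utterance. t1_end and t0_end are the hand-marked ends of the
# T1- and T0-phases (s); t_end is the end of the utterance.
import numpy as np
import parselmouth

def phase_formants(wav_path, t1_end, t0_end, t_end, step=0.005):
    snd = parselmouth.Sound(wav_path)
    formant = snd.to_formant_burg(time_step=step)
    intensity = snd.to_intensity()

    def mean_formant(number, start, stop):
        times = np.arange(start + step, stop, step)
        values = [formant.get_value_at_time(number, t) for t in times]
        values = [v for v in values if not np.isnan(v)]   # skip unvoiced frames
        return float(np.mean(values)) if values else float("nan")

    return {
        "T1_F1": mean_formant(1, 0.0, t1_end),
        "T1_F2": mean_formant(2, 0.0, t1_end),
        "T2_F1": mean_formant(1, t0_end, t_end),
        "T2_F2": mean_formant(2, t0_end, t_end),
        # Simple average of the dB track; Praat's energy-weighted mean differs slightly.
        "mean_intensity_dB": float(intensity.values.mean()),
    }
```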
Fig. 2 Examples of spectrograms during pronunciation of the ABA, ADA, and AGA strings of phonemes in experiments 1, 2, and 3. T1-phase: pronunciation of the first /a/ vowel and the formant transition before mouth occlusion. T0-phase: mouth occlusion. T2-phase: formant transition during release of mouth occlusion, and pronunciation of the second /a/ vowel. F1: formant 1; F2: formant 2

In experiment 1, statistical analyses were carried out on the lip kinematics and the voice spectra of the pronunciation of ABA, ADA, and AGA in order to detect differences in lip kinematics and voice spectra among the three strings of phonemes. In experiments 2 and 3, the statistical analyses compared the lip kinematics and voice spectra of the strings of phonemes pronounced in the congruent audiovisual presentation with those pronounced in the incongruent audiovisual presentation. The aim was to verify whether the string of phonemes in the incongruent condition differed from the corresponding string of phonemes in the congruent condition and, if so, in which direction it changed. The experimental design included string of letters or phonemes (ABA, ADA, AGA, and, in experiments 2 and 3, the string of phonemes pronounced in the incongruent audiovisual presentation) as a within-subjects factor for maximal lip aperture, lip closure, peak velocity of lip opening, and voice intensity. For F1 and F2 it included string of letters or phonemes (same levels as above) and phase (T1 and T2). Finally, for the time course of the formants it included string of letters or phonemes (same levels as above) and phase (T1, T0, and T2). The latter analysis aimed to detect differences in the duration of vowel (including formant transition) and consonant pronunciation between the strings of phonemes pronounced in the congruent and incongruent audiovisual presentations. Separate ANOVAs were carried out on the mean values of the participants' parameters. The Newman-Keuls post hoc test was used (significance level set at P