Audio-visual identification of place of articulation and voicing in white and babble noise a)

Magnus Alm b) and Dawn M. Behne
Department of Psychology, Norwegian University of Science and Technology, N-7491 Trondheim, Norway

Yue Wang
Department of Linguistics, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada

Ragnhild Eg
Department of Psychology, Norwegian University of Science and Technology, N-7491 Trondheim, Norway

(Received 25 June 2008; revised 15 April 2009; accepted 15 April 2009)

Research shows that noise and phonetic attributes influence the degree to which the auditory and visual modalities are used in audio-visual speech perception (AVSP). Research has, however, mainly focused on white noise and single phonetic attributes, thus neglecting the more common babble noise and possible interactions between phonetic attributes. This study explores whether white and babble noise differentially influence AVSP and whether these differences depend on phonetic attributes. White and babble noise of 0 and −12 dB signal-to-noise ratio were added to congruent and incongruent audio-visual stop consonant-vowel stimuli. The audio (A) and video (V) of incongruent stimuli differed either in place of articulation (POA) or voicing. Responses from 15 young adults show that, compared to white noise, babble resulted in more audio responses for POA stimuli, and fewer for voicing stimuli. Voiced syllables received more audio responses than voiceless syllables. Results can be attributed to discrepancies in the acoustic spectra of both the noise and the speech target. Voiced consonants may be more auditorily salient than voiceless consonants, which are more spectrally similar to white noise. Visual cues contribute to identification of voicing, but only if the POA is visually salient and auditorily susceptible to the noise type. © 2009 Acoustical Society of America. [DOI: 10.1121/1.3129508]

PACS number(s): 43.71.Sy [AJ]

Pages: 377–387

I. INTRODUCTION

Audio-visual speech perception (AVSP) denotes the human ability to benefit from both the auditory and visual modality in perceiving and interpreting speech. Research has shown that the visual modality plays an important role in the perception of speech (e.g., Green and Kuhl, 1989; Massaro, 1987; McGurk and MacDonald, 1976), and its contribution is emphasized in conditions where the speech signal is degraded by noise (e.g., Erber, 1969; MacLeod and Summerfield, 1987; Sumby and Pollack, 1954). Previous research on AVSP has focused on the effect of white noise (e.g., Dodd, 1977; Fixmer and Hawkins, 1998), an auditory distractor infrequently present in everyday human speech communication, while the effect of babble noise is much less explored (e.g., Behne et al., 2006; Markides, 1989; Wang et al., 2008). The phonetic attributes place of articulation (POA) and voicing differ in susceptibility to noise (e.g., Miller and Nicely, 1955). However, whereas the effect of noise level on these attributes is well established (e.g., Jiang et al., 2006; Miller and Nicely, 1955), the effect of different types of noise is not thoroughly explored. Research shows that speech perception proficiency by hearing-impaired and second-language learners improves with training that directs visual attention to the specific cues relevant to the different attributes of speech (e.g., Lansing and McConkie, 1999; Massaro and Light, 2004; Wang et al., 2008). The current study contributes to identifying where visual attention should be directed for POA and voicing in babble noise. Consequently, the present study investigates the influence of different types and levels of noise on speech perception, using audio-visual (AV) stimuli that differ in AV saliency (i.e., POA and voicing).

By using AV stimuli in which auditory syllables were dubbed onto visual syllables with different POAs, McGurk and MacDonald (1976) elegantly demonstrated vision's contribution to speech perception. Not only did responses frequently correspond to the visual component, but the AV cues were also often fused and hence perceived as a percept intermediate to the auditory and visual components (i.e., AV-fusion). These findings show the bimodal nature of speech perception and are supported by later studies (e.g., Green and Kuhl, 1989; MacDonald and McGurk, 1978; Massaro, 1987; Summerfield and McGrath, 1984). The availability of bimodal information renders speech perception less vulnerable to white noise (e.g., Erber, 1969; MacLeod and Summerfield, 1987; Sumby and Pollack, 1954). MacLeod and Summerfield (1990) demonstrated that in white noise AV cues improve signal-to-noise ratio (SNR) thresholds by 6.1 dB compared to auditory speech alone.

a) Portions of this work were presented at Acoustics'08 and Interspeech 2008. This work is based on Magnus Alm's MA thesis, Norwegian University of Science and Technology (NTNU), Trondheim, Norway, 2007.
b) Author to whom correspondence should be addressed. Electronic mail: [email protected]


Since optimal hearing conditions greatly favor the auditory cues in AVSP, visual and AV-fusion responses are more likely to occur in noisy environments (e.g., Dodd, 1977; Easton and Basala, 1982; Ross et al., 2007). However, whereas an increase in auditory noise results in a proportional shift toward visual responses, the relationship between noise level and AV-fusions is not linear (e.g., Ross et al., 2007; Sommers et al., 2005). Previous research (e.g., Dodd, 1977; Easton and Basala, 1982; Fixmer and Hawkins, 1998) demonstrates that moderate white noise facilitates AV-fusion responses, whereas extremely positive or negative SNRs favor the auditory or visual modality, respectively (Ross et al., 2007).

Most studies on AVSP in noise referred to above used white noise. White noise has a flat power spectral density, meaning equal intensity levels across frequencies in a given band, while babble noise is a fluctuating signal with a low-frequency dominated spectral shape. Thus, given the same energy, white noise covers a wider range of frequencies than babble noise at any given time. The masking effect of noise is partly determined by the frequency overlap between a target stimulus and the noise, and the size and frequency of uninterrupted speech intervals influence the intelligibility of the target signal in noise (i.e., the glimpsing effect) (Cooke, 2006; Miller and Licklider, 1950).

To what extent noise influences the different modalities' contribution to AVSP depends on the characteristics of the speech signal (e.g., Binnie et al., 1974). When perceiving speech, phonetic attributes such as POA and voicing have different susceptibilities to noise. Miller and Nicely (1955) showed that for auditory signals, POA identification is far more susceptible to noise than voicing identification: whereas the identification of POA suffers at 6 dB SNR, voicing identification is robust at −12 dB SNR. These findings are supported by Jiang et al. (2006), who found that voicing identification remained above chance down to a SNR of −15 dB.

For stop consonants, the acoustic cues important for POA and voicing identification differ. For POA identification the formant transition is deemed the most important cue (Blumstein et al., 1982; Delattre et al., 1955; Stevens and Blumstein, 1978), whereas for identification of voicing, voice onset time (VOT) is considered the most important cue (Eimas and Corbit, 1973). In distinguishing consonants, the formant transition involves subtle acoustic variations within a wide range of frequencies, whereas VOT is associated with temporal variation in the interval between consonant release and voice onset. An acoustically distinct event defines the end of VOT; that is, the speech signal shifts from a relatively flat intensity distribution across frequencies (aspiration) to a more fluctuating intensity distribution across frequencies (voice onset). This acoustic event is more temporally distinct than the subtle acoustic variations distinguishing different formant transitions and may contribute to voicing identification being less susceptible to noise than POA.

POA's lack of auditory robustness is greatly compensated for by salient visual cues; that is, seeing the face of the speaker greatly aids perceivers in identifying POA in noise (e.g., Binnie et al., 1974). This visual benefit is not found for voicing identification (e.g., Behne et al., 2006; Binnie et al., 1974). Behne et al. (2006) found that responses to incongruent AV syllables that varied in terms of voicing always matched the auditory component, independent of which component of the stimulus was voiced and independent of the presence of noise. The lack of visual access to the activity of the vocal folds may explain the poor visual contribution to voicing identification.

The current study employs the McGurk paradigm (McGurk and MacDonald, 1976) to explore whether white and babble noise influence the use of the auditory and visual modality differently, and whether these noise type differences depend on which phonetic attribute is being assessed. White noise is widely used in AV research, while babble noise, arguably a more common speech interference, is seldom used. Babble noise differs from white noise acoustically and possibly semantically. While phonetic attributes' susceptibility to different noise levels is well established (e.g., Jiang et al., 2006; Miller and Nicely, 1955), their susceptibility to different noise types is less explored. Two hypotheses are tested. First, white noise is expected to result in fewer auditory and more visual and AV-fusion responses than babble noise, because white noise occludes more of the target signal than babble noise, that is, covers more frequencies. Second, white noise is expected to result in fewer auditory and more visual responses than babble noise for the noise-susceptible POA identification, while no such noise type difference is expected for the noise-robust voicing identification.

II. METHODS

A. Design

Responses were collected for stimulus presentations consisting of congruent and incongruent AV stimuli containing stop-vowel syllables, varying in terms of POA, voicing, noise type, and noise level, using a repeated measures design.

B. Participants

Fifteen native Norwegian speakers between 20 and 31 years of age (M = 24, SD = 3) participated in the experiment; seven were male and eight were female. The participants were students recruited from the Norwegian University of Science and Technology (NTNU) and Sør-Trøndelag University College (HiST). All participants reported normal hearing and normal or corrected-to-normal vision.

C. Stimuli

As shown in Table I, six congruent and ten incongruent AV stimuli were used in the experiment. The congruent stimuli served as a control to measure the general AV intelligibility of the syllables across noise conditions (i.e., different noise types and noise levels). The incongruent stimuli were used to test the hypotheses. The AV stimuli were made from audio and visual recordings of six monosyllables that differed in POA and voicing: labial /ba/ and /pa/, alveolar /da/ and /ta/, and velar /ga/ and /ka/.


TABLE I. The six congruent and the ten incongruent AV syllables used as test materials. The first four incongruent syllables differ in POA; the remaining six differ in voicing.

Congruent stimuli
  Auditory   Visual
  ba         ba
  pa         pa
  da         da
  ta         ta
  ga         ga
  ka         ka

Incongruent POA stimuli
  Auditory   Visual
  ba         ga
  pa         ka
  ga         ba
  ka         pa

Incongruent voicing stimuli
  Auditory   Visual
  ba         pa
  pa         ba
  da         ta
  ta         da
  ga         ka
  ka         ga

As shown in Table I, congruent stimuli refer to stimuli in which the audio and visual components correspond, and incongruent stimuli refer to stimuli in which the audio and visual components differ. For incongruent POA stimuli, the audio and visual components differ in POA, whereas for incongruent voicing stimuli the audio and visual components differ in voicing. Both POA and voicing stimuli included alternative stimulus structures: the POA stimulus structure was either A(labial)V(velar) or A(velar)V(labial), and the voicing stimulus structure was either A(voiced)V(voiceless) or A(voiceless)V(voiced). A POA stimulus was either voiced or voiceless, whereas a voicing stimulus was either labial, alveolar, or velar. Therefore, the POA stimuli revealed participants' AV perception of POA, as well as the effect of consonant voicing on this POA perception. The voicing stimuli revealed participants' AV perception of voicing, as well as the effect of consonant POA on this voicing perception. Four noise backgrounds were added to the congruent and incongruent AV stimuli: two intensity levels each of white and babble noise. This resulted in four noise conditions in addition to the quiet condition, as sketched below.
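To make the composition of the stimulus set concrete, the following Python sketch enumerates the 16 AV pairings of Table I and crosses them with the five auditory backgrounds. This is our illustration only; the identifiers and data structures are assumptions, not part of the study's materials.

from itertools import product

# Syllables grouped by POA; within each pair the first is voiced,
# the second voiceless (Table I).
SYLLABLES = {"labial": ("ba", "pa"),
             "alveolar": ("da", "ta"),
             "velar": ("ga", "ka")}

# Six congruent stimuli: audio and video are the same syllable.
congruent = [(s, s) for pair in SYLLABLES.values() for s in pair]

# Four incongruent POA stimuli (audio, video): labial/velar exchanges
# with voicing held constant.
poa_incongruent = [("ba", "ga"), ("pa", "ka"), ("ga", "ba"), ("ka", "pa")]

# Six incongruent voicing stimuli: audio and video differ only in voicing.
voicing_incongruent = [(v, vl) for (v, vl) in SYLLABLES.values()]
voicing_incongruent += [(vl, v) for (v, vl) in SYLLABLES.values()]

av_stimuli = congruent + poa_incongruent + voicing_incongruent   # 16 pairings

# Five auditory backgrounds: quiet plus two noise types at two SNRs.
backgrounds = ("quiet", "white_0dB", "white_-12dB", "babble_0dB", "babble_-12dB")
stimuli = list(product(av_stimuli, backgrounds))                 # 16 x 5 = 80
assert len(stimuli) == 80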

1. AV recordings

The current study used AV recordings made for a previous study (see Behne et al., 2006; Behne et al., 2007). The AV recordings of a male speaker were made in the Speech Laboratory at the Department of Psychology, NTNU. The speaker had an urban Eastern-Norwegian dialect familiar to most Norwegians. The speaker was clean shaven, and any artificial distractors, such as glasses and jewelry, were removed prior to video recording. The speaker was instructed to keep a relatively flat intonation, avoiding a decline or incline at the end of a syllable. He was also told to keep facial gestures, such as eye blinks, to a minimum. The speaker was seated inside a sound-insulated room with a Sony DCR-TRV50E camera positioned 90 cm in front of him and a Røde NT3 microphone positioned 50 cm to the left of and 10 cm above his head. Two parallel audio recordings were made: one from the video camera's internal microphone and one from the external microphone (i.e., the Røde). The sound from the external microphone was fed through an M-Audio Firewire 1814 box to an Apple Macintosh G5 computer, where two audio channels were recorded at a sampling rate of 44.1 kHz using PRAAT version 4.3.22 (Boersma and Weenink, 2006).

To avoid any visual distractions in the stimuli, the speaker was seated with his back toward a monotonous gray wall. His face was centered in the camera frame, leaving 4 in. of empty space on either side of the cheeks and 1 in. of empty space above the head when the video was displayed on a 17 in. computer monitor. The scene was thus believed to represent the visual field common to participants engaged in natural face-to-face communication.

The syllables used in the experiment consisted of six consonants, /b, d, g, p, t, k/, followed by the vowel /a/. The speaker repeated each syllable eight times to create a set of alternatives from which one with good auditory and visual quality could be selected. The resulting QuickTime video file had a visual quality of 30 frames/s at a resolution of 640 × 480 pixels. The MPEG4/H264 video compression algorithm was used at a bit-rate of 57 600 kbps. The video file was segmented into separate syllables using the software iMovie HD 5.0.2. The audio files derived from the external microphone were segmented using PRAAT (Boersma and Weenink, 2006). The segmented video and audio files were rated independently by two different persons. Highly rated video segments were those in which syllable articulations were explicit and eye blinks or other unwanted facial gestures few. A highly rated audio segment implied a natural syllable pronunciation and a relatively even intonation, accompanied by no unwanted noise, such as that from movement in the recording environment. The six required AV syllables were selected based on the highest additive audio and visual ratings.

The audio segments were edited in PRAAT to ensure the same unweighted intensity for all syllables, and the corresponding video clips were cut in iMovie HD to match the length of the longest video syllable (1960 ms). The auditory speech signals had an average length of 423 ms (range 377–468 ms), measured from consonant release for voiceless syllables and from onset of the voice bar for voiced syllables. Voiced syllables were measured from the onset of the voice bar in order to take into account an instance of prevoicing. The auditory speech signals were initiated 573 ms after the onset of the video clips, enabling initiation of the noise signals 573 ms prior to the auditory speech signals and thus avoiding artifacts caused by sudden onset of noise.
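The intensity equalization itself was done in Praat. As a rough illustration of the operation, a minimal Python analogue is sketched below; the file handling, function names, and target level are our assumptions, not the study's script.

import numpy as np
import soundfile as sf  # assumed available for WAV input/output

def rms_level_db(x):
    # Unweighted RMS level in dB re full scale.
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)))

def equalize_intensity(paths, target_db=-26.0):
    # Scale each syllable so that all have the same unweighted RMS level.
    # target_db is an arbitrary working level chosen for illustration.
    for path in paths:
        x, fs = sf.read(path)
        gain = 10.0 ** ((target_db - rms_level_db(x)) / 20.0)
        sf.write(path.replace(".wav", "_eq.wav"), x * gain, fs)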

2. Noise signal

Two types of noise were used in the experiment: babble noise and white noise. The babble noise was recorded during lunchtime in a cafeteria at NTNU, using an Okay II DM-801 microphone connected to an SHG Note 40750 laptop via its built-in soundcard, at a sampling frequency of 44.1 kHz. A segment of the recording was extracted in which babble was prominent and other sounds, such as coughs and the rattling of cutlery, were minimal. No individual voices could be discerned in the babble segment. The white Gaussian noise used in the experiment was generated using the "Create sound" function in PRAAT (Boersma and Weenink, 2006). The babble and white noise segments were cut to a length of 1960 ms, equaling the length of the video clips.

Two SNRs were used in the experiment. SNR was calculated by subtracting the mean noise intensity level from the mean speech signal intensity level. Different phonetic attributes, in this respect POA and voicing, are associated with different vulnerabilities to noise. A prior study by Behne et al. (2006) showed that for AV stimuli differing in POA, babble noise at a SNR of 0 dB led to an increase in the use of the visual modality, whereas no such shift was evident for AV stimuli that vary in voicing at this noise level. These results are supported by a firm body of research (e.g., Jiang et al., 2006; Miller and Nicely, 1955). To further assess the AV benefit for voicing identification, a noise level at which visual responses to voicing stimuli occur had to be established. Miller and Nicely (1955) demonstrated that auditory voicing perception is robust at SNRs down to −12 dB, whereas Jiang et al. (2006) showed that a SNR of −15 dB resulted in rather arbitrary auditory responses. The interval between −12 and −15 dB SNR was therefore considered. A pilot study with seven participants indicated a shift in use of modality near −12 dB SNR. Furthermore, this level was associated with predictable responding; that is, most participant responses corresponded either to the visual or the auditory part of the signal and were hence not merely guesses. The use of noise at 0 and −12 dB SNR therefore seemed reasonable. The two noise intensity levels were adjusted to 0 and −12 dB (relative levels) using PRAAT (Boersma and Weenink, 2006), resulting in five auditory backgrounds: white and babble noise at 0 dB SNR, white and babble noise at −12 dB SNR, and quiet.
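The sketch below illustrates the two operations just described: generating Gaussian white noise and scaling a noise segment so that the mean speech level minus the mean noise level equals the target SNR. It is a minimal Python analogue of the Praat processing, not the study's actual script; the placeholder signals are our assumptions.

import numpy as np

rng = np.random.default_rng(seed=0)

def white_noise(n_samples):
    # Gaussian white noise; our stand-in for Praat's "Create sound"
    # with a random Gaussian formula.
    return rng.standard_normal(n_samples)

def mean_level_db(x):
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)))

def add_noise_at_snr(speech, noise, snr_db):
    # Scale the noise so that mean speech level minus mean noise level
    # equals snr_db, then mix.
    gain_db = mean_level_db(speech) - snr_db - mean_level_db(noise)
    return speech + noise * 10.0 ** (gain_db / 20.0)

# Example: a (hypothetical) 423 ms syllable mixed with a noise
# background trimmed to the stimulus length, at 0 and -12 dB SNR.
fs = 44100
syllable = white_noise(int(0.423 * fs))   # placeholder for real speech
babble = white_noise(int(1.960 * fs))     # placeholder for recorded babble
for snr in (0, -12):
    mixed = add_noise_at_snr(syllable, babble[:len(syllable)], snr)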

3. Assembling stimuli

As shown in Table I, six congruent and ten incongruent AV stimuli were used in the experiment. The congruent stimuli were created by importing the video clips into iMovie HD and substituting the auditory speech signals recorded by the video camera's internal microphone with the corresponding auditory speech signals from the same syllable recorded by the external microphone. The incongruent stimuli were produced in the same manner, except that the original auditory speech signals were substituted with auditory speech signals that differed in POA or voicing. The auditory and visual signals were temporally aligned by synchronizing the consonant burst in the acoustic signal with the mouth opening in the video clip. The resulting congruent and incongruent AV syllables constituted the quiet condition of the experiment.

The four noise segments (i.e., babble and white noise at 0 and −12 dB SNR) had the same length as the film clips, ensuring that the noise covered the entire syllable. The noise segments were imported into iMovie HD and added to the existing congruent and incongruent AV syllables. To counteract the differences in total energy imposed by the different noise levels, the new audio files were adjusted in PRAAT so that each stimulus had the same average intensity.

FIG. 1. Spectrum levels for the two categories of speech stimuli and the two noise types. For the speech stimuli, levels represent the average of the three voiced syllables and the three voiceless syllables, respectively, within a time window of 100 ms starting at the burst. Levels are averaged across 1/3-octave bands. The level reference is arbitrary but the same for all curves.

Figure 1 illustrates spectrum levels for the syllables and noise types for a 100 ms time window starting at the consonant burst. For Norwegian stop consonants in initial position, aspiration is the most prominent voicing distinction, with unaspirated stops (/ba, da, ga/) generally having a short voice onset lag and aspirated stops (/pa, ta, ka/) having a long voice onset lag (Halvorsen, 1998). These differences in VOT contribute to explaining why the average of the voiced syllables (/ba, da, ga/) has a higher overall intensity than the average of the voiceless syllables (/pa, ta, ka/) in this particular time segment. Figure 1 also shows that white noise has a flat spectrum, whereas the babble noise spectrum decreases as frequency increases. The 16 different AV syllables in five different auditory backgrounds resulted in a total of 80 different stimuli.
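The 1/3-octave band levels shown in Fig. 1 can be approximated with a straightforward FFT-based analysis. The sketch below is our assumption of such a computation; the paper does not specify the exact analysis tool beyond the caption.

import numpy as np

def third_octave_levels(x, fs, f_min=100.0, f_max=10000.0):
    # Average spectrum levels in 1/3-octave bands (arbitrary reference),
    # a rough approximation of the analysis behind Fig. 1.
    power = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    centers, levels = [], []
    fc = f_min
    while fc <= f_max:
        lo, hi = fc / 2 ** (1 / 6), fc * 2 ** (1 / 6)   # band edges
        band = power[(freqs >= lo) & (freqs < hi)]
        if band.size:
            centers.append(fc)
            levels.append(10.0 * np.log10(np.mean(band)))
        fc *= 2 ** (1 / 3)                              # next band center
    return np.array(centers), np.array(levels)

# e.g., for a 100 ms window starting at the consonant burst:
# centers, levels = third_octave_levels(x[burst:burst + int(0.1 * fs)], fs)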

D. Procedure

The experiment was carried out in the Speech Laboratory at the Department of Psychology, NTNU. To mimic circumstances typical of face-to-face communication, participants were seated facing 17 in. monitors (1440 × 900 pixels) at approximately 50 cm distance. Audio signals were conveyed through AKG K271 closed dynamic circumaural studio headphones, and the presentation level was fixed for all participants at 68 dBA (corresponding to a frontally incident free-field sound pressure level of around 68 dBA).

Participants were presented four stimulus blocks, each containing the 80 different stimuli. Thus every distinct stimulus appeared four times during the experiment, but only once in a block. The stimuli in each block were randomized. Participants were told to indicate which of the six alternative syllables (ba, da, ga, ka, pa, and ta) best corresponded to the syllable perceived. Because of possible perceived ambiguity due to incongruent AV signals or noise, the participants were told that no wrong answers existed. The participants were frequently reminded to look at the talker's face throughout the entire duration of every clip to ensure that they received both auditory and visual input. The experiment took approximately 1 h, with a 10 min break halfway.
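For concreteness, the block structure of the presentation can be sketched as follows. This minimal Python illustration assumes that the blocks were shuffled independently; the study does not detail its randomization procedure.

import random

RESPONSE_ALTERNATIVES = ("ba", "da", "ga", "ka", "pa", "ta")

def presentation_order(stimuli, n_blocks=4, seed=1):
    # Each block contains all 80 stimuli exactly once, in a fresh
    # random order; blocks are presented back to back.
    rng = random.Random(seed)
    trials = []
    for _ in range(n_blocks):
        block = list(stimuli)
        rng.shuffle(block)
        trials.extend(block)
    return trials  # 4 x 80 = 320 trials per participant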

III. RESULTS

A. Comparisons with audio-only and AV-congruent stimuli

Audio-only (AO) and AV-congruent results were included in order to (1) establish whether the syllables used were good tokens of their respective categories, (2) address whether the babble noise unique to this study affects syllable perception in the absence of visual influence, and (3) assess whether responses to the AV-incongruent stimuli were likely to be based on chance. Results from the AO condition were taken from a parallel study using the same stimuli as the current study. The two participant groups were recruited from the same university population, were comparable in age (current study: M = 24, SD = 3; parallel study: M = 23, SD = 4), and were equally balanced for gender. The two participant groups also had almost identical response patterns in the AV-incongruent conditions (e.g., mean percent A-match for POA stimuli was 44.4% and 45% [F(1,43) = 0.036, p > 0.85]).

FIG. 2. Mean percent correct responses in the AO and AV-congruent conditions in quiet and in babble noise at 0 and −12 dB SNR. Mean percent auditory responses for all AV-incongruent stimuli in quiet and in babble noise at 0 and −12 dB SNR are also included. The AV-congruent and AV-incongruent responses were collected in the current study, whereas the AO responses were collected in a parallel study.

Figure 2 shows percent correct responses from the AO and AV-congruent conditions collapsed across all syllables. AO results show that the auditory stimuli are good tokens of their respective categories, with participants getting nearly all responses correct in quiet and in babble noise at 0 dB SNR. The sharp drop in correct responses at −12 dB SNR in the AO condition is greatly compensated for by relevant visual information in the AV-congruent condition. This indicates that the visual components are also good tokens of their respective categories. Results from the AO condition further indicate that the auditory perception of syllables is robust in babble noise: even at a SNR of −12 dB, participants responded well above chance, with 52% correct.

The AV-incongruent stimuli did not allow for correct or incorrect responses; the response options either matched the visual component, matched the audio component, or were intermediate to the two. In the AO and AV-congruent conditions, participants had a 17% (100% divided by six response alternatives) possibility of giving a correct response by chance. The high correct response percentages obtained for the AO and AV-congruent stimuli render it unlikely that the observed differences for the incongruent stimuli are due to chance responses.

Figure 2 also includes the overall percentage of auditory responses for all AV-incongruent stimuli in the current study. The pattern of responses shows that reliance on auditory cues is negatively influenced by incongruent visual cues. Compared to the AO condition, the introduction of incongruent visual information decreased the overall reliance on the auditory input by 5.5% in quiet and by 10.7% in babble noise at 0 dB SNR. In babble noise at −12 dB SNR, the AV-incongruent and AO stimuli both received 52% audio responses, although the AV-incongruent mean had a larger variance (SD = 41) than the AO mean (SD = 29). The strength and variation of this visual influence are considered in detail in the analyses of the AV-incongruent stimuli, where it is used as a means of assessing the relative influence of white and babble noise on AV perception of POA and voicing.

B. AV-incongruent stimuli

The incongruent stimuli consisted of POA and voicing stimuli and were used to assess shifts in the contribution of the modalities in noise. For POA stimuli, responses were coded based on whether they matched the stimulus's audio component (A-match), matched its visual component (V-match), or were intermediate to the audio and visual components (AV-fusion). For voicing stimuli, responses were coded based on whether they matched the stimulus's audio component (A-match) or its video component (V-match). A match for a POA stimulus implied correspondence between response and stimulus in terms of POA, whereas a match for a voicing stimulus implied correspondence between response and stimulus in terms of voicing; thus a "perfect" match (i.e., a response matching the stimulus in both POA and voicing) was not required for either POA or voicing stimuli.
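The coding scheme can be made explicit with a small sketch. The helper functions below are our illustration of the rules just described (the names are hypothetical); matching is evaluated on the relevant attribute only.

POA = {"b": "labial", "p": "labial", "d": "alveolar",
       "t": "alveolar", "g": "velar", "k": "velar"}
VOICED = {"b": True, "d": True, "g": True,
          "p": False, "t": False, "k": False}

def code_poa_response(audio, video, response):
    # Match on POA only; e.g., response "pa" A-matches audio "ba".
    r = POA[response[0]]
    if r == POA[audio[0]]:
        return "A-match"
    if r == POA[video[0]]:
        return "V-match"
    return "AV-fusion"  # intermediate POA, e.g., alveolar for a labial/velar pair

def code_voicing_response(audio, video, response):
    # Match on voicing only; the two categories are exhaustive here.
    return "A-match" if VOICED[response[0]] == VOICED[audio[0]] else "V-match"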

1. Quiet condition

For the incongruent POA and voicing stimuli, the response patterns in the quiet condition constitute baseline AV perception and are considered points of reference for the effects of noise. The results of the quiet condition for the incongruent POA stimuli are illustrated in Fig. 3. In quiet, 84% of the responses matched the auditory component, compared to an overall 45% in babble noise and 23% in white noise. Paired t-tests with Bonferroni-corrected p-values revealed that participants gave significantly more A-match responses in quiet than in babble noise at 0 dB SNR (p < 0.004), babble noise at −12 dB SNR (p < 0.004), white noise at 0 dB SNR (p < 0.004), and white noise at −12 dB SNR (p < 0.004). No significant differences between voiced and voiceless stimuli were obtained in the quiet condition.

FIG. 3. Mean percent responses for A-match, V-match, and AV-fusion for incongruent POA stimuli for voiced and voiceless consonants presented in quiet and in babble and white noise at 0 and −12 dB SNR.


For incongruent voicing stimuli presented in quiet, 100% of the responses matched the audio component, a result identical to that obtained for white and babble noise at 0 dB SNR.

2. POA stimuli

Data for the POA stimuli (Table I) were analyzed with three repeated measures analyses of variance (ANOVAs) in which noise type (white and babble), noise level (0 and −12 dB SNR), voicing (voiced and voiceless), and POA stimulus structure (A(labial)V(velar), A(velar)V(labial)) were independent variables and A-match, V-match, and AV-fusion were the dependent variables. Results from these analyses are reported in Table II. Results for POA stimulus structure are not discussed in the current article. Post hoc analyses (paired-samples t-tests) were performed using the Bonferroni correction for multiple comparisons and a 5% rejection level.
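A minimal sketch of one such analysis, assuming the responses are in a long-format pandas table with one row per participant and condition (the column names are our assumptions; the study does not report its analysis software):

from scipy import stats
from statsmodels.stats.anova import AnovaRM

def rm_anova(df, depvar):
    # df: pandas DataFrame with columns "subject", "noise_type",
    # "noise_level", "voicing", "structure", and the dependent variable
    # (percent responses in the given category). One ANOVA is run per
    # dependent variable (A_match, V_match, AV_fusion).
    return AnovaRM(df, depvar=depvar, subject="subject",
                   within=["noise_type", "noise_level", "voicing", "structure"],
                   aggregate_func="mean").fit()

def bonferroni_paired_t(a, b, n_comparisons, alpha=0.05):
    # Paired-samples t-test with Bonferroni correction at a 5% level.
    t, p = stats.ttest_rel(a, b)
    p_adj = min(1.0, p * n_comparisons)
    return t, p_adj, p_adj < alpha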

TABLE II. Statistical results for the three repeated measures ANOVAs for the POA stimuli, showing main effects and interactions of noise type, noise level, and voicing for A-match, V-match, and AV-fusion.

Place of articulation stimuli

                                              A-match            V-match            AV-fusion
Factor(s)                             df      F        p         F        p         F        p
Noise type                            1,14    196.98   <0.001    80.71    <0.001    2.63     n.s.
Noise level                           1,14    117.09   <0.001    206.92   <0.001    2.01     n.s.
Voicing                               1,14    53.80    <0.001    135.52   <0.001    22.15    <0.001
Noise type × noise level              1,14    0.44     n.s.      1.07     n.s.      4.55     n.s.
Noise type × voicing                  1,14    13.13    <0.003    0.54     n.s.      5.52     <0.034
Noise level × voicing                 1,14    0.00     n.s.      0.05     n.s.      0.11     n.s.
Noise type × noise level × voicing    1,14    14.30    <0.002    31.17    <0.001    13.13    <0.003

Figure 3 shows the percent A-match, V-match, and AV-fusion responses for voiced and voiceless stimuli in white and babble noise at 0 and −12 dB SNR. The analyses show significant main effects of noise type, noise level, and voicing for A-match and V-match. For AV-fusion, a significant main effect was obtained only for voicing. In addition, the three-way interaction between these factors was significant for A-match, V-match, and AV-fusion (see Table II).

(a) Noise type. The quality of the background noise clearly influenced the participants' reliance on the audio and visual components. Figure 3 shows that participants generally relied less on the audio component in white noise (M = 23%, SE = 2.74) than in babble noise (M = 45%, SE = 2.04). Post hoc analyses revealed that white noise resulted in significantly fewer A-match responses than babble noise for voiced (p < 0.024) and voiceless (p < 0.012) stimuli at 0 dB SNR, and for voiced stimuli at −12 dB SNR (p < 0.012), but not for voiceless stimuli at −12 dB SNR. Responses matched the visual component to a significantly greater extent in white noise (M = 66%, SE = 4.27) than in babble noise (M = 40%, SE = 3.55). Post hoc analyses revealed no significant noise type effect for voiced stimuli at 0 dB SNR, but white noise resulted in significantly more V-match responses than babble noise for voiceless stimuli at 0 dB SNR (p < 0.012) and for voiced stimuli at −12 dB SNR (p < 0.012).

(b) Noise level. Consistent with previous findings (e.g., Erber, 1969; MacLeod and Summerfield, 1987; Ross et al., 2007), noise level greatly influenced the way participants utilized audio and visual cues. The increase in noise level from 0 to −12 dB SNR shifted participant responses toward the visual component and away from the audio component, and this pattern was evident for both noise types and both voicing categories. As illustrated in Fig. 3, responses matched the audio component to a significantly greater extent at 0 dB SNR (M = 52%, SE = 3.66) than at −12 dB SNR (M = 15%, SE = 1.64). Post hoc analyses revealed that 0 dB SNR resulted in more A-match responses than −12 dB SNR for voiced (p < 0.012) and voiceless (p < 0.012) stimuli in white noise, and for voiced (p < 0.012) and voiceless (p < 0.012) stimuli in babble noise. The noise level effect on V-match was opposite that on A-match.

(c) Consonant voicing. As can be seen from Fig. 3, consonant voicing clearly influenced perception of the AV syllables. Surprisingly, voiced consonants resulted in more A-match responses (M = 45%, SE = 2.8) than did voiceless consonants (M = 23%, SE = 2.65). This result contrasts with previous findings (e.g., Behne et al., 2006; McGurk and MacDonald, 1976). The most remarkable observations in this respect are the size of the difference in mean A-match responses between voiced and voiceless consonants and the consistency of the voicing effect across conditions. Voiced stimuli resulted in significantly more A-match responses than voiceless stimuli in white noise at 0 dB SNR (p < 0.012), in babble noise at 0 dB SNR (p < 0.024), and in babble noise at −12 dB SNR (p < 0.012), but not in white noise at −12 dB SNR. Responses matched the visual component to a lesser extent for voiced consonants (M = 36%, SE = 3.76) than for voiceless consonants (M = 69%, SE = 4.05). Voiced stimuli resulted in fewer V-match responses than voiceless stimuli in white noise at 0 dB SNR (p < 0.012), in white noise at −12 dB SNR (p < 0.036), and in babble noise at −12 dB SNR (p < 0.012). The AV-fusion data replicate previous findings (e.g., Green and Kuhl, 1991; McGurk and MacDonald, 1976) demonstrating fewer AV-fusion responses for voiceless consonants (M = 9%, SE = 2.47) than for voiced consonants (M = 19%, SE = 3.42). However, the effect of consonant voicing on AV-fusion depended on noise type; that is, only white noise led to a significant voicing effect. Post hoc analyses showed reliably more AV-fusion responses for voiced than voiceless consonants in white noise at 0 dB SNR (p < 0.012).

(d) Summary of POA stimuli results. In general, participants gave fewer A-match and more V-match responses in white noise than in babble noise. The 0 dB SNR resulted in more A-match and fewer V-match responses than the −12 dB SNR. Considering consonant voicing, more A-match and AV-fusion responses and fewer V-match responses were given for voiced than for voiceless consonants.

3. Voicing stimuli

Data for the voicing stimuli (Table I) were analyzed with two repeated measures ANOVAs in which noise type (white and babble), noise level (0 and −12 dB SNR), POA (labial, alveolar, and velar), and voicing stimulus structure (A(voiced)V(voiceless), A(voiceless)V(voiced)) were independent variables and A-match and V-match were the dependent variables. The results of these analyses are reported in Table III. Results for voicing stimulus structure are not discussed in the current article. Note that only two response categories (i.e., A-match and V-match) were available for voicing stimuli, and any response had to fall into one of the two. Thus, once the percentage of A-match responses is known, the percentage of V-match responses is also known, and a significant effect for A-match implies a significant effect for V-match. Results are therefore described in detail only for A-match, with the opposite effect implied for V-match. Post hoc analyses (paired-samples t-tests) were performed using the Bonferroni correction for multiple comparisons and a 5% rejection level.

TABLE III. Statistical results for the two repeated measures ANOVAs for the voicing stimuli, showing main effects and interactions of noise type, noise level, and place of articulation for A-match and V-match.

Voicing stimuli

                                  Auditory or visual match
Factor(s)                         df      F        p
Noise type                        1,14    33.88    <0.001
Noise level                       1,14    100.46   <0.001
POA                               2,28    17.01    <0.001
Noise type × noise level          1,14    41.88    <0.001
Noise type × POA                  2,28    20.79    <0.001
Noise level × POA                 2,28    19.09    <0.001
Noise type × noise level × POA    2,28    18.97    <0.001

The analyses show main effects of noise type, noise level, and POA, in addition to an interaction between noise type, noise level, and POA (see Table III). Figure 4 depicts these results.

FIG. 4. Percent A-match and V-match responses for labial, alveolar, and velar stimuli presented in babble and white noise at −12 dB SNR.

(a) Noise level. In the 0 dB SNR condition, participants gave 100% A-match responses, whereas, as Fig. 4 illustrates, significantly fewer responses matched the auditory component at −12 dB SNR (M = 87%, SE = 1.27). Figure 4 clearly reveals the robustness of voicing in noise and hence replicates previous research (e.g., Jiang et al., 2006; Miller and Nicely, 1955). Post hoc analyses revealed that 0 dB SNR resulted in significantly more A-match responses than −12 dB SNR for alveolar stimuli in white noise (p < 0.024) and for labial (p < 0.024) and alveolar (p < 0.024) stimuli in babble noise.

(b) Noise type. As depicted in Fig. 4, responses matched the auditory component to a significantly greater extent in white noise (M = 97%, SE = 0.56) than in babble noise (M = 90%, SE = 0.76). No significant noise type differences were found at 0 dB SNR, where 100% A-match responses were observed across conditions. At −12 dB SNR, however, post hoc analyses revealed that white noise resulted in significantly more A-match responses than babble noise for labial stimuli (p < 0.024).

(c) POA. As shown in Fig. 4, the most A-match responses were observed for velar stimuli (M = 98%, SE = 0.6), followed by alveolar (M = 93%, SE = 1.03) and labial (M = 89%, SE = 1.48) stimuli. No significant POA differences were found at 0 dB SNR. Whereas POA differences in white noise at −12 dB SNR did not reach significance, post hoc analyses revealed significant POA differences in babble noise at −12 dB SNR: labial consonants resulted in fewer A-match responses than velar consonants (p < 0.024), and alveolar consonants resulted in fewer A-match responses than velar consonants (p < 0.024).

(d) Summary of voicing results. In general, significantly more A-match responses were obtained at 0 dB SNR than at −12 dB SNR. White noise generally resulted in more A-match responses than did babble noise. For POA, fewer A-match responses were found for labial stimuli than for alveolar and velar stimuli.

IV. DISCUSSION

The present study used white and babble noise to test AV perception of POA and voicing of stop consonants. POA and voicing stimuli were used because these phonetic attributes are known to be differentially susceptible to noise level (e.g., Jiang et al., 2006; Miller and Nicely, 1955) and thus possibly unequally susceptible to noise type. Results from the POA stimuli revealed that POA identification was more susceptible to white noise than babble noise, whereas the voicing stimuli showed that voicing identification was more susceptible to babble noise than white noise.

When the POA of the visual and audio components differed, white noise resulted in significantly fewer auditory responses (23%) than did babble noise (45%). POA's susceptibility to white noise is consistent with previous research (Jiang et al., 2006; Miller and Nicely, 1955) and supports the notion of a "glimpsing effect"; that is, white noise occludes the target signal more than babble noise because it leaves fewer acoustical windows open to the perceiver (e.g., Carhart et al., 1969; Cooke, 2006; Miller and Licklider, 1950). White noise has a flat intensity across frequencies and covers a wider frequency range than babble noise, and the speech signal is consequently less perceptually accessible in white noise than in babble noise.

The analysis of the noise type effect on voicing revealed an interesting new finding. For voicing stimuli, responses matched the auditory component to a greater extent in white noise than in babble noise. Previous research with AO stimuli has shown that perception of voicing is very robust in white noise and is nearly unaffected down to −15 dB SNR (Jiang et al., 2006). The results for white noise in the current study replicate these findings, with participants almost exclusively responding consistent with the auditory component (97%). For babble noise, however, only 90% of the responses matched the auditory component. This difference is substantial considering that voicing identification is known to be very robust in noise (Jiang et al., 2006; Miller and Nicely, 1955). The greater negative effect of babble noise on auditory responses challenges whether the glimpsing effect (i.e., the proportion of the frequency range that is masked) can fully account for the effect of noise.

The surprising difference between the effects of white and babble noise on POA and voicing identification can be considered from the perspective of peripheral masking. Peripheral masking takes place when noise and a target signal overlap in time and frequency, making parts of the target signal inaudible (Watson and Kelly, 1981). "Peripheral" denotes that this kind of interference degrades the clarity of the target signal at the auditory nerve and refers to the acoustical characteristics of the target signal and the noise. Both the phonetic attributes (i.e., POA and voicing) and the noise types (i.e., white and babble noise) differ in acoustical characteristics. The formant transitions between the consonant and the vowel in a consonant-vowel syllable yield important cues for identifying POA (Blumstein et al., 1982; Stevens and Blumstein, 1978), especially the F2 and F3 transitions (Delattre et al., 1955; Liberman et al., 1954). An essential cue for identification of voicing is VOT (e.g., Eimas and Corbit, 1973; Lisker and Abramson, 1964; Massaro and Oden, 1980), defined as the time that elapses between consonant release and the onset of vocal fold vibration (Lisker and Abramson, 1964). The aspiration associated with VOT is characterized by a flatter power spectral density than the formant transition, which in turn is characterized by high intensity levels at the frequencies associated with formants. Consequently, the transition between the aspiration and the voice onset is characterized by the signal shifting from a relatively flat intensity distribution across frequencies (aspiration) to a more fluctuating intensity distribution across frequencies (formants in connection with voice onset).

Babble noise has a fluctuating intensity with a low-frequency dominated spectral shape, while white noise is stationary and has a flat spectral shape, as illustrated in Fig. 1. The fluctuating nature of the babble implies that its momentary level may vary by several decibels above or below the average level, whereas the intensity of white noise varies little over time. In identification of voicing, babble noise may interfere more with the target signal than white noise because high energy densities at low frequencies obscure the transition between the aspiration and the voice onset in the target signal; that is, babble noise makes it harder to discern when the target signal shifts from a relatively flat intensity (aspiration) to a more fluctuating intensity (voice onset) by adding fluctuation to the aspiration. The onset of F1 (0–1000 Hz) may, for example, be hard to distinguish in babble noise, because babble noise has high energy density in the same frequency band as F1. Babble noise may thus obscure temporal differences in VOT between voiced and voiceless signals by making it harder to perceive the point in time at which the aspiration stops and the voicing starts.

Considering the effect of POA on perception of the voicing stimuli, an interesting pattern emerges. First, voicing is robust in noise, and hence only the −12 dB SNR revealed a significant POA effect. Whereas babble noise revealed significant differences in use of modality between labial, alveolar, and velar stimuli, no such differences were obtained for white noise. To the authors' knowledge, an effect of POA on use of modality in perception of voicing has not previously been reported, although some research (e.g., Lisker and Abramson, 1970) indicates an interaction between POA and voicing. In the current study labial consonants resulted in more visual responses than alveolar and velar consonants in babble noise at −12 dB SNR. Research (e.g., Benguerel and Pichora-Fuller, 1982; Walden et al., 1977) shows that labial consonants are more visually salient than velar consonants, since production of labial consonants provokes more prominent facial movements. Furthermore, research (Cooper et al., 1952; Liberman et al., 1952) has demonstrated that the spectrum of the stop burst influences the way POA is perceived. The burst spectrum of labial stops has low-frequency dominance, whereas the bursts of alveolar and velar stops have high- and mid-frequency dominance, respectively. Babble noise is characterized by high spectral densities at low frequencies and thus occludes the burst of labial stops more than that of alveolar and velar stops.

In the current study overwhelmingly more auditory responses were given when the POA stimuli were voiced than when they were voiceless: for voiced stimuli 45% of the responses were auditory, compared to only 23% for voiceless stimuli. This effect was consistent across noise conditions; that is, voiced POA stimuli resulted in more auditory responses than voiceless POA stimuli in both noise types and at both noise levels. How may voicing change the perception of POA? Jackson (2001) showed that at voice onset the auditory distinctiveness between voiced stop consonants is more prominent than it is between voiceless stop consonants. It may be argued that voiced stops are more auditorily prominent than voiceless stops because the production of voiced stops is characterized by vocal activity, absent in word-initial aspirated voiceless stops. The spectral distribution of aspiration in voiceless stops is in many respects reminiscent of white noise. White noise may therefore pose a more serious threat to accurate auditory perception of voiceless syllables than of voiced syllables, since the spectral similarity between the white noise and the target stimuli enhances the masking effect (i.e., peripheral masking) (e.g., Watson and Kelly, 1981). Indeed, fewer responses matched the voiceless auditory stimuli in white noise than in babble noise. This was especially true at 0 dB SNR, where only 27% of the responses matched the auditory component in white noise, while 53% did so in babble noise.

V. CONCLUDING REMARKS

The contribution of the auditory and visual modalities to speech perception is highly dependent on noise type and phonetic attributes. The two modalities are used to different degrees in white and babble noise, for voiced and voiceless consonants, and for labial, alveolar, and velar consonants. The current study took AVSP research further by assessing the effect of babble noise on speech perception. Previous research has mainly focused on white noise, although babble noise is arguably a more common distractor in natural speech communication. The results for noise type revealed that voicing identification is more susceptible to noise than predicted from research based solely on white noise. Surprisingly, and contrary to the findings for POA identification, the identification of voicing is susceptible to babble noise but not to white noise. The results also indicate an interdependency between voicing and POA in babble noise; that is, the use of modality in identification of voicing depends on the POA of the target signal. The difference between white and babble noise found in this study is most likely attributable to peripheral masking. This study supports previous research in demonstrating the robustness and flexibility of human speech perception and shows that these qualities are closely related to its natural bimodal character. An integrated, acute, and highly adaptable system of AVSP is demonstrated by the way changes in the speech signal and acoustical context affect the contribution of auditory and visual information to speech perception.

ACKNOWLEDGMENTS

The authors wish to thank Peter Svensson for his assistance with the acoustic analyses and for his judicious comments. They would also like to thank the Associate Editor Allard Jongman and the anonymous reviewers for their insightful comments and suggestions.

Behne, D., Wang, Y., Alm, M., Arntsen, I., Eg, R., and Valsø, A. (2006). "Audio-visual speech perception with age in quiet and café noise," poster presented at the Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan, Honolulu, HI.
Behne, D., Wang, Y., Alm, M., Arntsen, I., Eg, R., and Valsø, A. (2007). "Changes in audio-visual speech perception during adulthood," in Proceedings of the International Conference on Audio-Visual Speech Processing (AVSP), Hilvarenbeek, The Netherlands.
Benguerel, A. P., and Pichora-Fuller, M. K. (1982). "Coarticulation effects in lipreading," J. Speech Hear. Res. 25, 600–607.
Binnie, C. A., Montgomery, A. A., and Jackson, P. L. (1974). "Auditory and visual contributions to the perception of consonants," J. Speech Hear. Res. 17, 619–630.
Blumstein, S., Isaacs, E., and Mertus, J. (1982). "The role of the gross spectral shape as a perceptual cue to place of articulation in initial stop consonants," J. Acoust. Soc. Am. 72, 43–50.
Boersma, P., and Weenink, D. (2006). "Praat: Doing phonetics by computer (version 4.3.22)," from http://www.praat.org/ (last viewed February 6, 2006).
Carhart, R., Tillman, T. W., and Greetis, E. S. (1969). "Perceptual masking in multiple sound backgrounds," J. Acoust. Soc. Am. 45, 694–703.
Cooke, M. P. (2006). "A glimpsing model of speech perception in noise," J. Acoust. Soc. Am. 119, 1562–1573.
Cooper, F., Delattre, P., Liberman, A., Borst, J., and Gerstman, L. (1952). "Some experiments on the perception of synthetic speech sounds," J. Acoust. Soc. Am. 24, 597–606.
Delattre, P. C., Liberman, A. M., and Cooper, F. S. (1955). "Acoustic loci and transitional cues for consonants," J. Acoust. Soc. Am. 27, 769–773.
Dodd, B. (1977). "The role of vision in the perception of speech," Perception 6, 31–40.
Easton, R. D., and Basala, M. (1982). "Perceptual dominance during lip reading," Percept. Psychophys. 32, 562–570.
Eimas, P. D., and Corbit, J. D. (1973). "Selective adaptation of linguistic feature detectors," Cognit. Psychol. 4, 99–109.
Erber, N. P. (1969). "Interaction of audition and vision in the recognition of oral speech stimuli," J. Speech Hear. Res. 12, 423–425.
Fixmer, E., and Hawkins, S. (1998). "The influence of quality of information on the McGurk effect," in Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP), edited by D. Burnham, J. Robert-Ribes, and E. Vatikiotis-Bateson, Terrigal, Australia.
Green, K. P., and Kuhl, P. K. (1989). "The role of visual information in the processing of place and manner features in speech perception," Percept. Psychophys. 45, 34–42.
Halvorsen, B. (1998). "Timing relations in Norwegian stops," thesis, University of Bergen.
Jackson, P. J. B. (2001). "Acoustic cues of voiced and voiceless plosives for determining place of articulation," in Proceedings of the Workshop on Consistent and Reliable Acoustic Cues for Sound Analysis (CRAC 2001), Aalborg, Denmark, pp. 19–22.
Jiang, J., Chen, M., and Alwan, A. (2006). "On the perception of voicing in syllable-initial plosives in noise," J. Acoust. Soc. Am. 119, 1092–1105.
Lansing, C. R., and McConkie, G. W. (1999). "Attention to facial regions in segmental and prosodic visual speech perception tasks," J. Speech Lang. Hear. Res. 42, 526–539.
Liberman, A., Delattre, P., and Cooper, F. (1952). "The role of selected stimulus variables in the perception of unvoiced stop consonants," Am. J. Psychol. 65, 497–516.
Liberman, A. M., Delattre, P. C., Cooper, F. S., and Gerstman, L. J. (1954). "The role of consonant-vowel transitions in the perception of the stop and nasal consonants," Psychol. Monogr. 68, 1–13.
Lisker, L., and Abramson, A. S. (1964). "A cross-language study of voicing in initial stops: Acoustical measurements," Word 20, 384–422.
Lisker, L., and Abramson, A. S. (1970). "The voicing dimension: Some experiments in comparative phonetics," in Proceedings of the Sixth International Congress of Phonetic Sciences, Prague (Academia, Prague), pp. 563–567.
MacDonald, J., and McGurk, H. (1978). "Visual influences on speech perception processes," Percept. Psychophys. 24, 253–257.
MacLeod, A., and Summerfield, Q. (1987). "Quantifying the contribution of vision to speech perception in noise," Br. J. Audiol. 21, 131–141.
MacLeod, A., and Summerfield, A. Q. (1990). "A procedure for measuring auditory and audio-visual speech-reception thresholds for sentences in noise: Rationale, evaluation, and recommendations for use," Br. J. Audiol. 24, 29–43.
Markides, A. (1989). "Background noise and lip-reading ability," Br. J. Audiol. 23, 251–253.
Massaro, D. W. (1987). Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry (Erlbaum, Hillsdale, NJ).
Massaro, D. W., and Light, J. (2004). "Using visible speech to train perception and production of speech for individuals with hearing loss," J. Speech Lang. Hear. Res. 47, 304–320.
Massaro, D. W., and Oden, G. (1980). "Evaluation and integration of acoustic features in speech perception," J. Acoust. Soc. Am. 67, 996–1013.
McGurk, H., and MacDonald, J. (1976). "Hearing lips and seeing voices," Nature (London) 264, 746–748.
Miller, G. A., and Licklider, J. C. R. (1950). "The intelligibility of interrupted speech," J. Acoust. Soc. Am. 22, 167–173.
Miller, G. A., and Nicely, P. E. (1955). "An analysis of perceptual confusions among some English consonants," J. Acoust. Soc. Am. 27, 338–352.
Ross, L. A., Saint-Amour, D., Leavitt, V. M., Javitt, D. C., and Foxe, J. J. (2007). "Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments," Cereb. Cortex 17, 1147–1153.
Sommers, M. S., Spehar, B., and Tye-Murray, N. (2005). "The effect of signal-to-noise ratio on auditory-visual integration: Integration and encoding are not independent," J. Acoust. Soc. Am. 117, 2574.
Stevens, K., and Blumstein, S. (1978). "Invariant cues for place of articulation in stop consonants," J. Acoust. Soc. Am. 64, 1358–1368.
Sumby, W. H., and Pollack, I. (1954). "Visual contribution to speech intelligibility in noise," J. Acoust. Soc. Am. 26, 212–215.
Summerfield, Q., and McGrath, M. (1984). "Detection and resolution of audio-visual incompatibility in the perception of vowels," Q. J. Exp. Psychol. 36A, 51–74.
Walden, B. E., Prosek, R. A., and Montgomery, A. A. (1977). "Effects of training on the visual recognition of consonants," J. Speech Hear. Res. 20, 130–145.
Wang, Y., Behne, D., and Jiang, H. (2008). "Linguistic experience and audio-visual perception of non-native fricatives," J. Acoust. Soc. Am. 124, 1716–1726.
Watson, C. S., and Kelly, W. J. (1981). "The role of stimulus uncertainty in the discrimination of auditory patterns," in Auditory and Visual Pattern Recognition, edited by D. J. Getty and J. H. Howard (Erlbaum, Hillsdale, NJ), pp. 37–59.