Auditory-Visual Speech Perception and Aging

Kathleen M. Cienkowski and Arlene Earley Carney

Objective: This experiment was designed to assess the integration of auditory and visual information for speech perception in older adults. The integration of place and voicing information was assessed across modalities using the McGurk effect. The following questions were addressed: 1) Are older adults as successful as younger adults at integrating auditory and visual information for speech perception? 2) Is successful integration of this information related to lipreading performance?


Design: The performance of three groups of participants was compared: young adults with normal hearing and vision, older adults with normal to near-normal hearing and vision, and young controls, whose hearing thresholds were shifted with noise to match those of the older adults. Each participant completed a lipreading test, auditory identification of syllables, and auditory-plus-visual identification of syllables with conflicting auditory and visual cues.

Results: The results show that, on average, older adults are as successful as young adults at integrating auditory and visual information for speech perception at the syllable level. The number of fused responses did not differ for the CV tokens across the ages tested. Although there were no significant differences between groups for integration at the syllable level, there were differences in the response alternatives chosen. Young adults with normal peripheral sensitivity often chose an auditory alternative, whereas older adults and control participants leaned toward visual alternatives. In addition, older adults demonstrated poorer lipreading performance than their younger counterparts; this was not related to successful integration of information at the syllable level.

Conclusions: Based on the findings of this study, when auditory and visual integration of speech information fails to occur, producing a nonfused response, participants select an alternative response from the modality with the least ambiguous signal. (Ear & Hearing 2002;23:439–449)

Department of Communication Sciences (K.M.C.), University of Connecticut, Storrs, Connecticut; and University of Minnesota (A.E.C.), Minneapolis, Minnesota. DOI: 10.1097/01.AUD.0000034781.95122.15

As individuals age, there are changes in the auditory system that result in a decrease in speech recognition abilities (Committee on Hearing and Bioacoustics and Biomechanics, 1988). Loss of sensitivity in the peripheral auditory system, specifically poorer hearing thresholds, contributes substantially to this drop in performance (Humes & Roberts, 1990; Kalikow, Stevens, & Elliot, 1977). Some investigators have suggested that changes in cognitive (Pichora-Fuller, Schneider, & Daneman, 1995) or central auditory processing abilities (Jerger, Jerger, Oliver, & Pirozzolo, 1989) play a limited role in the decreased recognition of speech as well. The combination of visual cues from lipreading with the auditory speech signal adds another sensory source to the original signal; it also requires that the listener integrate information from each source to maximize performance (Grant & Seitz, 1998). Consequently, the investigation of auditory-plus-visual speech perception in older adults may provide further support for peripheral or central explanations of changes in speech perception that occur with aging.

Rehabilitation strategies that help individuals compensate for an impoverished auditory signal often encourage speechreading, the use of visual cues to supplement the auditory component (Alpiner & McCarthy, 1987). The addition of visual cues to an unintelligible auditory signal can enhance overall speech perception significantly. Sumby and Pollack (1954) demonstrated that the addition of visual cues could improve the perception of speech by the equivalent of a 5 to 18 dB increase in the signal-to-noise (S/N) ratio; that is, an effective change of up to 60% in word recognition was observed, depending on the materials used. Subsequently, Erber (1969) found that word recognition scores for young adults improved from 20% in the auditory-only condition to 80% in the auditory-plus-visual condition at a −10 dB S/N ratio. Middelweerd and Plomp (1984) and Summerfield (1979) showed similar results for sentence materials.

McGurk and MacDonald (1976) demonstrated that vision not only enhances speech perception but can, in fact, alter perception. In their experiment, an auditory /ba/ was dubbed onto a visual /ga/. The participants overwhelmingly perceived the new combined stimulus as /da/. This fusion of speech information across modalities is known as the "McGurk effect."


A stimulus with conflicting auditory and visual information frequently results in a response that contains certain phonetic aspects of place, manner, and voicing from each original modality. The effect is so robust that it is not diminished when attention is directed to only the auditory component (MacLeod & Summerfield, 1987) or by practice (Massaro, 1984). Further, if a male voice is dubbed onto a female face and vice versa, the gender incompatibilities do not reduce the effect (Green, Kuhl, Meltzoff, & Stevens, 1991). The existence of the McGurk effect provides further evidence that auditory and visual information interact in the perception of speech.

For the AV perception of consonants, manner of articulation and voicing are transmitted most efficiently by the auditory portion of the stimulus, whereas place of articulation is transmitted best by the visual portion (Grant & Seitz, 1998; Grant, Walden, & Seitz, 1998). Grant and Seitz studied auditory and visual cue integration in 41 listeners with hearing losses and categorized participant responses along a continuum. In their view, optimal cue integration, or equal impact of auditory and visual inputs, occurs when discrepant auditory and visual inputs (e.g., /bi/ + /gi/) result in a clearly fused response such as /di/. By their definition, if a response is not fused, optimal integration has not occurred; specifically, one modality is dominant over the other. However, previous investigations of the McGurk effect in young adults with normal peripheral visual and auditory sensitivity have shown wide variation in the prevalence of fused responses. Fused responses to the combination of /ba/ + /ga/ (e.g., /da/ or /tha/) have been as high as 98% (McGurk & MacDonald, 1976) and as low as 50% (Green et al., 1991) for group data. Green et al. (1991) evaluated the McGurk effect using the vowels /a/ and /i/. They found for the combinations of /ba/ + /ga/ and /bi/ + /gi/ that the vowel /i/ resulted in more fused responses than /a/. Thus, even among young adults, optimal cue integration (Grant & Seitz, 1998), as evidenced by complete fusion of information across discrepant auditory and visual stimuli, is not consistent.

Degree of auditory-visual fusion has not been examined in older adults with normal or near-normal hearing. The participants studied by Grant and Seitz ranged in age from 41 to 76 yr, with a mean age of 66 yr, and had mild-to-severe sensorineural losses with sloping configurations. Older adults may be at a disadvantage for processing visual, as well as auditory, speech information. Shoop and Binnie (1979) compared the visual recognition abilities of middle-aged and elderly adults with normal hearing using the CID Everyday Sentences. They found that as age increased, percent correct identification of key words in the visual-only (VO) condition decreased.

Middelweerd and Plomp (1984) compared the S/N ratio at which 50% recognition of Dutch sentences was achieved in an auditory-only (AO) and an auditory-visual (AV), or speechreading, condition. The average benefit of speechreading was 4.6 dB for young adults and 4.0 dB for older adults. Older adults thus seem to be at a slight disadvantage for using visual information. Walden, Busacco, and Montgomery (1993) evaluated the benefit achieved by the addition of visual cues for CV and sentence materials in middle-aged and elderly adults with hearing losses. Using a subset of the CID Everyday Sentences, they also found that the performance of the older adults fell below that of the middle-aged group (16.7% versus 34.4%) in the visual-only condition. However, AV benefit remained the same across groups. Although these studies showed that speechreading performance decreases with age, even when visual acuity is controlled, the relationship between speechreading performance per se and the integration of auditory and visual cues for speech perception is not as well understood.

Grant and Walden (1996) suggested that individuals who obtain optimal auditory-visual or bimodal recognition are able to successfully integrate unimodal information. For those with less than optimal auditory or visual information (e.g., older adults), auditory or visual inputs individually may be less likely to influence the bimodal perception of speech. Cavanaugh (1998) described a two-stage process of perceptual integration in young and older adults: the extraction, or encoding, of important stimulus characteristics and the integration, or combination, of those characteristics in a meaningful way. Aging might affect these two stages differentially. Deficits in the peripheral mechanism may limit the amount of useful information that can be extracted from the input stimulus, whereas changes in the central system may affect areas including information processing, memory, and retrieval (Humes & Christopherson, 1991). Plude and Hoyer (1985) suggested that the ability to extract information remains unaffected by age but that older adults are less successful at integrating information than younger adults.

The use of conflicting auditory and visual speech stimuli might provide a test of integration performance for older adults. If older adults were selected with normal or near-normal visual and auditory sensory systems, it is likely that their information extraction capabilities would be similar to those of younger adults. Consequently, if the integration performance of the older adults were poorer than that of young adults, changes in central processing mechanisms may be suspect.


Table 1. Means (and standard deviations) of age and hearing levels in dB HL re ANSI (1989) for the young, old, and control participants.

Group     Age           500 Hz      1k Hz       2k Hz       3k Hz        4k Hz        6k Hz        8k Hz
Young     22.3 (6.9)    2.5 (4.2)   2.1 (3.6)   1.9 (3.9)   -            3.3 (6.0)    -            4.4 (5.2)
Old       70.1 (3.5)    7.9 (6.2)   6.7 (3.5)   9.6 (7.2)   27.1 (7.2)   29.0 (21.3)  50.8 (18.8)  51.7 (26.7)
Control   21.7 (2.9)    4.8 (4.0)   3.5 (4.0)   4.2 (4.5)   -            3.7 (5.9)    -            2.5 (4.9)

Note: Interoctave frequencies of 3k and 6k Hz were not tested where thresholds at neighboring frequencies differed by less than 20 dB HL (- = not tested).
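The testing rule in the note is a standard audiometric convention and is easy to state precisely. Below is a minimal sketch of the decision as described; the function name and example values are ours, not from the study:

```python
def test_interoctave(lower_db_hl: float, upper_db_hl: float) -> bool:
    """Return True if an interoctave frequency (e.g., 3k Hz, between
    2k and 4k Hz) should be tested, i.e., when thresholds at the
    neighboring octave frequencies differ by 20 dB HL or more."""
    return abs(upper_db_hl - lower_db_hl) >= 20

# Hypothetical example: 2k Hz threshold of 8 dB HL, 4k Hz of 28 dB HL.
print(test_interoctave(8, 28))  # True -> 3k Hz would be tested
```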

Conversely, if their integration performance were equal to that of young adults, then the contribution of the peripheral systems is likely to be the most important factor underlying integration.

The current investigation tests two hypotheses. First, the most significant contribution to auditory-visual integration of speech comes from the individual unimodal systems, regardless of the age of the perceiver. Moreover, if older and younger listeners have matched auditory and visual peripheral sensitivity, they should have equivalent ability to extract the stimulus information. If, under these conditions, older adults are as susceptible as young adults to the McGurk illusion, this would suggest that the integration of auditory and visual information for the perception of speech occurs peripherally, before a higher level of cognitive integration. On the other hand, if older adults are less successful at feature integration, as indexed by fused responses to the McGurk stimuli, they may be less susceptible to the illusion; aging then may affect more central levels of speech processing.

Second, integration of auditory and visual speech cues in a conflicting condition, such as that observed for the McGurk effect, is unrelated to lipreading (visual-only) performance, regardless of age. The addition of visual cues can enhance the understanding of speech in noise by as much as 60% for sentence materials. Yet, individual lipreading scores can range from less than 10% to over 70% correct identification of words in sentences (Summerfield, 1992). Studies that have investigated lipreading performance find only weak correlations between lipreading and cognitive functions such as intelligence (Summerfield, 1991). The effects of practice have shown mixed results. Dodd, Plant, and Gregory (1989) found limited benefits of practice. Others have shown that extensive training can improve correct recognition of connected discourse (DeFilippo, 1988) and message comprehension (Matthies & Carney, 1988). Generally speaking, however, measures of lipreading performance seem to correlate only with other measures of lipreading performance.

If individuals who are poor lipreaders are also less susceptible to the McGurk effect, this would suggest that the successful integration of auditory-visual information for speech is associated with the successful processing of the visual input.

If one considers lipreading an innate cognitive ability (Summerfield, 1992), and if lipreading is linked with auditory-visual integration, this would support the role of cognitive or top-down processing in the bimodal perception of speech before phonetic identification. If lipreading performance were unrelated to auditory-visual integration, it would suggest that more peripheral mechanisms are important for successful integration, or that the two processes are unrelated, as suggested by Braida (1991) and Massaro and Cohen (2000).

METHODS

Participants

All participants initially received hearing and vision screening as part of participant selection. Hearing thresholds were measured at the octave intervals from 500 to 8000 Hz bilaterally. Two aspects of visual ability were assessed: visual acuity and contrast sensitivity. Visual acuity was measured for both eyes together using a Snellen chart at the standard distance of 20 feet (Karp, 1988). A Pelli-Robson chart, set at a standard distance of 3 meters, was used to determine contrast sensitivity (Pelli, Robson, & Wilkins, 1988). Luminance levels for the Pelli-Robson chart were measured with a Minolta CS100 Chromameter. The values ranged from 62 to 90 lumens/meter², well within the acceptable range for screening (Pelli et al., 1988). Hearing thresholds were measured using a portable audiometer (Beltone Model 119) in a double-walled sound-treated booth. Hearing was considered normal if thresholds were better than 20 dB HL (ANSI, 1989) bilaterally at all test frequencies. Vision was considered normal if participants achieved a Snellen acuity of 20/25, with correction if needed, and a contrast sensitivity measure of 1.80 (Kline & Schieber, 1985).

Experimental participants were divided into three groups: young adults, older adults, and young controls. The young adult group consisted of 12 participants, 1 male and 11 female, from the ages of 18 to 35 yr (mean age 22.3 yr) with normal hearing and normal or corrected-to-normal vision, according to the criteria noted above. Mean hearing thresholds and standard deviations for the young, old, and control groups are displayed in Table 1. The older adult group consisted of 12 participants, 2 male and 10 female, from the ages of 65 to 74 yr (mean age 70.1 yr) with normal to near-normal hearing and normal or corrected-to-normal vision.


Table 2. Age, gender, and mean thresholds from both ears in dB HL re ANSI (1989) for individual older adult participants.

Ss   Age  Gender   500 Hz  1k Hz  2k Hz  3k Hz  4k Hz  6k Hz  8k Hz
1    73   Female   10      5      18     -      35     58     75
2    72   Female   3       5      3      -      10     -      30
3    68   Female   5       3      8      25     28     28     58
4    73   Female   8       8      15     -      28     68     70
5    72   Male     15      8      13     25     50     -      60
6    66   Female   5       5      0      -      10     -      28
7    73   Female   3       8      10     33     48     60     68
8    66   Male     10      10     15     23     35     65     75
9    74   Female   13      5      8      23     45     50     70
10   74   Female   0       8      3      -      3      -      8
11   65   Female   20      13     20     -      25     55     75
12   65   Female   5       5      5      -      3      -      5

Note: Interoctave frequencies of 3k and 6k Hz were not tested where thresholds at neighboring frequencies differed by less than 20 dB HL in older participants (- = not tested).

Hearing thresholds of individual older participants are shown in Table 2. Older adult participants had hearing thresholds of ≤35 dB HL for the frequencies from 500 to 3000 Hz. These thresholds may be considered normal for male and female listeners over the age of 65, based on the normative data published from the Baltimore Longitudinal Study of Aging (Morrell, Gordon-Salant, Pearson, Brant, & Fozard, 1996). These norms were collected on individuals who were screened for absence of ear disease and noise-induced hearing loss. These normative thresholds, therefore, should reflect changes due to age rather than other otologic pathology. In the high frequencies (4000, 6000, and 8000 Hz), the older adult participants had a wider range of hearing thresholds. At 4000 Hz, only 8 of the 12 participants had thresholds better than the median thresholds of individuals of the same age and gender. All participants, except 10 and 12, showed some degree of hearing loss at 6000 and/or 8000 Hz. Hearing loss at these frequencies is not uncommon with advancing age; however, normative data are not available (Morrell et al., 1996). Each participant was administered the Health Screening questionnaire developed by Christensen, Moye, Armson, and Kern (1992) as a selection tool. By self-report, these older adult participants were considered to be in generally good health and of good mental/cognitive status.

The control group consisted of 12 young adult participants, 1 male and 11 female, from the ages of 18 to 35 yr (mean age 21.7 yr). These participants also passed the hearing and vision screening. The controls' thresholds were shifted with a high-pass noise generated by a Tucker Davis WG-1 waveform generator. The noise, with a passband of 3000 to 8000 Hz, was shaped by a Yamaha Q1027 graphic equalizer, and the shaped noise was recorded on audiotape for future playback. An overall level of 65 dB SPL, as presented through ER-3A insert earphones, was determined to shift the auditory thresholds of the young adults. The thresholds were shifted to approximate hearing levels of 35, 50, and 75 dB HL at the test frequencies of 3000, 4000, and 8000 Hz. These represent the poorest thresholds of the older adults at those frequencies. The poorest thresholds were chosen as matches for the control participants because of the variability in thresholds for the older adults. When young control participants' thresholds were shifted, the standard deviation of these thresholds still remained very small, whereas the standard deviation of the older adults' thresholds was considerably larger. Consequently, a match to the mean of the older adults' thresholds would account only for a central tendency and not for the range of sensitivity. Matching young adult controls with older adult participants' thresholds individually would be inappropriate as well without also matching them on individual lipreading performance and the tendency to demonstrate fused responses in auditory-visual tasks. Recent results reported by Carney, Clement, and Cienkowski (1999) have shown that individual participants vary greatly in their fusion of stimuli; the underlying cause of this variability is under investigation. The 45-minute taped noise sample was played from an audio cassette deck (Nakamichi DR-3) and then routed through a Shure amplifier and programmable attenuator (Tucker Davis PA4) to a mixer (Tucker Davis SM3). There it was combined with the auditory-visual signal, routed through a Crown amplifier, and sent to a pair of loudspeakers (PSB Amazing Alpha).
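For readers who wish to experiment with a comparable masker, the sketch below generates band-limited noise with the 3000 to 8000 Hz passband described above. The passband comes from the text; the sampling rate, filter order, and use of a digital Butterworth filter (rather than the original waveform generator and analog graphic equalizer) are our assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def make_masking_noise(duration_s: float, fs: int = 44100,
                       band=(3000.0, 8000.0), seed: int = 0) -> np.ndarray:
    """Generate Gaussian noise limited to the 3000-8000 Hz passband
    described in the text. The presentation level (65 dB SPL in the
    experiment) would be set by the playback chain, not here."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(int(duration_s * fs))
    sos = butter(8, band, btype="bandpass", fs=fs, output="sos")
    shaped = sosfilt(sos, noise)
    return shaped / np.max(np.abs(shaped))  # normalize to +/-1

noise = make_masking_noise(duration_s=5.0)
```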

Stimulus Preparation


TABLE 3. Auditory and visual tokens combined to create the McGurk effect.

Auditory Token   Visual Token   Fused Response   Visually Biased Responses
/bi/             /gi/           /di/             /si/, /θi/, /ði/, /i/, /hi/
/pi/             /ki/           /ti/             /si/, /θi/, /ði/, /i/, /hi/

Stimuli • Stimuli consisted of the 10 lists of the Central Institute for the Deaf (CID) Everyday Sentences (Silverman & Hirsh, 1955) and the CV syllables /bi, gi, pi, ki/ produced by two talkers, one male and one female. Each was a native speaker of English with a General American dialect and without professional voice training. Stimuli were videotaped using a Panasonic WV-3260/8AF color video camera with a Super VHS videorecorder (JVC HR-S5200U). The talkers were seated 1.5 meters from the camera against a neutral gray background. Illumination came from filtered fluorescent lighting along the room's upper perimeter. The camera was set to record the talker's face and shoulders. A template was applied so that each talker's face measured approximately 5 inches in diameter on a 19-inch monitor. All recordings were made in a single-walled sound-treated suite. Each sentence and nonsense syllable was produced three times. Talkers were allowed practice to familiarize themselves with the items before recording. The single best item was selected for each sentence and nonsense syllable on the basis of overall articulatory quality and the lack of extraneous facial movements. Talkers recorded each of the 10 lists of the CID Everyday Sentences with a different randomization of sentences per list. CV stimuli were always recorded in the same order.

The auditory-only tokens were /bi/, /pi/, /gi/, and /ki/, spoken naturally by each of the two talkers. The auditory-visual tokens were formed by combining the auditory tokens /bi/ and /pi/ with the visual tokens /gi/ and /ki/. Expected responses are shown in Table 3. For the voiced tokens, a /bi/ is a correct auditory response and a /gi/ is a correct visual response, because both /bi/ and /gi/ stimuli are actually presented. Similarly, for the voiceless tokens, /pi/ and /ki/ are correct responses for the auditory and visual tokens, respectively. A /di/ or /ti/ response represents optimal fusion for the voiced and voiceless stimuli, respectively, whereby the voicing and manner of the auditory token are maintained. However, visual bias, or the influence of a visual stimulus on an auditory stimulus, can still be reflected in other responses, as shown in the fourth column of Table 3. A sketch of this response scoring appears below.
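The response categories of Table 3 amount to a small scoring function. The following is a minimal sketch; the category labels are ours, and /θi/ and /ði/ are romanized as "thi" and "dhi" for plain-text keys:

```python
FUSED = {"bi": "di", "pi": "ti"}          # auditory token -> fused percept
VISUAL = {"bi": "gi", "pi": "ki"}         # auditory token -> dubbed visual token
BIASED = {"si", "thi", "dhi", "i", "hi"}  # other visually biased percepts

def classify(auditory_token: str, response: str) -> str:
    """Score one response to a dubbed (McGurk) token per Table 3."""
    if response == auditory_token:
        return "auditory-correct"
    if response == VISUAL[auditory_token]:
        return "visual-correct"
    if response == FUSED[auditory_token]:
        return "fused"
    if response in BIASED:
        return "visually biased"
    return "other"

print(classify("bi", "di"))  # 'fused' -- the classic McGurk percept
```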

Stimulus Tape Preparation • Source videotapes were played on a Super VHS videorecorder (JVC HR-S7100U) and sent to the input port of a video card on a Pentium laboratory computer. Original videotaped stimuli were digitized and edited with Broadway 2.5 software (Data Translation Co.). Audio signals were digitized at a sampling rate of 11,000 Hz. After editing, stimuli were randomized and recorded by a JVC BR-5378U Super VHS deck. Each syllable was edited to include 1 sec of the talker in a neutral position both before and after each utterance, along with 1-sec fades to and from video black. Ten seconds of video black were inserted between test items. For monosyllabic stimuli that had matched auditory and visual components, the stimuli were digitized and saved as a single integral file. For stimuli that had unmatched auditory and visual tracks (those created to assess the McGurk effect), the original auditory portion of the stimulus file was removed using the auditory filter option of the Broadway Audio editor, and the desired auditory signal was inserted on a second audio track. The new auditory signal was aligned so that its consonant burst matched that of the original auditory signal, with an alignment error of less than 4 to 6 msec. The file was resaved as a new audiovisual file. The words "Ready" and "Now" were inserted on the videotape before each stimulus to direct the participant's attention to the task. For the sentence stimuli, each of the 100 sentences was edited to include the talkers in the neutral position and fades in and out of video black. An on-screen title was inserted for each of the three test conditions: AO, VO, and AV. For the AO condition, the video portions of the sentence stimuli were replaced by video black. For the VO condition, the auditory portions of the stimuli were removed using an auditory filter. The AV condition contained intact versions of the AV stimuli. All sentence stimuli were saved as single audiovisual files.

Procedures

For all three groups of participants (young adults, older adults, and control), the CID Sentences were presented in three conditions: AO, VO, and AV. Three lists were presented in each condition. One list was presented in the AV condition for practice. Ten seconds of video black were inserted between test items. Participants were instructed to listen to and/or watch each sentence and repeat out loud what they judged to have been said. Sentence intelligibility by condition was determined from the number of key words correctly identified (a minimal scoring sketch appears at the end of this section). The participants' responses were scored by the experimenter at the time of testing. Participants were randomly assigned to view either the male or the female talker. Two talkers were included to be consistent with experimental paradigms from the existing literature in auditory-visual speech perception (see McGurk & MacDonald, 1976, and Green & Kuhl, 1991, for examples).


TABLE 4. Syllable identification of dubbed voiced and voiceless auditory and visual stimuli (McGurk condition), expressed as a percentage of response choice for the male and female talker.

Auditory /bi/ and Visual /gi/ (voiced)

                 Male Talker                   Female Talker
Response     Young    Old      Control    Young    Old      Control
/bi/         19.58    4.58     0.83       9.58     0.83     0.00
/gi/         8.75     1.67     0.42       0.00     0.42     3.33
/di/         70.83*   93.75    96.67      50.42    57.50    75.00
/θi/         0.00     0.00     0.00       9.17     1.67     2.50
/si/         0.00     0.00     0.00       0.00     22.91    5.42
/i/          0.83     0.00     2.51       30.42    16.67    13.75
Visual bias  71.66    93.75    99.18      90.01    98.75    96.67

Auditory /pi/ and Visual /ki/ (voiceless)

                 Male Talker                   Female Talker
Response     Young    Old      Control    Young    Old      Control
/pi/         9.58     2.92     6.67       0.83     0.83     2.92
/ki/         37.50    9.17     21.67      37.08    17.50    25.00
/ti/         52.92    86.25    71.67      61.25    73.33    25.00*
/si/         0.00     0.00     0.00       0.00     0.00     7.50
/hi/         0.00     0.00     0.00       0.00     3.33     0.00
/di/         0.00     0.00     0.00       0.00     0.00     1.25
/i/          0.00     0.00     0.00       0.00     0.00     38.33
Visual bias  52.92    86.25    71.67      61.25    76.66    72.08

* Significant at the .05 level within talker. Visual bias is the sum of all responses other than /bi/ and /gi/ for the voiced tokens and /pi/ and /ki/ for the voiceless tokens.
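The visual bias figure in the note to Table 4 is a simple sum over a table row. A sketch of the computation follows, using the young group's male-talker voiced row as input (the dictionary layout is ours, and "thi" stands in for /θi/):

```python
# Response percentages for the young group, male talker, voiced tokens
# (from Table 4).
voiced_young_male = {"bi": 19.58, "gi": 8.75, "di": 70.83,
                     "thi": 0.00, "si": 0.00, "i": 0.83}

def visual_bias(responses: dict, auditory: str, visual: str) -> float:
    """Sum all response percentages other than the auditory token
    (/bi/ or /pi/) and the visual token (/gi/ or /ki/)."""
    return sum(pct for token, pct in responses.items()
               if token not in (auditory, visual))

print(round(visual_bias(voiced_young_male, "bi", "gi"), 2))  # 71.66
```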

The order of conditions was counterbalanced across listeners and blocked by talker. The CVs /bi, pi/ were used for the auditory identification task. Stimuli for the McGurk task are displayed in Table 3. Each of the two dubbed tokens was presented 20 times, for a total of 40 dubbed or McGurk tokens. Participants were instructed as noted above. For each CV test condition (AO identification of nonsense syllables and AV identification of McGurk stimuli), the presentation of test items was randomized within a given talker. Half of the participants received the AO condition first; the remaining participants began with the AV condition.
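As promised above, here is a minimal sketch of percent-key-words-correct scoring for the sentence materials. The scoring rule (count of key words repeated correctly) is from the text; the sentence and key-word set in the example are hypothetical stand-ins, not actual CID list items:

```python
def percent_keywords_correct(responses: list[str],
                             keyword_lists: list[set[str]]) -> float:
    """Percent of key words, pooled across sentences, that appear
    in the listener's repetition of each sentence."""
    total = sum(len(kws) for kws in keyword_lists)
    hits = sum(len(kws & set(resp.lower().split()))
               for resp, kws in zip(responses, keyword_lists))
    return 100.0 * hits / total

score = percent_keywords_correct(
    ["walking is my favorite exercise"],
    [{"walking", "favorite", "exercise"}])
print(score)  # 100.0
```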

RESULTS

Auditory-Only Syllable Identification

The results for the correct identification of CV syllables in the AO condition are displayed in Table 4. For the female talker, the young participant group correctly identified the CVs /bi, pi/ with close to 100% accuracy; the standard deviation was less than 3% for each token. The older adult group was less successful at this task than the young adult group, at 98.3% for /bi/ and 85.0% for /pi/. In both cases, the standard deviations were larger for the older adults, ranging from 5.8% for the identification of /bi/ to 25.4% for the identification of /pi/. The control group, whose high-frequency thresholds had been shifted to approximate the poorest thresholds of the older adults, had the lowest overall percent correct identification of the three groups.

As observed for the young and older groups, recognition of /bi/ was best, at 99% correct, and identification of /pi/ was poor, at about 62%. The control participants also showed large standard deviations, much like the older group (2.9 to 26.6). For the older adult and control groups, the standard deviations were largest for /pi/. Overall correct identification was better for tokens produced by the male talker, although the pattern of results was similar to that for the female talker. The young group showed 100% correct identification of all CV syllables spoken by the male talker. The older participants demonstrated correct identification at 97% or greater for the CVs /bi/ and /pi/; the standard deviation for the older adults who viewed the male talker was 8.7. The control group fared better with the male talker as well, although not to the same degree as the other two participant groups, showing good identification of /bi/ and /pi/ at 95% and 97.5%, respectively. Standard deviations continued to be greater for the control group (S.D. 9.7).

A multivariate analysis of variance (MANOVA) was conducted to examine the talker and participant group variables. The MANOVA showed significant main effects of talker [F (1,66) = 7.45, p < 0.001] and participant group [F (2,66) = 4.75, p < 0.001], and an interaction of talker x participant group [F (2,66) = 2.76, p = 0.008]. Further analyses were therefore conducted on the data for each of the talkers individually. One-way ANOVAs found the performance of the three participant groups to be equivalent for the identification of /bi/ for the female [F (2,35) = 1.435, p = 0.2526] and the male [F (2,35) = 0.60, p = 0.5547] talkers. The control group appears to be at a disadvantage for the correct identification of syllables in the AO condition: for the female talker, their percent correct scores were significantly lower for /pi/ [F (2,35) = 9.49, p < 0.001].

Because the overall performance of both the older adult and control groups fell below that of the young group, it is possible that hearing level, and consequently audibility, played a part in the recognition scores. Correlational analysis showed a negative correlation between the correct identification of /pi/ and the high-frequency pure-tone average (PTA) of 1000, 2000, 4000, and 8000 Hz for experimental participants. The correlation between percent correct identification and PTA for the CV token /pi/ was −0.60 for the female talker; the correlation for the male talker's /pi/ was nonsignificant (p > 0.1). As the pure-tone average increased, indicating poorer hearing, the correct identification of syllables decreased. When the high-frequency pure-tone average was 30 dB HL or better, most participants had identification scores greater than 90%. In cases where the pure-tone average was greater than 30 dB HL (all of the control group and seven of the older adult participants), the correct identification scores were more divergent.
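The high-frequency PTA and the correlation reported above are straightforward to compute. The sketch below shows the form of the analysis with hypothetical audiograms and /pi/ scores; the real participant data are not reproduced here:

```python
import numpy as np
from scipy.stats import pearsonr

def hf_pta(thresholds_db_hl: dict) -> float:
    """High-frequency pure-tone average (dB HL) over 1000, 2000,
    4000, and 8000 Hz, as used in the correlational analysis."""
    return float(np.mean([thresholds_db_hl[f]
                          for f in (1000, 2000, 4000, 8000)]))

# Hypothetical participants: an audiogram and a /pi/ score for each.
audiograms = [{1000: 5, 2000: 5, 4000: 10, 8000: 10},
              {1000: 5, 2000: 10, 4000: 30, 8000: 50},
              {1000: 10, 2000: 15, 4000: 50, 8000: 75}]
pi_scores = [98.0, 85.0, 62.0]

r, p = pearsonr([hf_pta(a) for a in audiograms], pi_scores)
print(round(r, 2))  # negative, in the spirit of the reported -0.60
```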

Auditory-Visual Syllable (McGurk) Identification

The results for the identification of the auditory-plus-visual CV syllables, the McGurk condition, are shown in Table 4. The table reflects the participant responses for the individual talkers, female and male, and consonant category, voiced and voiceless. A /di/ response represents optimal fusion. A /si, θi, i/ response represents some visual biasing (i.e., the "correct" auditory response has been altered by the visual stimulus). A /bi/ response is a correct auditory response, and a /gi/ response is a correct visual response. Among the responses to the female talker and the voiced CV combination of /bi/ and /gi/, the three participant groups did not differ significantly in the number of optimally fused /di/ responses [F (2,35) = 4.11, p = 0.2055]. Differences between groups did appear among the alternative response categories. Older participants were more likely than either the young or control participants to perceive /si/ for this combination [F (2,35) = 3.86, p = 0.0311]. A large portion of the responses for each participant group reflected the perceived omission of the initial consonant; this response was given significantly more frequently by the young group [F (2,35) = 5.50, p = 0.0078].

Overall, there was a large visual bias effect (>90%) for all three participant groups who viewed the female talker.

In contrast, the results for the voiced CVs for the male talker show a different pattern. The majority of the older and control groups responded with the fused response /di/. The young group was significantly less likely to use this response [F (2,35) = 4.30, p = 0.0704]. No other response categories were statistically different. The visual bias effect was quite large for the old and control participants (>93%) and substantially less for the young participants (72%).

Results for the voiceless combinations for the female talker show that the control group had significantly fewer fused McGurk responses than the other groups [F (2,35) = 8.34, p = 0.0012]. They were the only group to use the /si/ alternative [F (2,35) = 5.03, p = 0.0124] and to perceive an omission of the initial consonant [F (2,35) = 5.50, p = 0.0087]. The young and older adults most often responded with the fused response /ti/ or chose /ki/ as the alternative response. The number of participants who responded with the correct visual alternative, /ki/, was greater in all groups in this voiceless condition than in the voiced condition for the female talker, although the differences did not reach statistical significance.

For the male talker, the number of fused McGurk responses dropped from the voiced to the voiceless condition by 7.5% for the older participants, 17.91% for the young participants, and 25% for the control participants. The young group again had the lowest percentage of fused responses, but this difference did not reach statistical significance [F (2,35) = 2.65, p = 0.0855]. Many participants in this condition chose the visual input /ki/ as the most favored alternative response, similar to the results found for the female talker. For both talkers, the overall visual bias effect was smaller than for the voiced consonants, averaging 70% for each talker, with the smallest visual bias effect observed for the young participants.

A MANOVA showed significant main effects of talker [F (1,66) = 3.32, p = 0.001] and participant group [F (2,66) = 2.95, p < 0.001], and a significant interaction of talker x participant group [F (2,66) = 2.40, p = 0.001]. Examination of Table 4 reveals several points. First, the male talker elicited more optimally fused McGurk responses than the female talker [F (1,66) = 11.14, p = 0.001]; that is, /di/ was perceived for the combination of auditory /bi/ and visual /gi/, and /ti/ was perceived for the combination of auditory /pi/ with visual /ki/. Second, for the male talker, the complete omission of the initial consonant in the participant responses was rare, whereas this occurred as often as 38% of the time for the female talker.


TABLE 5. The percentage of key words correctly identified in the auditory-only, visual-only, and auditory-plus-visual conditions for the male and female talkers.

                 Male Talker                   Female Talker
Condition    Young    Old      Control    Young    Old      Control
AO           100.00   100.00   98.33*     100.00   99.86    98.17*
VO           28.67    16.00*   23.50      37.11    29.18    41.50
AV           100.00   100.00   100.00     100.00   100.00   100.00

* Significant at the .05 level between groups within talker.

Finally, fused McGurk responses were more frequent for the voiced than for the voiceless CVs.

Sentence Perception

Scores for the correct identification of key words in the AO, VO, and AV conditions are displayed in Table 5. Analyses were conducted separately for the female and male talkers because normative data indicated that the two talkers used in this experiment are not visually equivalent (Cienkowski, 1999).* In the AO and AV conditions, all participant groups demonstrated >98% correct identification of key words regardless of talker; this is at or near the performance ceiling. In the VO condition, scores obtained for the male talker (about 16 to 29%) were poorer than those for the female talker (about 29 to 42%). This is consistent with the normative data from Cienkowski (1999), which suggest that the female talker is more visually intelligible. The young participants and the control participants had similar VO scores; the scores for the older participants were poorer. The only statistically significant difference among scores for the male talker was between the young adult and older adult groups [F (2,59) = 4.45, p = 0.0160].

*The male and female talkers in this experiment were part of a larger corpus of talkers used by Carney et al. (1999). Those investigators assessed lipreading performance for the syllabic stimuli /bi, di, gi/ for 11 talkers and 36 young adult participants. Performance was scored as percent correct viseme for each talker for the three syllabic stimuli. For the stimulus /bi/, participants judged 100% of the tokens as bilabial for the female talker and 97% as bilabial for the male talker. For the stimulus /gi/, participants judged the female talker's productions as bilabial 8%, alveolar 26%, velar 51%, and some other category 15%. Similarly, for the stimulus /gi/, participants judged the male talker's productions as bilabial 0%, alveolar 28%, velar 58%, and other 14%. These two talkers were thus quite similar for the two places of articulation examined in the current experiment. Participants differed in perception of the alveolar place of articulation: for the syllable /di/, they perceived 0% bilabial, 85% alveolar, and 15% velar for the female talker, whereas for the male talker's production of /di/ they judged 0% bilabial, 39% alveolar, 49% velar, and 12% as another category. The differences in the perception of the alveolar stimuli are unimportant for the current experiment because alveolars were not presented as stimuli in the syllabic portions of the experiment.

A correlational analysis was used to determine the relationship between optimally fused McGurk responses (/di/ in the voiced condition or /ti/ in the voiceless condition) and lipreading ability. The results of this analysis show that for both talkers, male and female, no significant relationship was found between the two variables in either the voiced or the voiceless condition. That is, lipreading ability did not appear to relate to the number of fused responses for a given participant.

DISCUSSION

The primary goal of this experiment was to determine whether young and older adults with normal to near-normal auditory and visual peripheral sensitivity integrate auditory and visual speech information at the syllable level to the same degree. The secondary goal was to determine whether lipreading performance was related to integration as measured by the McGurk effect. The results of this experiment show that older adults are successful at integrating speech information across modalities at the syllable level. The percentage of fused responses to the combinations of auditory /bi/ + visual /gi/ and auditory /pi/ + visual /ki/ was not significantly different across the age groups in all but one instance. The exception was the young adult group, who were less likely to report a fused response (/di/) when tested with the male talker using voiced CVs.

The quality of the input signal plays a role in the resulting integration across modalities. For example, a clear visual signal will exert greater influence over an ambiguous auditory token than over a distinct auditory token (Grant et al., 1998). The source of the auditory ambiguity may be the talker's articulation or distortion caused by hearing loss or noise. In this experiment, the quality of the auditory signal was limited to an extent by a mild loss of high-frequency hearing for the older adult and control groups. Results of the AO syllable identification task show that the young adult group had near-perfect identification of CVs for the female and male talkers. The older adult and control groups did not perform as well and demonstrated a wider range of performance.

Correlational analysis found a relationship between correct identification of syllables in the AO condition and hearing levels; that is, identification is related to the amount of audible information available. This is consistent with the finding that the control group, who had the poorest hearing thresholds because of the masking noise, also had the lowest overall correct identification of syllables.

The level of signal ambiguity manifests itself in the integration results. The young adults, for whom the syllables were clearly audible, had the lowest number, on average, of fused and visually biased responses among the participant groups. Among their alternative responses, they chose the auditory portion of the dubbed syllable with far greater frequency than either the older adult or control participants did. These results are similar to the work of previous investigators: McGurk and MacDonald (1976), Green et al. (1991), and Green and Gerdman (1995) all found that young adults chose the auditory alternative when a fused response did not occur. Older adults in this experiment, in contrast, rarely used the auditory component as a response alternative.

For the male talker, nearly all participants chose a fused response, with a small number of visual and auditory token alternatives. For the female talker, a different pattern emerged. For the voiced combination of auditory /bi/ + visual /gi/, approximately 23% of the response alternatives given by the older adults were /si/. This reflects a change from the expected fused response. The typical fused response, /di/, maintains the manner of articulation and voicing characteristics of the input signals. The alternative response, /si/, unlike /di/, differs from the input tokens in manner of articulation, from stop to fricative, as well as in voicing, from voiced to voiceless. Place information, however, appears to have been successfully integrated.

It is possible that ambiguity was present in the visual signal as well. That is, a distinct auditory /bi/ in combination with a clear visual /gi/ results in the perception of /di/, the McGurk effect, whereas a distinct auditory /bi/ in combination with an ambiguous visual /gi/ results in a perception that is fused in some fashion but is not clearly a /di/. Binnie, Montgomery, and Jackson (1974) found /d, g, s/ to be in the same viseme group along with /k, n, t, z/; similar visual confusions had been reported earlier by Fisher (1968). Sounds within a viseme group look the same on the lips; that is, the visual production of the sounds is similar. In the absence of an auditory signal, sounds within a viseme group may be confused with one another. The young adults in this experiment tended to choose the auditory token in cases of visual ambiguity. The older adults, however, were less likely to use the auditory alternative; their response of /si/ showed a visual bias short of optimal fusion.

However, because a direct measure of CV tokens in the visual-only condition was not made, empirical support for this interpretation is not available. For the voiceless condition, /pi/ + /ki/, the visual token was the predominant alternative used by the older adults.

The response pattern of the control group was much like that of the older adults. The auditory token was used infrequently as an alternative for both talkers and both voicing conditions. Like the older adult group, some of the control group used /si/ as an alternative. The control group was at a clear disadvantage in the voiceless condition with the female talker because their correct identification of /pi/ in the auditory-only condition was lower than that of both the young and older adults. When /pi/ was clearly audible to the control group, as was the case for the male talker, the combination of auditory /pi/ and visual /ki/ frequently resulted in the perception of /ti/ (about 72% of the responses). For the female talker, auditory identification of /pi/ was only 62% for the control group, and the percentage of fused responses to auditory /pi/ + visual /ki/ was only 25%. In this case, the control group chose /i/, another visually biased alternative response, 38% of the time.

The perception of an omitted initial consonant for tokens spoken by the female talker highlights the striking talker differences that are consistent throughout the experiment. Similar differences between male and female talkers were reported, but not discussed as differences, by other investigators (Green et al., 1991; Green & Kuhl, 1991). The talker differences are not limited to the CV stimuli. Those experimental participants who viewed the male talker had lower correct identification of key words overall in the VO condition than those participants who viewed the female talker. The lower end of the range for both talkers represents the performance of the older adult group. The older adults were less successful at correctly identifying key words in the VO condition than either young adult group. This is consistent with the results of Shoop and Binnie (1979), which showed a decline in lipreading (VO) performance with age.

In everyday situations, speech is rarely presented as single syllables or words in isolation. Speech is a complex signal that involves acoustic, semantic, lexical, and syntactic information working in combination. The relationship among these variables for auditory speech recognition is quantifiable: simple probability theory can be used to predict performance for one set of speech stimuli, such as sentences, from another, such as syllables (Boothroyd & Nittrouer, 1988). This is in keeping with the assertion that speech represents a single underlying construct and that, therefore, performance on measures of speech understanding must be related to one another (Bilger, 1984; Bilger, Nuetzel, Rabinowitz, & Rzeczkowski, 1984). Experimental evidence clearly supports this claim (Boothroyd & Nittrouer, 1988; Kalikow et al., 1977; Olsen, Van Tasell, & Speaks, 1997).
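Boothroyd and Nittrouer's (1988) treatment can be summarized by their j-factor relation, in which the probability of recognizing a whole is the probability of recognizing its parts raised to a power j, the effective number of independent parts. A sketch with illustrative numbers (not data from this experiment):

```python
def predict_whole(p_part: float, j: float) -> float:
    """Boothroyd & Nittrouer's (1988) relation: p_whole = p_part ** j.
    Context reduces j, so predicted whole-item scores rise."""
    return p_part ** j

p_syllable = 0.90
for j in (2.5, 2.0, 1.5):  # j shrinks as context makes parts redundant
    print(j, round(predict_whole(p_syllable, j), 3))
```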


Boothroyd (1988) showed that the predictable nature of these relationships extends to speech understanding by vision alone; that is, individuals who show good lipreading performance at the syllable level also demonstrate good performance at the sentence level. Studies that have looked at the relationship among performance measures of audio-visual speech perception have not answered this question definitively. Some studies have found no relationship between integration at the syllable level, as seen in the McGurk effect, and integration for sentence-level materials (Easton & Basala, 1982; Grant et al., 1998). Others have found that integration occurs at the sentence level if the proper conditions are compared (Dekle, Fowler, & Funnell, 1992).

Integration in this project was measured at the syllable level only; therefore, it is not possible to make a direct comparison between auditory-visual integration of syllables and sentences. Measures were made of VO understanding of sentence materials that can be compared with the integration results. The rationale for this comparison is that, for successful integration to occur across modalities, information within each modality must be successfully processed. If speech understanding represents a single construct regardless of the input modality, one would expect a relationship to exist among the unimodal and bimodal conditions. The results from this study provide no evidence of such a relationship. However, correct identification of syllables in a VO condition was not assessed; therefore, a complete comparison with AV integration could not be made. If a relationship exists between AV integration at the syllable level and unimodal visual-only performance at the sentence level, it has not been shown by the experiments reported here. If lipreading is considered a cognitive ability, the lack of a relationship between lipreading and AV integration as measured by the McGurk effect may be taken to suggest a more peripheral component to auditory-visual integration or, alternatively, that the two processes are independent of one another (Braida, 1991; Massaro, 1984).

CONCLUSION

This experiment was designed to determine the extent to which older adults integrate information across the auditory and visual modalities for the perception of speech. Integration was considered at the level of simple syllables.

The relationship between integration and successful performance on unimodal speech recognition tasks was also examined. Of the three participant groups in this experiment (young, old, and control), no single group showed perfect integration, as defined by a clearly fused response, in every instance. However, there was clear evidence of a predominance of visual bias for all participants. The young group showed the least visual bias overall across experimental conditions. Although there were no significant differences between groups for integration at the syllable level, there were differences in the response alternatives chosen. Young adults, who had normal peripheral auditory sensitivity, often chose an auditory alternative when optimal integration failed. Older adult and control participants, whose peripheral auditory sensitivity was diminished by high-frequency hearing loss or noise masking, leaned toward visual alternatives. These findings suggest that when auditory and visual integration of speech information fails to occur, producing a nonfused response, participants select an alternative response from the modality with the least ambiguous signal. For the young adults, the auditory signal was the least ambiguous choice; for the older adult and control participants, the visual input was the more reasonable candidate given the reduction in available auditory information.

ACKNOWLEDGMENTS

Portions of this manuscript were presented at the meeting of the American Speech-Language-Hearing Association, San Francisco, 1999. This work was supported by NIH-NIDCD 001-DC 00110.

Address for correspondence: Kathleen M. Cienkowski, Ph.D., Department of Communication Sciences, University of Connecticut, 850 Bolton Road, Box U-85, Storrs, CT 06269. E-mail: [email protected].

Received December 12, 2000; accepted December 13, 2001.

REFERENCES

Alpiner, J., & McCarthy, P. (1987). Rehabilitative Audiology: Children and Adults. Baltimore: Williams & Wilkins.
ANSI (1989). American national standard specifications for audiometers. ANSI S3.6-1989. New York: ANSI.
Bilger, R. C. (1984). Speech recognition test development. In E. Elkins (Ed.), Speech Recognition by the Hearing Impaired (pp. 1–15). ASHA Reports, 14. Rockville, MD: ASHA.
Bilger, R., Nuetzel, J., Rabinowitz, W., & Rzeczkowski, C. (1984). Standardization of a test of speech perception in noise. Journal of Speech and Hearing Research, 27, 32–48.
Binnie, C., Montgomery, A., & Jackson, P. (1974). Auditory and visual contributions to the perception of consonants. Journal of Speech and Hearing Research, 17, 619–630.
Boothroyd, A. (1988). Linguistic factors in speechreading. In C. L. DeFilippo & D. G. Sims (Eds.), New Reflections on Speechreading (monograph). Volta Review, 90, 77–87.
Boothroyd, A., & Nittrouer, S. (1988). Mathematical treatment of context effects in phoneme and word recognition. Journal of the Acoustical Society of America, 84, 101–114.
Braida, L. (1991). Crossmodal integration in the identification of consonant segments. Quarterly Journal of Experimental Psychology, 43, 647–677.
Carney, A. E., Clement, B. R., & Cienkowski, K. M. (1999). Talker variability effects in auditory-visual speech perception. Journal of the Acoustical Society of America, 106, 2270(A).
Cavanaugh, J. C. (1998). Adult Development and Aging. New York: Brooks/Cole.
Christensen, K., Moye, J., Armson, R., & Kern, T. (1992). Health screening and random recruitment for cognitive aging research. Psychology and Aging, 7, 204–208.
Cienkowski, K. M. (1999). Auditory-visual speech perception across the lifespan [Doctoral dissertation, University of Minnesota, 1999]. Dissertation Abstracts International, 60/01, 116.
Committee on Hearing and Bioacoustics and Biomechanics (1988). Speech understanding and aging. Journal of the Acoustical Society of America, 83, 859–893.
DeFilippo, C. L. (1988). Tracking for speechreading training. In C. L. DeFilippo & D. G. Sims (Eds.), New Reflections on Speechreading (monograph). Volta Review, 90, 215–240.
Dekle, D., Fowler, C., & Funnell, M. (1992). Audiovisual integration in perception of real words. Perception and Psychophysics, 51, 355–362.
Dodd, B., Plant, G., & Gregory, M. (1989). Teaching lipreading: The efficacy of lessons on video. British Journal of Audiology, 23, 229–238.
Easton, R., & Basala, M. (1982). Perceptual dominance during lipreading. Perception and Psychophysics, 32, 562–570.
Erber, N. (1969). Interaction of audition and vision in the recognition of oral speech stimuli. Journal of Speech and Hearing Research, 12, 423–425.
Fisher, C. G. (1968). Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 12, 796–804.
Grant, K., & Seitz, P. (1998). Measures of auditory-visual integration in nonsense syllables and sentences. Journal of the Acoustical Society of America, 104, 2438–2450.
Grant, K., & Walden, B. (1996). Evaluating the articulation index for auditory-visual consonant recognition. Journal of the Acoustical Society of America, 100, 2415–2424.
Grant, K., Walden, B., & Seitz, P. (1998). Auditory-visual speech recognition by hearing-impaired subjects: Consonant recognition and auditory-visual integration. Journal of the Acoustical Society of America, 103, 2677–2690.
Green, K., & Gerdman, A. (1995). Cross-modal discrepancies in coarticulation and the integration of speech information: The McGurk effect with mismatched vowels. Journal of Experimental Psychology: Human Perception and Performance, 21, 1409–1426.
Green, K., & Kuhl, P. (1991). Integral processing of visual place and auditory voicing information during phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 17, 278–288.
Green, K., Kuhl, P., Meltzoff, A., & Stevens, E. (1991). Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception and Psychophysics, 50, 524–536.
Humes, L., & Christopherson, L. (1991). Speech identification difficulties of hearing-impaired elderly persons: The contributions of auditory processing deficits. Journal of Speech and Hearing Research, 34, 686–693.
Humes, L., & Roberts, L. (1990). Speech recognition difficulties of the hearing-impaired elderly: The contributions of audibility. Journal of Speech and Hearing Research, 33, 726–735.
Jerger, J., Jerger, S., Oliver, T., & Pirozzolo, F. (1989). Speech understanding in the elderly. Ear and Hearing, 10, 79–89.
Kalikow, D., Stevens, K., & Elliot, L. (1977). Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability. Journal of the Acoustical Society of America, 61, 1337–1351.
Karp, A. (1988). Reduced vision and speechreading. Volta Review, 90, 61–74.
Kline, D., & Schieber, F. (1985). Vision and aging. In J. E. Birren & K. W. Schaie (Eds.), Handbook of the Psychology of Aging (pp. 296–331). New York: Van Nostrand Reinhold.
MacLeod, A., & Summerfield, Q. (1987). Quantifying the contribution of vision to speech perception in noise. British Journal of Audiology, 21, 131–141.
Massaro, D. (1984). Children's perception of visual and auditory speech. Child Development, 55, 1777–1788.
Massaro, D., & Cohen, M. (2000). Tests of auditory-visual integration efficiency within the framework of the fuzzy logical model of perception. Journal of the Acoustical Society of America, 108, 784–789.
Matthies, M. L., & Carney, A. E. (1988). A modified speech tracking procedure as a communicative performance measure. Journal of Speech and Hearing Research, 31, 394–404.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Middelweerd, M., & Plomp, R. (1984). The effect of speechreading on the speech reception threshold of sentences in noise. Journal of the Acoustical Society of America, 82, 2145–2147.
Morrell, C., Gordon-Salant, S., Pearson, J., Brant, L., & Fozard, J. (1996). Age- and gender-specific reference ranges for hearing level and longitudinal changes in hearing level. Journal of the Acoustical Society of America, 100, 1949–1967.
Olsen, W., Van Tasell, D., & Speaks, C. (1997). Phoneme and word recognition for words in isolation and in sentences. Ear and Hearing, 18, 175–188.
Pelli, D., Robson, J., & Wilkins, A. (1988). The design of a new letter chart for measuring contrast sensitivity. Clinical Vision Science, 2, 187–199.
Pichora-Fuller, M. K., Schneider, B., & Daneman, M. (1995). How young and old adults listen to and remember speech in noise. Journal of the Acoustical Society of America, 97, 593–608.
Plude, D., & Hoyer, W. (1985). Attention and performance: Identifying and localizing age deficits. In N. Charness (Ed.), Aging and Human Performance (pp. 47–99). Chichester, England: Wiley.
Shoop, C., & Binnie, C. (1979). The effects of age on the visual perception of speech. Scandinavian Audiology, 8, 3–8.
Silverman, S., & Hirsh, I. (1955). Problems related to the use of speech in clinical audiometry. Annals of Otolaryngology, 64, 1234–1245.
Sumby, W., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215.
Summerfield, Q. (1979). Use of visual information for phonetic perception. Phonetica, 36, 314–331.
Summerfield, Q. (1991). Visual perception of phonetic gestures. In I. G. Mattingly & M. Studdert-Kennedy (Eds.), Modularity and the Motor Theory of Speech Perception (pp. 117–138). Hillsdale, NJ: Erlbaum.
Summerfield, Q. (1992). Lipreading and audio-visual speech perception. Philosophical Transactions of the Royal Society of London B, 335, 71–78.
Walden, B., Busacco, D., & Montgomery, A. (1993). Benefit from visual cues in auditory-visual speech recognition by middle-aged and elderly persons. Journal of Speech and Hearing Research, 36, 431–436.