Article

Audiovisual Cues and Perceptual Learning of Spectrally Distorted Speech

Language and Speech 54(4) 487–497 © The Author(s) 2011. DOI: 10.1177/0023830911404958

Michael Pilling and Sharon Thomas
MRC Institute of Hearing Research, Nottingham, UK

Abstract

Two experiments investigate the effectiveness of audiovisual (AV) speech cues (cues derived from both seeing and hearing a talker speak) in facilitating perceptual learning of spectrally distorted speech. Speech was distorted through an eight-channel noise-vocoder which shifted the spectral envelope of the speech signal to simulate the properties of a cochlear implant with a 6 mm place mismatch. Experiment 1 found that participants showed significantly greater improvement in perceiving noise-vocoded speech when training gave AV cues than when it gave auditory cues alone. Experiment 2 compared training with AV cues against training which gave written feedback. These two methods did not significantly differ in the pattern of learning they produced. Suggestions are made about the types of circumstances in which the two training methods might be found to differ in facilitating auditory perceptual learning of speech.

Keywords: audiovisual, perceptual learning, speech

1 Introduction

As we speak we produce movements in our jaw, lips, and tongue that are easily perceptible to the eye. This visual speech information gives useful sensory cues that help decode the speech that we hear, particularly in terms of features such as place of articulation, rounding, and stress (Summerfield, 1987; Dohen et al., 2004; Munhall et al., 2004). As a consequence of this, audiovisual (AV) speech, speech where a talker is both seen and heard, tends to be more perceptible than speech where the talker can only be heard. The advantages in perceiving AV speech over unimodal auditory speech are most apparent when the heard speech is either impoverished or distorted in some way (Erber, 1975; MacLeod & Summerfield, 1987; Gagné et al., 1994; Helfer & Freyman, 2005). However, advantages for AV speech presentation have been documented even under ideal listening conditions (e.g., Davis & Kim, 2004).

Corresponding author: Michael Pilling, Department of Psychology, Oxford Brookes University, Headington Campus, Oxford, OX3 0BP, UK Email: [email protected]

The visual cues in AV speech influence how speech is perceived. Such effects are demonstrated by the McGurk effect (McGurk & MacDonald, 1976; Munhall et al., 1996). In this phenomenon certain incongruent auditory and visual speech tokens, e.g., auditory /ba/ and visual /ga/, when presented simultaneously, result in a percept which is a fusion of the two modalities, e.g., /ta/. Evidence suggests that this integration of visual and auditory speech cues occurs at a relatively early stage of perceptual processing (Green, 1997; Schwartz et al., 2004), and that it is largely obligatory (Soto-Faraco et al., 2004).

AV cues can also play a role in speech learning. In infants, AV cues have been shown to facilitate discrimination of auditory speech contrasts (Teinonen et al., 2008). In adult populations, AV cues have been shown to facilitate learning of phonetic contrasts in L2 learners of English when the contrast is absent in the native language and produces visually salient movements for English speakers (Hazan et al., 2005; see also Hardison, 2003).

Another area of perceptual learning in which AV cues might play a role is learning to comprehend speech with altered characteristics. Noise-vocoded speech is one form of altered speech of particular scientific interest: its properties reflect those of natural speech heard through a cochlear implant device (Dorman et al., 1998). The noise-vocoding process retains the coarser spectral and temporal properties of speech while removing its finer spectro-temporal structure. For normal-hearing listeners the intelligibility of this speech is partly determined by the number of frequency bands used in its production (Shannon et al., 1995). These individual bands can be considered analogous to the independent electrode channels in an implant device. If at least eight bands are used the speech can be intelligible almost immediately, even to a naïve listener. If only four or fewer bands are used then initial speech intelligibility can be rather low, but is usually improved by training (Davis et al., 2005; Hervais-Adelman et al., 2008; Rota et al., 2008).

The electrode array of a cochlear implant can be inserted only part way into the cochlea. This means that the band limits of the speech-processor analysis filter driving each electrode tend to be mismatched to the characteristic frequency of the primary auditory nerve fibers which that electrode stimulates (see Shannon et al., 1998). The spectral envelope of noise-vocoded speech can also be shifted to simulate this basalward shift of excitation that is thought to be common with cochlear implants. The change in timbre resulting from this additional transformation of the speech signal has a generally negative effect on perceptibility. Negative effects are found on the initial intelligibility of upward-shifted speech even if it is constructed using a sufficient number of bands for the unshifted speech to be readily intelligible (Dorman et al., 1997; Shannon et al., 1998; Stacey & Summerfield, 2007). However, as with unshifted speech, perceptibility is often dramatically improved when training is given (Rosen et al., 1999; Fu & Galvin, 2003; Fu et al., 2006; Stacey & Summerfield, 2007, 2008).

Rosen et al. (1999) investigated learning of this shifted noise-vocoded speech in a study which gave participants access to AV cues during training.
In this study participants were trained using a connected discourse tracking task in which they attempted to repeat segments of spoken text in communication with a talker in a separate booth who could be seen (through a glass screen) but whose speech was heard only through a noise-vocoding circuit. This training was effective in improving purely auditory recognition of the altered speech: Before training, participants could identify less than one percent of the keywords by ear alone in an initial test block of sentences; after twelve sessions of training they could identify nearly forty percent of the keywords. Though these effects are impressive, the main aim of the study was not to assess the value of AV cues in auditory learning. As a consequence, no control training condition was given from which the specific benefit of AV cues could be determined. The current paper aims to determine the
effectiveness of AV cues in driving auditory learning of spectrally altered speech. This is done by comparing training with AV cues against training conditions which are auditory only (AO) in nature. The effectiveness of training in promoting learning of spectrally distorted speech is measured by the change in performance on AO test blocks given before and after training. Experiment 1 compared the effectiveness of three training conditions: AV, AO, and AO-Natural. AV training consisted of exposure to spectrally distorted auditory speech accompanied by a video of the face of the talker producing the original speech. AO training consisted of exposure to spectrally distorted auditory speech without the video. Finally, AO-Natural training consisted of exposure to auditory presentations of natural (i.e., undistorted) speech, also without the video.

2 Experiment 1

2.1 Method

2.1.1 Participants. Forty-two participants were recruited from the student population of the University of Nottingham. Participants were paid £10 for their time. All were native English speakers. All had normal hearing (pure-tone detection thresholds of no more than 25 dB HL across the range 250 to 8000 Hz) and normal, or corrected-to-normal, vision (20/20 acuity on a Snellen test). None had previously taken part in any study involving either perceptual learning or exposure to spectrally distorted speech.

2.1.2 Stimuli and equipment. A single male talker with a southern British accent was video recorded uttering the sentences of the BKB sentence battery (Bench et al., 1979). BKB sentences have a simple vocabulary and their use in measuring speech perceptibility is well established. Sentences contain 3 to 4 keywords (e.g., the orange is quite sweet; rain falls from the clouds). Recordings were made in a sound-attenuated room. The talker was recorded using a Sony DSR200AP digital camcorder. Video was recorded at a rate of 25 frames per second and audio at a sample rate of 48 kHz. The camcorder was placed 1.5 m from the talker's face, which was filmed from the front against a white screen. Three lamps were positioned to illuminate the face so that no shadowing occurred. Auditory speech was captured by a microphone attached below the talker's face and out of sight of the camera. The talker began each sentence from a neutral facial expression with both lips held together. Recordings were edited offline into separate clips of each sentence using video editing software (Final Cut Pro 4, Apple Inc., CA). During editing a 1000 ms still frame was placed at the beginning and end of each clip; this still frame was given a 500 ms fade in/out from black. The audio track was cleaned of any background noise using the noise removal algorithm in Audacity software V1.2.6 (Mazzoni et al., 2006).

Routines written in Matlab were used to produce the noise-vocoding and spectral shifting of the auditory speech. The technique used was similar to that described by Rosen et al. (1999). The auditory signal was first filtered through eight sixth-order elliptical IIR input filters. Filtered waveforms were half-wave rectified and low-pass filtered (160 Hz). Envelopes were then multiplied by low-pass filtered (10 kHz) white noise. The signal from each of the channels was then filtered by eight sixth-order elliptical IIR output filters whose central frequencies had been shifted upwards to simulate a 6 mm basilar membrane displacement. The ranges and central frequencies of the individual input and output filters are given in Table 1. The eight channels were then summed into a single digital waveform.

The output auditory waveform was resynchronized with the video recording to create the AV stimulus materials for the experiment. Resynchronization of the output waveform was done by manually matching its start and end points with those of the original recorded auditory waveform in the video file.
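For illustration, the following Python sketch implements a shifted noise-vocoder of the kind just described, using the band edges from Table 1 (below). It is not the authors' Matlab code: the filter ripple values, the smoothing filters for the envelope and noise carrier, and the use of SciPy are assumptions made for the example.

import numpy as np
from scipy.signal import ellip, butter, sosfilt

FS = 48000  # audio sample rate of the recordings

# Input and output band edges in Hz (Table 1)
INPUT_BANDS = [(350, 530), (530, 773), (773, 1101), (1101, 1543),
               (1543, 2140), (2140, 2946), (2946, 4033), (4033, 5500)]
OUTPUT_BANDS = [(1051, 1428), (1428, 1984), (1984, 2736), (2736, 3749),
                (3749, 5117), (5117, 6962), (6962, 9453), (9453, 12813)]

def bandpass(x, lo, hi, fs=FS):
    # Sixth-order elliptical band-pass filter (1 dB ripple / 50 dB stopband are assumed values)
    sos = ellip(6, 1, 50, [lo, hi], btype='bandpass', output='sos', fs=fs)
    return sosfilt(sos, x)

def envelope(x, cutoff=160.0, fs=FS):
    # Half-wave rectify, then low-pass filter at 160 Hz to extract the amplitude envelope
    sos = butter(4, cutoff, btype='low', output='sos', fs=fs)
    return sosfilt(sos, np.maximum(x, 0.0))

def shifted_noise_vocode(speech, fs=FS):
    speech = np.asarray(speech, dtype=float)
    # Low-pass filtered (10 kHz) white-noise carrier
    noise = np.random.default_rng(0).standard_normal(len(speech))
    noise = sosfilt(butter(4, 10000, btype='low', output='sos', fs=fs), noise)
    out = np.zeros_like(speech)
    for (in_lo, in_hi), (out_lo, out_hi) in zip(INPUT_BANDS, OUTPUT_BANDS):
        env = envelope(bandpass(speech, in_lo, in_hi, fs), fs=fs)
        # Modulate the carrier with each band's envelope and re-filter into the
        # upward-shifted output band, simulating a 6 mm basalward shift of excitation
        out += bandpass(env * noise, out_lo, out_hi, fs)
    return out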

Table 1. Range and central frequency of the eight input and output filter bands used in noise-vocoding

Band   Input frequency range (Hz)   Input central frequency (Hz)   Output frequency range (Hz)   Output central frequency (Hz)
1      350–530                       433                           1051–1428                      1206
2      530–773                       642                           1428–1984                      1685
3      773–1101                      925                           1984–2736                      2332
4      1101–1543                    1306                           2736–3749                      3205
5      1543–2140                    1820                           3749–5117                      4382
6      2140–2946                    2513                           5117–6962                      5971
7      2946–4033                    3449                           6962–9453                      8115
8      4033–5500                    4712                           9453–12813                    11007
This was done within the video editing software described earlier. After synchronization was completed the original clean auditory waveform was deleted from the video file, leaving only the noise-vocoded auditory waveform. This version of the audiovisual file was then converted to Apple QuickTime format for use in the experiment.

The experiment was performed on a G4 Apple Macintosh computer. Video was displayed on a Macintosh plasma VDU. Sound was presented through a separate loudspeaker connected to the computer via an amplifier. Software routines written in SuperCard (V. 4.1.1, Solutions Etcetera, CA) controlled all aspects of trial randomization, stimulus presentation, and recording of typed responses. Participants typed in responses using a standard Macintosh computer keyboard; these were recorded onto the computer's hard disk. For all participants, a highly intelligible noise-vocoded sentence (24 frequency bands and no frequency shift) was given as an initial practice trial. This was done to familiarize participants with the task and to ensure that the instructions were understood correctly.

2.1.3 Procedure. Experiment 1 consisted of three trial blocks: Pretraining, Training, and Posttraining. All blocks contained 76 sentences. Sentences were allocated to blocks pseudorandomly with the constraint that no sentence appeared in more than one block and that all blocks had an equal number of keywords (235). Within each block no sentence was ever repeated. Randomization of the sentence materials was done separately for each participant. All participants performed the same Pretraining and Posttraining blocks, which functioned as test blocks. Participants were randomly assigned in equal numbers to one of three training groups: AV, AO, AO-Natural.

For test and training blocks participants were required to listen to the speech (while also looking at the screen in the AV training condition) and type as much of each sentence as was understood into the computer using the keyboard. Participants were encouraged to guess if unsure. If they did not understand anything from a particular sentence they were told to type "I don't know". Participants pressed the return key to record their responses into the computer. The pressing of this key instigated the next trial after a 1000 ms blank interval. No feedback was given on any trial. A five minute break was given between successive blocks.
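As an illustration of the block allocation constraint described in the Procedure above (three blocks of 76 sentences, each containing exactly 235 keywords, with no sentence repeated), the Python sketch below uses simple rejection sampling. The data format and function name are assumptions; the authors' SuperCard routines may have worked quite differently.

import random

def allocate_blocks(sentences, n_blocks=3, block_size=76,
                    keywords_per_block=235, max_tries=100_000):
    # sentences: list of (sentence_text, n_keywords) pairs
    pool = list(sentences)
    for _ in range(max_tries):
        random.shuffle(pool)
        blocks = [pool[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
        # Accept the shuffle only if every block has exactly the required keyword total
        if all(sum(n for _, n in block) == keywords_per_block for block in blocks):
            return blocks
    raise RuntimeError("No allocation met the keyword constraint; "
                       "a swap-based repair would be more efficient")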

2.2 Results and discussion

A loose-keyword scoring method (Bench et al., 1979) was used to determine performance in each block. In this method each correct keyword receives a single point, irrespective of the order in which words are reported, and morphological errors are accepted as correct. The total number of keywords correct was calculated for each participant using these criteria.
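By way of illustration, a minimal Python sketch of this loose-keyword scoring rule is given below. The suffix-stripping used to tolerate morphological errors is a rough assumption for the example, not the authors' actual scoring procedure.

def normalise(word):
    # Strip punctuation and common inflectional suffixes so that, e.g., a typed
    # "fall" still matches the keyword "falls" (a crude stand-in for accepting
    # morphological errors).
    word = word.lower().strip(".,!?'\"")
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def loose_keyword_score(response, keywords):
    # One point per keyword found anywhere in the typed response; order is ignored.
    response_words = {normalise(w) for w in response.split()}
    return sum(normalise(k) in response_words for k in keywords)

# Example: three keywords from "rain falls from the clouds"
print(loose_keyword_score("the rain fall from cloud", ["rain", "falls", "clouds"]))  # 3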

Mean scores in each block are given in Table 2, separately for each of the three training groups. The fourth column of the table shows the mean improvement for each training group, calculated by subtracting performance in the Pretraining block from that in the Posttraining block.

Table 2. Mean correctly identified keywords in the Pretraining, Training, and Posttraining blocks (from a maximum of 235) in Experiment 1

Training group   Pretraining   Training   Posttraining   Improvement (Posttraining minus Pretraining)
AO-Natural       64            234        93             29
AO               57            81         90             33
AV               53            173        115            62

For the training blocks, analysis showed that the number of correctly reported keywords was significantly higher in the AV condition than in the AO condition, t(26) = 5.54, p < .001; however, AV training block performance was still lower than performance in the AO-Natural training block, t(26) = 5.33, p < .001. A one-way ANOVA was used to examine test block performance. This ANOVA compared the mean test improvement scores (i.e., Posttraining minus Pretraining) for the three training groups and showed a significant effect of training group, F(2, 39) = 10.7, p < .001. Post-hoc testing (Studentized Newman-Keuls) indicated that this effect was the result of the AV-trained group showing greater improvement across the test blocks than either the AO or the AO-Natural training group (p < .05); the AO and AO-Natural training groups did not differ statistically in the level of improvement produced after training (p > .05).

Thus, Experiment 1 showed that AV training was more effective than AO training in promoting learning of spectrally altered speech. It suggested that exposure to AV speech can drive learning at an auditory perceptual level. Experiment 2 further explored this AV training effect.

3 Experiment 2

In Experiment 1 training was evaluated only across performance on two test blocks presented before and after a single block of training. The results from this experiment therefore give little indication of the time course that training might take. Experiment 2 gave shorter but more frequent alternating training and test blocks throughout the experiment. This allowed a more continuous monitoring of the effect of training on the auditory learning of speech.

Experiment 2 also had a second purpose. Previous studies have demonstrated that perceptual learning of speech can be facilitated by presenting written feedback alongside auditory speech (Davis et al., 2005; Stacey & Summerfield, 2007, 2008). Written feedback seems to benefit auditory perceptual learning because of the lexical cues it provides: no effect of written feedback is found when training uses non-word speech (Davis et al., 2005; cf. Hervais-Adelman et al., 2008). Experiment 2 compared the AV training effect identified in Experiment 1 with training using written feedback (AO+Text). The AV and AO+Text training groups were both compared against an AO training group given as a baseline training condition. As in Experiment 1 the AO training group received only auditory speech on training blocks. The effectiveness of training was measured by changes in test block performance over the course of the experiment. Test blocks were presented as AO for all three training groups.

Table 3. Mean overall correctly identified keywords in the test blocks (from a maximum of 435) in Experiment 2

Training group   Test blocks
AO               128
AV               204
AO+Text          221

3.1 Method

3.1.1 Participants. Forty-five students of the University of Nottingham were selected using the same inclusion criteria as in Experiment 1. Participants all received a payment of £10. No participant had taken part in Experiment 1. Participants were allocated in equal numbers to one of the three training groups using a randomization procedure.

3.1.2 Stimuli. Stimuli were recordings of BKB sentences of the same origin as in Experiment 1: 285 recorded sentences were used, of which 145 were allocated as test trial stimuli and 140 as training stimuli.

3.1.3 Procedure. Training and test blocks each contained five sentences. As in Experiment 1, randomization of sentences into the test blocks was done separately for each participant, with the constraint that no sentence was presented twice to a participant and that the five sentences in each test block always contained 15 keywords. The AO and AV training trials were the same as described for Experiment 1. AO+Text training consisted of presentation of the distorted speech simultaneously with the text of the sentence displayed on screen. Text was presented in uppercase at 45 point in Helvetica Neue font. The text scrolled across the computer screen word by word, moving from right to left, at a rate which approximated that of the talker's recorded speech.

The first block was a test block. After this, alternating training and test blocks were given; the last block was also a test block. In all, 28 training and 29 test blocks were given. On all trials participants were instructed to type as much as they could decipher from the presented sentence into the computer (see Note 1). They were encouraged to guess if unsure and told to type "I don't know" on trials where they were unable to guess anything of the sentence content. No break was given between blocks, though participants were informed that they could pause at any point in the experiment by delaying pressing the return key after typing in their response to a trial. A practice sentence was given before starting the experiment, as described for Experiment 1.
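The word-by-word text presentation in the AO+Text condition can be illustrated with the short Python sketch below. Word onset times are assumed to come from some alignment of the recording (the onsets shown are invented), and the console print is a stand-in for drawing the text on screen; this is not the authors' SuperCard implementation.

import time

def present_text(words, onsets_s):
    # Reveal each word at (approximately) its onset time in the audio track.
    start = time.monotonic()
    shown = []
    for word, onset in zip(words, onsets_s):
        delay = onset - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        shown.append(word.upper())          # text was displayed in uppercase
        print(" ".join(shown), flush=True)  # stand-in for updating the display

# Example with assumed word onsets (seconds) for one BKB-style sentence
present_text(["the", "clown", "had", "a", "funny", "face"],
             [0.0, 0.25, 0.62, 0.85, 0.98, 1.40])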

3.2 Results and discussion

The same loose-keyword scoring method was used as in Experiment 1. Table 3 gives the mean overall number of keywords correctly identified across the test blocks for the three training groups. Figure 1 shows a plot of the number of keywords recognized in individual test blocks for the three training groups. The three groups did not differ statistically from one another in performance on the first test block, F(2, 42) = 2.76, MSE = 1.29, p > .05. Performance in subsequent test blocks was analyzed by organizing the test blocks into four separate bins: the first bin contained test blocks 2–8, the second blocks 9–15, the third blocks 16–22, and the fourth blocks 23–29. These were then entered into a two-way mixed ANOVA, with training condition as a three-level
independent factor (AO, AV, AO+Text), and training duration as a four-level related factor (bin 1, bin 2, bin 3, bin 4). There was a significant main effect of training condition, F(2, 42) = 7.14, MSE = 26.30, p < .01, and a significant linear main effect of training duration, F(1, 42) = 40.63, MSE = 3.42, p < .001. There was no significant interaction between the main effects, F(2, 42) = 0.52, MSE = 3.42, p > .05. Post-hoc testing (Studentized Newman-Keuls) of the training condition main effect revealed that test performance with AO-distorted training was significantly poorer than with AV-distorted or AO+Text training. However, the AV and AO+Text trained groups did not differ from each other statistically in test performance in this analysis (p > .05).

The absence of a significant interaction between the main effects suggests that differences in performance between the training groups occurred during the initial training blocks and that these differences were then maintained across the rest of the experiment. This interpretation is supported by inspection of the plots of the individual test blocks in Figure 1. It can be seen that, for all three training groups, the most rapid performance increases occur in the initial test blocks and begin to asymptote in later test blocks.
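For concreteness, the binning step of this analysis is sketched below in Python. The input array shape (participants x 29 test blocks of keyword counts) and the use of per-bin totals are assumptions made for the example; only the grouping of blocks 2–8, 9–15, 16–22, and 23–29 follows the description above.

import numpy as np

def bin_test_blocks(scores):
    # scores: array of shape (n_participants, 29) holding keyword counts per test block
    post_first = scores[:, 1:]                      # drop test block 1 (analyzed separately)
    binned = post_first.reshape(scores.shape[0], 4, 7)
    return binned.sum(axis=2)                       # (n_participants, 4) totals per bin

# Demonstration with fabricated data: 15 participants, 0-15 keywords per block
demo = np.random.default_rng(1).integers(0, 16, size=(15, 29))
print(bin_test_blocks(demo).shape)                  # (15, 4)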

Figure 1. Number of correctly identified keywords across the 29 test blocks in Experiment 2 for each of the three training conditions (AV, AO, AO+Text). [Line plot: number of correct keywords (0–15) on the vertical axis against test block (1–29) on the horizontal axis.]

4 General discussion

Experiments 1 and 2 both demonstrate the effectiveness of AV training in facilitating perceptual learning of spectrally altered speech. One question concerns how this facilitation occurs. When the
auditory system first encounters spectrally altered speech it presumably has some difficulty in identifying its phonetic structure: the features of the incoming speech will differ somewhat from those of stored phonetic representations. Visual speech cues may assist in this process. Visual speech contains perceptually salient cues about the phonetic structure of auditory speech, cues which tend to anticipate their occurrence in the auditory signal (Grant et al., 1998; Hazen, 2006). It has been suggested that these cues operate by constraining the phonetic categorization of speech; for instance, if speech is being articulated on the lips this indicates that a given sound is a bilabial consonant such as /p/, /b/ or /m/ (see van Wassenhove et al., 2005). By constraining the phonetic interpretation of the auditory signal in this way, visual cues might simultaneously provide guidance to auditory learning mechanisms in adjusting to the characteristics of the speech. This additional guidance is likely to make learning more successful than when the perceiver must rely on auditory cues alone.

Experiment 2 also showed that AO training with written feedback produced greater learning than observed with AO speech alone, replicating several earlier findings (e.g., Davis et al., 2005; Stacey & Summerfield, 2007). This form of training was as successful as AV training in facilitating perceptual learning of the given speech. The two forms of training are similar inasmuch as they both give additional visually presented cues to disambiguate the altered speech. At a phenomenal level both methods produce an immediate improvement in the perceived clarity of the speech heard on training trials (an effect described for written feedback as "pop-out", Davis et al., 2005). However, while written feedback is thought to give top-down lexical information, visual speech is more likely to give cues which are largely bottom-up and sensory in nature. Our results suggest that, under the specific conditions of our experiment at least, top-down lexical guidance is as effective as guidance from visual sensory cues in facilitating auditory learning.

This similarity in the effectiveness of the two training methods is likely to be coincidental and is unlikely to hold under all circumstances. The time course of perceptual learning for noise-vocoded speech tends to be longer when the speech is created from fewer than the eight bands used in the current study. With four-band noise-vocoded speech, Rosen et al. (1999) found that learning continued to some degree over sessions spanning a number of days. The two training methods may be found to differ if the auditory speech is constructed from fewer bands, under conditions where learning is more protracted. It is known that AV cues benefit speech perception across a wide range of auditory conditions, even where only minimal auditory speech cues are available (e.g., Erber, 1975; Saldaña et al., 1996). It is quite possible that written feedback is more limited in the range of conditions over which it is able to guide speech perception mechanisms. To test this, further research is needed to assess the relative effectiveness of the two training methods over a varied range of noise-vocoded speech conditions.

There are other factors which may be of differential importance to the two training methods. Compared to written feedback, the effectiveness of AV training is likely to be more dependent on the characteristics of the talker used.
It is known that talkers vary considerably in the visual intelligibility of the speech they produce (Kricos & Lesner, 1982; Lesner, 1988; Demorest & Bernstein, 1992; Daly et al., 1996). Relative to written feedback training, AV training will probably be less effective with a talker with low visual speech intelligibility. Finally, the choice of speech materials themselves might influence the relative effectiveness of the two training methods. Written feedback training depends on the lexicality of the training speech materials: no training effect occurs when the items are non-words (Davis et al., 2005). If, as suggested above, AV training depends only on sensory cues, then facilitation should be found even with non-word speech materials; AV training should drive perceptual learning just as well with non-word materials as with speech consisting of meaningful sentences.

Further research is clearly required to understand the different circumstances in which AV speech and written text are effective. What can be concluded from the current study is that AV
speech cues, at least under some circumstances, are as effective as written feedback in driving perceptual learning.

Practical suggestions can be made about how this form of training might be incorporated into a clinical context. Children who are prelingually deafened often have poor reading skills, possibly as a result of a less comprehensive understanding of grapheme–phoneme relations (e.g., James et al., 2008). However, such children, when fitted with implants, still derive considerable immediate benefits from access to AV cues when listening to speech (Bergeson et al., 2005). Such individuals may therefore find greater benefit from training which utilizes AV cues than from training which relies on the presentation of text or on auditory presentations alone. Research is needed which evaluates the usefulness of AV training in comparison to other methods in the context of cochlear implant rehabilitation.

5 Conclusion

AV training was shown to reliably facilitate learning of spectrally distorted speech. It is suggested that this effect is driven by bottom-up sensory guidance of auditory learning mechanisms. Under the circumstances tested, AV training was at least as effective as training giving written feedback. Further research in both normal hearing and clinical populations is needed to determine the conditions under which the two methods are most useful in facilitating learning of this speech.

Acknowledgements

Some of these data were presented at the BSA meeting, Cambridge, UK, 14–15 September 2006, and at AVSP (Hilvarenbeek, Netherlands, 2007). The first author is now at Oxford Brookes University, Headington, Oxford, UK. The comments of Andrew Faulkner and one anonymous referee on previous versions of this paper are gratefully acknowledged.

Note

1. For the training trials of the AO+Text condition this was a trivial task, because participants were presented with the sentence content in the form of the displayed text. However, requiring a typed response ensured that participants attended to the written cues as they were presented on each training trial.

References

Bench, J., Kowal, A., & Bamford, J. (1979). The BKB (Bamford-Kowal-Bench) sentence lists for partially-hearing children. British Journal of Audiology, 13, 108–112.
Bergeson, T. R., Pisoni, D. B., & Davis, R. A. O. (2005). Development of audiovisual comprehension skills in prelingually deaf children with cochlear implants. Ear & Hearing, 26, 149–164.
Daly, N., Bench, J., & Chappell, H. (1996). Gender differences in speechreadability. Journal of the Academy of Rehabilitative Audiology, 29, 27–40.
Davis, C., & Kim, J. (2004). Audio-visual interactions with intact clearly audible speech. Quarterly Journal of Experimental Psychology A, 57, 1103–1121.
Davis, M. H., Johnsrude, I. S., Hervais-Adelman, A., Taylor, K., & McGettigan, C. (2005). Lexical information drives perceptual learning of distorted speech: Evidence from comprehension of noise-vocoded sentences. Journal of Experimental Psychology: General, 134, 222–241.
Demorest, M. E., & Bernstein, L. E. (1992). Sources of variability in speechreading sentences: A generalizability analysis. Journal of Speech & Hearing Research, 35, 876–891.
Dohen, M., Lœvenbruck, H., Cathiard, M. A., & Schwartz, J. L. (2004). Visual perception of contrastive focus in reiterant French speech. Speech Communication, 44, 155–172.
Dorman, M. F., Loizou, P. C., & Rainey, D. (1997). Simulating the effect of cochlear-implant electrode insertion depth on speech understanding. Journal of the Acoustical Society of America, 102, 2993–2996.
Dorman, M. F., Loizou, P. C., Fitzke, J., & Tu, Z. (1998). The recognition of sentences in noise by normal-hearing listeners using simulations of cochlear-implant signal processors with 6–20 channels. Journal of the Acoustical Society of America, 104, 3583–3585.
Erber, N. P. (1975). Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40, 481–492.
Fu, Q. J., & Galvin, J. J. (2003). The effects of short-term training for spectrally mismatched noise-band speech. Journal of the Acoustical Society of America, 113, 1065–1072.
Fu, Q. J., Nogaki, G., & Galvin, J. J. (2006). Auditory training with spectrally shifted speech: Implications for cochlear implant patient auditory rehabilitation. Journal of the Association for Research in Otolaryngology, 6, 180–189.
Gagné, J. P., Masterson, V. M., Munhall, K. G., Bilida, N., & Querengesser, C. (1994). Across talker variability in auditory, visual, and audiovisual speech intelligibility for conversational and clear speech. Journal of the Academy of Rehabilitative Audiology, 27, 135–158.
Grant, K. W., Walden, B. E., & Seitz, P. F. (1998). Auditory-visual speech recognition by hearing-impaired subjects: Consonant recognition, sentence recognition, and auditory-visual integration. Journal of the Acoustical Society of America, 103, 2677–2690.
Green, K. P. (1997). The use of auditory and visual information during phonetic processing: Implications for theories of speech perception. In R. Campbell, B. Dodd, & D. Burnham (Eds.), Hearing by eye II: Advances in the psychology of speechreading and auditory-visual speech (pp. 3–25). Hove: Psychology Press.
Hardison, D. (2003). Acquisition of second-language speech: Effects of visual cues, context and talker variability. Applied Psycholinguistics, 24, 495–522.
Hazan, V., Sennema, A., Iba, M., & Faulkner, A. (2005). Effect of audiovisual perceptual training on the perception and production of consonants by Japanese learners of English. Speech Communication, 47, 360–378.
Hazen, T. J. (2006). Visual model structures and synchrony constraints for audio-visual speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 14, 1082–1089.
Helfer, K. S., & Freyman, R. L. (2005). The role of visual speech cues in reducing energetic and informational masking. Journal of the Acoustical Society of America, 117, 842–849.
Hervais-Adelman, A., Davis, M. H., Johnsrude, I. S., & Carlyon, R. P. (2008). Perceptual learning of noise-vocoded words: Effects of feedback and lexicality. Journal of Experimental Psychology: Human Perception & Performance, 34, 460–474.
James, D., Brinton, J., Rajput, K., & Goswami, U. (2008). Phonological awareness, vocabulary, and word reading in children who use cochlear implants: Does age of implantation explain individual variability in performance outcomes and growth? Journal of Deaf Studies and Deaf Education, 13, 117–137.
Kricos, P. B., & Lesner, S. A. (1982). Differences in visual intelligibility across talkers. The Volta Review, 84, 219–225.
Lesner, S. A. (1988). The talker. The Volta Review, 90, 89–98.
MacLeod, A., & Summerfield, Q. (1987). Quantifying the contribution of vision to speech perception in noise. British Journal of Audiology, 21, 131–141.
Mazzoni, D., Brubeck, M., Crook, J., Johnson, V., & Meyer, M. (2006). Audacity: A free digital audio editor. http://audacity.sourceforge.net
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Munhall, K. G., Gribble, P., Sacco, L., & Ward, M. (1996). Temporal constraints on the McGurk effect. Perception and Psychophysics, 58, 351–362.
Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility. Psychological Science, 15, 133–137.
Rosen, S., Faulkner, A., & Wilkinson, L. (1999). Perceptual adaptation by normal listeners to upwards shifts of spectral information in speech and its relevance for users of cochlear implants. Journal of the Acoustical Society of America, 106, 3629–3636.
Rota, G., Turicchia, L., Veit, R., Guazzelli, M., Birbaumer, N., & Dogil, G. (2008). Perceptual learning of speech processed by a cochlear implant simulator: An fMRI investigation. International Journal of Psychophysiology, 69, 225–226.
Saldaña, H. M., Pisoni, D. B., Fellowes, J. M., & Remez, R. E. (1996). Audio-visual speech perception without speech cues. Proceedings of the 4th International Conference on Spoken Language Processing, Philadelphia, PA, 2187–2190.
Schwartz, J. L., Berthommier, F., & Savariaux, C. (2004). Seeing to hear better: Evidence for early audiovisual interactions in speech identification. Cognition, 93, B69–B78.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303–304.
Shannon, R. V., Zeng, F. G., & Wygonski, J. (1998). Speech recognition with altered spectral distribution of envelope cues. Journal of the Acoustical Society of America, 104, 2467–2476.
Soto-Faraco, S., Navarra, J., & Alsius, A. (2004). Assessing automaticity in audiovisual speech integration: Evidence from the speeded classification task. Cognition, 92, B13–B23.
Stacey, P., & Summerfield, Q. (2007). Effectiveness of computer-based auditory training in improving the perception of noise-vocoded speech. Journal of the Acoustical Society of America, 121, 2923–2935.
Stacey, P., & Summerfield, Q. (2008). Comparison of word-, sentence-, and phoneme-based training strategies in improving the perception of spectrally distorted speech. Journal of Speech, Language & Hearing Research, 51, 526–538.
Summerfield, Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech perception. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lip-reading (pp. 3–51). London: Lawrence Erlbaum.
Teinonen, T., Aslin, R. N., Alku, P., & Csibra, G. (2008). Visual speech contributes to phonetic learning in 6-month-old infants. Cognition, 108, 850–855.
van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences, 102, 1181–1186.