Dichotic speech recognition in noise using reduced spectral cues

Philipos C. Loizou a) and Arunvijay Mani
Department of Electrical Engineering, University of Texas at Dallas, P.O. Box 830688, EC 33, Richardson, Texas 75083-0688

Michael F. Dorman
Arizona Biomedical Institute, Arizona State University, Tempe, Arizona 85287

a) Electronic mail: [email protected]
(Received 30 August 2002; revised 25 April 2003; accepted 25 April 2003)

It is generally accepted that the fusion of two speech signals presented dichotically is affected by the relative onset time. This study investigated the hypothesis that spectral resolution might be an additional factor influencing spectral fusion when the spectral information is split and presented dichotically to the two ears. To produce speech with varying degrees of spectral resolution, speech materials embedded in +5 dB S/N speech-shaped noise were processed through 6–12 channels and synthesized as a sum of sine waves. Two different methods of splitting the spectral information were investigated. In the first method, the odd-index channels were presented to one ear and the even-index channels to the other ear. In the second method, the low-frequency channels were presented to one ear and the high-frequency channels to the other ear. Results indicated that spectral resolution did affect spectral fusion, and the effect differed across speech materials, with the sentences being affected the most. Sentences processed through six or eight channels and presented dichotically in the low–high frequency condition were not fused as accurately as when presented monaurally. Sentences presented dichotically in the odd–even frequency condition were identified more accurately than when presented in the low–high condition. © 2003 Acoustical Society of America. [DOI: 10.1121/1.1582861]

PACS numbers: 43.71.Es, 43.71.Ky [KWG]
I. INTRODUCTION
Since the seminal paper by Cherry in the early 1950s (Cherry, 1953) on the "cocktail party" effect and dichotic listening, much work has been done to understand dichotic speech perception (Cutting, 1976). Several researchers have demonstrated the remarkable ability of the brain to fuse two different signals presented simultaneously to the two ears (i.e., presented dichotically) into a single auditory percept. Fusion of speech signals can occur in many forms (Cutting, 1976). Broadbent and Ladefoged (1957) demonstrated that when listeners were presented simultaneously with a signal containing the F1 of /da/ in the left ear and a signal containing the F2 of /da/ in the right ear, subjects heard the stimulus as /da/. Others (e.g., Halwes, 1970; Cutting, 1976) have demonstrated that if a particular /ba/ is presented to one ear and a particular /ga/ is presented to the other ear, listeners often report hearing a single item, /da/. Sound localization is yet another form of fusion, since the acoustic signals received at each ear differ in amplitude and phase (time difference). Cutting (1976) identified and analyzed six different types of fusion that occur at possibly three different levels of auditory processing. Cutting termed the fusion observed in the Broadbent and Ladefoged (1957) study "spectral" fusion, and that is the focus of this paper. Spectral fusion occurs when different, but complementary, spectral regions of the same signal are presented to opposite ears. Much work has been done to understand the various
factors influencing spectral fusion (Cutting, 1976). Three factors were found to be most important: relative onset time, relative intensity, and fundamental frequency (F0). Cutting (1976) delayed the information reaching one ear with respect to the other and found that spectral fusion decreased significantly when the delay was increased beyond 40 ms. Rand (1974) found no effect on fusion when the signal presented to one ear was attenuated by as much as 40 dB, suggesting that spectral fusion is immune to relative intensity variations. Similar experiments were carried out to examine the effect of differences in fundamental frequency (F0) on spectral fusion. When listeners were presented with signals differing in F0, they reported hearing two sounds; however, the identity of the fused percept was maintained independent of the differences in F0 (Darwin, 1981; Cutting, 1976). Darwin (1981) demonstrated that the identification of ten three-formant vowels was unaffected by formants excited at different fundamentals or starting at 100-ms intervals. In summary, of the three factors investigated in the literature, the relative onset-timing difference seemed to have the largest effect on dichotic speech perception.

In this paper, we investigate whether spectral resolution could be considered another factor that could potentially influence spectral fusion. Spectral resolution is an issue that needs to be taken into account when dealing with cochlear implant listeners, who are known to receive a limited amount of spectral information. The recent introduction of bilateral cochlear implants raised the question of whether cochlear implant listeners would be able to fuse speech information presented dichotically. The possible advantage of dichotic (electrical) stimulation is a reduction in channel interaction, since one can stimulate the electrodes alternately across the two ears (e.g., electrode 1 in the left ear followed by electrode 2 in the right ear, and so on).
TABLE I. The 3-dB cutoff frequencies (Hz) of the bandpass filters used in this study. FL and FH indicate the low and high cutoff frequencies, respectively, of the bandpass filters.

                6 channels            8 channels            12 channels
Channel        FL        FH          FL        FH          FL        FH
  1          300.0     487.2       261.9     526.6       191.6     356.8
  2          487.2     791.1       526.6     857.3       356.8     549.5
  3          791.1    1284.5       857.3    1270.4       549.5     774.3
  4         1284.5    2085.8      1270.4    1786.5       774.3    1036.6
  5         2085.8    3387.1      1786.5    2431.2      1036.6    1342.5
  6         3387.1    5500.0      2431.2    3236.7      1342.5    1699.4
  7                               3236.7    4242.9      1699.4    2115.8
  8                               4242.9    5500.0      2115.8    2601.5
  9                                                     2601.5    3168.2
 10                                                     3168.2    3829.2
 11                                                     3829.2    4600.4
 12                                                     4600.4    5500.0
To examine the effect of spectral resolution on dichotic speech recognition, noisy speech was processed into a small number (6–12) of channels and presented to normal-hearing listeners dichotically. Two different conditions were considered. In the first condition, low-frequency information was presented to one ear and high-frequency information was presented to the other ear. In the second condition, the frequency information was interleaved between the two ears, with the odd-index frequency channels fed to one ear and the even-index channels fed to the other ear. At issue is whether spectral fusion is affected by (1) poor spectral resolution and/or (2) the way spectral information is split (low/high versus interleaved) and presented to the two ears. We hypothesize that both spectral resolution and the type of spectral information presented to the two ears will affect spectral fusion.

II. DICHOTIC LISTENING IN NOISE

A. Method
1. Subjects
Nine normal-hearing listeners (20 to 30 years of age) participated in this experiment. All subjects were native speakers of American English and were paid for their participation. The subjects were undergraduate students at the University of Texas at Dallas.

2. Speech material
Subjects were tested on sentence, vowel, and consonant recognition. The vowel test included the syllables "heed, hid, hayed, head, had, hod, hud, hood, hoed, who'd, heard" produced by male and female talkers. A total of 22 vowel tokens were used for testing, 11 produced by seven male speakers and 11 produced by six female speakers (not all speakers produced all 11 vowels). These tokens were a subset of the vowels used in Loizou et al. (1998) and were selected from a large vowel database (Hillenbrand et al., 1995) to represent the complete area of the vowel space.
The consonant test used 16 consonants in /aCa/ context taken from the Iowa consonant test (Tyler et al., 1987). All /aCa/ syllables were produced by a male speaker. The sentence test used sentences from the HINT database (Nilsson et al., 1994). Two lists of 20 sentences each were used for each condition, and different lists were used in all conditions.

3. Signal processing
Speech material was first low-pass filtered using a sixth-order elliptical filter with a cutoff frequency of 6000 Hz. The filtered speech was passed through a preemphasis (high-pass) filter with a cutoff frequency of 2000 Hz. This was followed by band-pass filtering into n (n = 6, 8, 12) frequency bands using sixth-order Butterworth filters. The cutoff frequencies of the bandpass filters are given in Table I. Logarithmic frequency spacing was used for n = 6, and mel frequency spacing (linear spacing up to 1000 Hz and logarithmic thereafter) was used for n = 8 and 12.
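As an illustration (not the authors' code), the six-channel cutoffs in Table I are consistent with logarithmically spaced band edges between 300 and 5500 Hz; the short sketch below generates such edges. The mel-like spacing used for 8 and 12 channels follows a different warping, and the actual edge values used are simply those listed in Table I.

```python
# Illustrative sketch only: log-spaced band edges between 300 and 5500 Hz
# reproduce the 6-channel cutoffs of Table I to within rounding. The mel-like
# spacing used for n = 8 and 12 is not reproduced here.
import numpy as np

def log_spaced_edges(f_low, f_high, n_bands):
    """Return n_bands + 1 band edges spaced logarithmically between f_low and f_high."""
    return f_low * (f_high / f_low) ** (np.arange(n_bands + 1) / n_bands)

# Yields edges close to the 6-channel values in Table I (small rounding differences).
print(np.round(log_spaced_edges(300.0, 5500.0, 6), 1))
```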
The output of each channel was passed through a full-wave rectifier followed by a second-order Butterworth low-pass filter with a cutoff frequency of 400 Hz to obtain the envelope of each channel output. For each channel, a sinusoid was generated with frequency set to the center frequency of the channel and with amplitude set to the root-mean-squared (rms) energy of the channel envelope, estimated every 4 ms. The phases of the sinusoids were estimated from the FFT of the speech segment, as per Loizou et al. (1999). No interpolation was done on the amplitudes or phases to smooth out discontinuities across the 4-ms segments. Two sets of sine waves were synthesized for dichotic presentation, one corresponding to the left ear and one corresponding to the right ear. The sine waves with frequencies corresponding to the left-ear channels were summed to produce the left-ear signal, and, similarly, the sine waves corresponding to the right-ear channels were summed to produce the right-ear signal. In the condition, for instance, in which the low-frequency information was presented to the left ear and the high-frequency information was presented to the right ear, the envelope amplitudes corresponding to channels 1 through n/2 were used to synthesize the left-ear signal and the amplitudes corresponding to channels n/2+1 through n were used to synthesize the right-ear signal. The levels of the two synthesized signals (left and right) were adjusted so that the sum of the two levels was equal to the rms value of the original speech segment. This was done by multiplying the left and right signals by the same energy normalization value. Hence, no imbalance was introduced, in terms of level differences or spectral tilt, between the left and right envelope amplitudes.
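For readers who want a concrete picture of this processing chain, the following Python/SciPy sketch outlines the main steps (bandpass analysis, envelope extraction, frame-by-frame sine-wave synthesis, dichotic channel assignment, and level normalization). It is a simplified illustration rather than the authors' implementation: the sampling rate, the geometric-mean center frequencies, and the helper-function names are assumptions, and the sinusoid phases here are simply continuous rather than derived from the FFT of each segment as described above.

```python
# Simplified sketch of the tone-vocoder processing and dichotic channel split
# described above. Not the authors' implementation: sampling rate, geometric-
# mean center frequencies, and function names are assumed; sinusoid phases are
# continuous here instead of being derived from the FFT of each segment.
import numpy as np
from scipy.signal import butter, lfilter

FS = 22050                                                        # assumed sampling rate
EDGES_6CH = [300.0, 487.2, 791.1, 1284.5, 2085.8, 3387.1, 5500.0]  # Table I, 6 channels

def bandpass(x, lo, hi, fs, order=6):
    """Sixth-order Butterworth bandpass filter, as described in the text."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return lfilter(b, a, x)

def envelope(x, fs, cutoff=400.0):
    """Full-wave rectification followed by a second-order 400-Hz low-pass filter."""
    b, a = butter(2, cutoff / (fs / 2))
    return lfilter(b, a, np.abs(x))

def vocode_channels(x, fs, edges, frame_s=0.004):
    """Return one sinusoid per channel, with amplitude set to the rms of the
    channel envelope, updated every 4 ms."""
    n = len(x)
    frame = int(frame_s * fs)
    t = np.arange(n) / fs
    chans = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        env = envelope(bandpass(x, lo, hi, fs), fs)
        amp = np.zeros(n)
        for start in range(0, n, frame):
            seg = env[start:start + frame]
            amp[start:start + frame] = np.sqrt(np.mean(seg ** 2))   # rms per 4-ms frame
        fc = np.sqrt(lo * hi)                 # assumed: geometric-mean center frequency
        chans.append(amp * np.sin(2 * np.pi * fc * t))
    return chans

def dichotic_split(chans, mode="odd_even"):
    """Assign channels to the two ears: interleaved (odd-even) or low-high."""
    if mode == "odd_even":
        left = [c for i, c in enumerate(chans) if i % 2 == 0]       # channels 1, 3, 5, ...
        right = [c for i, c in enumerate(chans) if i % 2 == 1]      # channels 2, 4, 6, ...
    else:                                                           # "low_high"
        half = len(chans) // 2
        left, right = chans[:half], chans[half:]
    return np.sum(left, axis=0), np.sum(right, axis=0)

def normalize_levels(left, right, original):
    """Scale both ear signals by the same factor so their combined level matches
    the rms of the original segment (no interaural level imbalance introduced)."""
    rms = lambda s: np.sqrt(np.mean(s ** 2))
    gain = rms(original) / max(rms(left + right), 1e-12)
    return gain * left, gain * right

# Example use on a noisy speech vector `speech` sampled at FS:
# left, right = normalize_levels(*dichotic_split(vocode_channels(speech, FS, EDGES_6CH)),
#                                original=speech)
```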
4. Procedure

The experiments were performed on a PC equipped with a Creative Labs SoundBlaster sound card. Stimuli were played to the listeners either monaurally or dichotically through Sennheiser HD 250 Linear II circumaural headphones. For the vowel and consonant tests, a graphical user interface enabled the subjects to indicate their response by clicking a button corresponding to the syllable played. For the sentence test, subjects were asked to write down the words they heard. The sentences were scored in terms of percent words identified correctly (all words were scored). At the beginning of each test the subjects were given a practice session in which the speech materials were processed through the same number of channels used in the test and presented monaurally in quiet and in +5 dB speech-shaped noise. During the practice session, the identity of the test syllables (vowels or consonants) and sentences was displayed on the screen. For further practice, vowel and consonant tests were administered with feedback to each subject; three repetitions were used in the feedback session. The practice session lasted approximately 2 h. After the practice and feedback sessions, the subjects were tested in the various dichotic and monaural conditions. The vowels and consonants were completely randomized and presented to the listeners six times. No feedback was provided during the test, and all tests were done with speech embedded in +5 dB speech-shaped noise taken from the HINT database.

Two different dichotic conditions were considered. In the first condition, which we refer to as the low–high dichotic condition, the low-frequency information (consisting of half of the total number of channels) was presented to one ear, and the high-frequency information (consisting of the remaining high-frequency channels) was presented to the other ear. In the second dichotic condition, which we call the odd–even (or interleaved) dichotic condition, the odd-index frequency channels were presented to one ear, while the even-index channels were presented to the other ear. In the monaural condition, the signal was presented to either the left or the right ear (chosen randomly) of the subject. The order in which the conditions and numbers of channels were presented was partially counterbalanced across subjects to avoid order effects. In the vowel and consonant tests, there were six repetitions of each vowel and each consonant. The vowels and consonants were completely randomized. A different set of 20-sentence lists was used for each condition.

Pilot data showed that the 12-channel condition in +5 dB S/N yielded performance close to ceiling. Hence, for the 12-channel condition, we performed additional listening experiments to assess whether subjects were indeed integrating
the information from the two ears, or whether they were receiving sufficient information in each of the two ears alone. For comparison with the odd–even stimuli presented dichotically, two additional conditions were created. In the first condition, the odd-index channels were presented to the left ear alone, and in the second condition, the even-index channels were presented to the right ear alone. Similarly, for comparison with the low–high stimuli presented dichotically, the low-frequency channels (the lower half of the channels) were presented to the left ear alone and the high-frequency channels were presented to the right ear alone. Sentences, vowels, and consonants were processed through the one-ear conditions for the odd–even comparison. Vowels and consonants were also processed through the one-ear conditions for the low–high comparison. No sentences were processed for the low–high comparison because there was an insufficient number of unique sentences in the HINT database.

B. Results
The mean percent correct scores for sentence, vowel, and consonant recognition are shown in Fig. 1 as a function of the number of channels for the different presentation modes.
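For illustration, the sketch below shows how a two-way repeated-measures ANOVA of the kind reported in the following sections (within-subject factors: number of channels and presentation mode) could be set up in Python with statsmodels. The data frame, column names, and scores are hypothetical placeholders; the original analysis was not necessarily run this way, and the post hoc Tukey and Fisher LSD comparisons reported below would be carried out separately.

```python
# Hypothetical sketch of the two-way repeated-measures ANOVA used for the scores
# reported below (within-subject factors: spectral resolution and presentation
# mode). The scores generated here are random placeholders, not the study data.
import itertools
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
subjects = [f"S{i}" for i in range(1, 10)]            # nine listeners
channels = [6, 8, 12]
modes = ["monaural", "odd_even", "low_high"]

# Fully crossed, long-format table of percent-correct scores (one row per cell).
scores = pd.DataFrame(
    [{"subject": s, "channels": c, "mode": m, "score": rng.uniform(40, 95)}
     for s, c, m in itertools.product(subjects, channels, modes)]
)

# Two-way ANOVA with repeated measures on both within-subject factors.
print(AnovaRM(scores, depvar="score", subject="subject",
              within=["channels", "mode"]).fit())
```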
1. Sentences

A two-way ANOVA with repeated measures, using spectral resolution (number of channels) and presentation mode (monaural, dichotic odd–even, and dichotic low–high) as within-subject factors, showed a significant main effect of spectral resolution [F(2,16) = 82.05, p < 0.0005], a significant effect of presentation mode [F(2,16) = 10.57, p = 0.001], and a significant interaction [F(4,32) = 3.36, p = 0.02] between spectral resolution and presentation mode. Post hoc tests according to Tukey (at alpha = 0.05) showed that there was a significant (p < 0.05) difference between the performance obtained with the two dichotic conditions for 12 and 8 channels, but not for 6 channels. There was also a significant difference (p < 0.05) between the performance obtained with dichotic presentation (low–high) and monaural presentation for the 6- and 8-channel conditions, but not for the 12-channel condition.

Figure 2 (top panel) compares the dichotic performance (12 channels) obtained on sentence recognition with the performance obtained when the even channels were presented to the left ear alone and the odd channels were presented to the right ear alone. Post hoc Fisher's LSD tests showed that there was a significant difference (p < 0.05) between the one-ear performance and the dichotic performance on sentence recognition, suggesting that subjects were able to integrate the information from the two ears. That is, the information presented in each ear alone was not sufficient to recognize sentences in +5 dB S/N with high (>90%) accuracy.

2. Vowels
A two-way ANOVA with repeated measures showed a significant main effect of spectral resolution [F(2,16) = 46.05, p < 0.0005], a significant effect of presentation mode [F(2,16) = 4.79, p = 0.023], and a significant interaction [F(4,32) = 7.16, p < 0.0005] between spectral resolution and presentation mode on vowel recognition. Post hoc tests
according to Tukey showed that there was a significant (p < 0.05) difference between each of the dichotic conditions and the monaural condition for eight channels. There was also a significant difference between the two dichotic conditions for six channels.

Figure 2 compares the performance obtained dichotically (for both conditions) with the performance obtained with the left and right ears only. Post hoc Fisher's LSD tests showed that there was a significant difference (p < 0.05) between the one-ear performance and the dichotic performance on vowel recognition, suggesting that subjects were able to integrate the information from the two ears and obtain a vowel score higher than the score obtained with either ear alone.

3. Consonants
A two-way ANOVA with repeated measures showed a significant main effect of spectral resolution [F(2,16) = 46.05, p < 0.0005], a nonsignificant effect of presentation mode [F(2,16) = 4.79, p = 0.34], and a significant interaction [F(4,32) = 7.16, p = 0.002] between spectral resolution and presentation mode on consonant recognition. As shown in Fig. 1, the presentation mode (dichotic versus monaural) had no effect on consonant recognition. Figure 2 compares the consonant performance obtained dichotically against the performance obtained with either the left or right ear alone. Post hoc Fisher's LSD tests showed that there was a significant difference (p < 0.05) between the one-ear performance and the dichotic performance on consonant recognition.

The consonant confusion matrices were also analyzed in terms of percent information transmitted, as per Miller and Nicely (1955). The feature analysis is shown in Fig. 3. A two-way ANOVA performed on each feature separately showed a marginally significant effect of presentation mode for manner (p = 0.04) and voicing (p = 0.047) and a nonsignificant effect for place (p = 0.31). There was a significant interaction for place and voicing (p < 0.005), but not for manner (p = 0.13). There was a significant effect (p < 0.005) of spectral resolution for all three features.
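In the Miller and Nicely (1955) analysis, the percent information transmitted for a feature is the mutual information between the stimulus and response feature categories, normalized by the stimulus feature entropy. The sketch below illustrates that computation; the confusion counts and the voicing grouping are made-up examples, not data from this study.

```python
# Percent information transmitted for a consonant feature, after Miller and
# Nicely (1955): collapse the confusion matrix by feature category, compute the
# mutual information between stimulus and response categories, and normalize by
# the stimulus entropy. Confusion counts and groupings below are illustrative.
import numpy as np

def percent_info_transmitted(confusions, feature):
    """confusions: (n, n) count matrix, rows = stimuli, cols = responses.
    feature: length-n array of category labels (e.g., voiced/voiceless)."""
    cats = np.unique(feature)
    k = len(cats)
    m = np.zeros((k, k))
    for i, ci in enumerate(cats):
        for j, cj in enumerate(cats):
            m[i, j] = confusions[np.ix_(feature == ci, feature == cj)].sum()
    p = m / m.sum()                      # joint probabilities p(x, y)
    px = p.sum(axis=1)                   # stimulus category probabilities
    py = p.sum(axis=0)                   # response category probabilities
    nz = p > 0
    mutual = np.sum(p[nz] * np.log2(p[nz] / np.outer(px, py)[nz]))
    h_stim = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    return 100.0 * mutual / h_stim

# Toy example: 4 consonants, 2 voicing categories (0 = voiceless, 1 = voiced).
voicing = np.array([0, 0, 1, 1])
conf = np.array([[20, 3, 1, 0],
                 [4, 18, 0, 2],
                 [1, 0, 19, 4],
                 [0, 2, 5, 17]])
print(round(percent_info_transmitted(conf, voicing), 1), "% voicing info transmitted")
```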
C. Discussion

The above results indicate that the effect of presentation mode (dichotic versus monaural) differed across speech materials and degree of spectral resolution. The recognition of sentences was affected the most; recognition of vowels was also affected, but to a lesser degree. As is evident from the mean performance on sentences processed through six and eight channels, subjects were not able to fuse sentences presented dichotically in the low–high condition with the same accuracy as when presented monaurally. Subjects were also not able to fuse vowels processed through eight channels and presented dichotically (in either the low–high or odd–even condition) with the same accuracy as when presented monaurally. It should be noted that there was large subject variability in performance (see Fig. 4) for vowels and sentences processed through six channels (see, for instance, subjects S1, S2, and S7 on sentence recognition, and subjects S4, S5, and S7 on vowel recognition). In contrast, subjects were able to fuse vowels and sentences processed through 12 channels very accurately and more consistently. These results suggest that spectral resolution may have a significant effect on spectral fusion, depending on the dichotic presentation mode (low–high versus interleaved). No such effect was found for consonants: consonants were fused accurately with both dichotic presentations regardless of the spectral resolution. The difference in effect on spectral fusion between consonants and the other speech materials suggests that vowels and sentences might be better (more sensitive) speech materials to use in studies of dichotic speech perception. This conclusion must be viewed with caution, taking into account that the performance variability might be partially due to the variability in the speech material used in this study, with the vowel material being more variable (produced by both male and female talkers, with multiple productions of each vowel) and the consonant material being less variable (produced by a single male talker with a single production of each consonant).

FIG. 1. Mean identification of sentences (scored in percent words correct), vowels, and consonants as a function of number of channels and presentation mode (monaural versus dichotic). Error bars indicate standard errors of the mean.
FIG. 2. (Top panel) Mean speech recognition obtained when the odd- and even-index channels were presented dichotically (hatched bars), the odd channels were presented to the left ear only (diagonally filled bars), and the even channels were presented to the right ear only (dark bars). (Bottom panel) Mean vowel and consonant recognition obtained when the low- and high-frequency channels were presented dichotically (hatched bars), the low-frequency channels were presented to the left ear only (diagonally filled bars), and the high-frequency channels were presented to the right ear only (dark bars). Speech materials were processed through 12 channels. Error bars indicate standard errors of the mean.
When interpreting the results on vowel recognition, two confounding factors need to be considered. First, the filter spacing was logarithmic for the 6-channel condition and mel-like for the 8- and 12-channel conditions. We cannot exclude the possibility that a different outcome might have been obtained had a different filter spacing been used, and this warrants further investigation. Second, the 8- and 12-channel conditions had a slightly larger (by about 38–108 Hz) overall bandwidth than the 6-channel condition. Given that the roll-off of the sixth-order filters used was not too sharp, and that the results from our previous study (Loizou et al., 2000a) indicate no significant difference in vowel recognition between different signal bandwidths, we do not believe that the additional bandwidth in the 8- and 12-channel conditions affected the outcome of this study.

No effect on spectral fusion was found in this study in the identification of consonants processed through a small number of channels. This outcome is consistent with the findings in the study reported by Lawson et al. (1999) with bilateral cochlear implant listeners.
Lawson et al. presented consonants in /aCa/ context to two bilateral implant users. The consonants were presented dichotically in the same two conditions used in our study, the even–odd and low–high conditions. The bilateral subjects were fitted with six- and eight-channel processors. Results showed a small advantage of dichotic stimulation; however, the difference was not statistically significant. No experiments were done in Lawson et al. (1999) with other speech materials. Identification of consonants, in general, is known to be robust to extremely low spectral-resolution conditions (Shannon et al., 1995; Dorman et al., 1997) and even to conditions in which large segments of the consonant spectra are missing (Lippmann, 1996; Kasturi et al., 2002). Dichotic identification of consonants does not seem to be an exception.
FIG. 3. Mean identification of the consonant features "manner," "place," and "voicing" as a function of number of channels and presentation mode (monaural versus dichotic). Error bars indicate standard errors of the mean.
The fact that the listeners were not able to fuse with high accuracy sentences processed through six and eight channels and presented in the low–high dichotic condition cannot be easily explained, given that we did not manipulate in this study the relative onset time, the F0, or the relative intensity of the two signals presented to each ear. One possible explanation is that the information presented in each ear was so impoverished, due to the high noise level and low spectral resolution (three or four channels in each ear), that it was not perceived as being speech by the auditory system and hence was not integrated centrally. This is based on the assumption that the ability to fuse a sound presented to one ear with another sound presented to the other ear must depend on recognizing that the two components come from the same speech utterance (and probably the same talker). In contrast, in the 12-channel condition the subjects had access to 6 channels of information in each ear, and we know from Fig. 2 that moderate levels of performance can be achieved in +5 dB S/N with only 6 channels presented to each ear. A similar outcome was also reported by Dorman et al. (2001) with HINT sentences presented dichotically in quiet, with the even channels fed to one ear and the odd channels fed to the other ear. Sentences were processed through six channels in a manner similar to this study. Sentence scores obtained monaurally
were significantly higher than the scores obtained dichotically.

In addition to observing a significant difference between the dichotic and the monaural conditions, a significant difference was also observed between the two dichotic conditions. The way spectral information was split and presented to the two ears was important for vowel and sentence recognition, but not for consonant recognition. For sentences processed through 8 and 12 channels, the mean scores obtained with the interleaved (odd–even) condition were significantly higher than the scores obtained with the low–high condition. For vowels processed through six channels, the scores obtained with the low–high condition were significantly higher than the scores obtained with the interleaved condition. The higher scores obtained with the low–high condition in vowel recognition may be attributed to the fact that in this condition F1 information (low-frequency information) was presented to one ear and F2 information (higher-frequency information) was presented to the other ear. This condition must be easier to deal with compared to the more challenging condition (interleaved) in which pieces of F1 and F2 information are distributed between the two ears. Subjects did not have difficulty piecing together the F1/F2 information, however, when the vowels were processed through 8 and 12 channels. The above explanation cannot be easily extended to sentences, because the situation with sentences differs from that of vowels, in that listeners rely on other cues, besides F1/F2 information, for word recognition.
FIG. 4. Individual subject performance for speech materials processed through six channels.
One possible explanation for the higher scores obtained in sentence recognition with the interleaved condition is that it provides greater dichotic release from masking [a phenomenon first reported by Rand (1974)] compared to the low–high condition. Rand (1974) presented the F1 of CV syllables to one ear, and the F2, attenuated by 40 dB, to the other ear, and observed that subjects were able to identify the consonants accurately despite the large relative intensity differences between the two formants. However, when he attenuated the upper formants by 30 dB and presented the stimuli to both ears (i.e., diotically), the listeners were unable to identify the consonants accurately. He attributed the advantage of dichotic presentation to release from spectral masking. Others (e.g., Lunner et al., 1993; Pandey et al., 2001; Lyregaard, 1982; Franklin, 1975) have attempted to exploit the dichotic release from masking in bilateral hearing aids and have advocated the use of dichotic presentation as a means of compensating for the poor frequency selectivity of listeners with sensorineural hearing loss. Lunner et al. (1993) fitted three hearing-impaired listeners with eight-channel hearing aids and presented sentences dichotically, with the odd-number channels fed to one ear and the even-number channels fed to the other ear. An improvement of 2 dB in speech reception threshold (SRT) was found compared to the condition in which the sentences were presented binaurally. Franklin (1975) investigated the effect of presenting a low-frequency band (240–480 Hz) and a high-frequency band (1020–2040 Hz) on consonant recognition in six hearing-impaired listeners with moderate to severe sensorineural hearing loss. The scores obtained when the low- and high-frequency bands were presented to opposite ears were significantly higher than the scores obtained when the two bands were presented to the same ear. Unlike the above two studies, a few other studies (e.g., Lyregaard, 1982; Turek et al., 1980) found no benefit of dichotic presentation with hearing-impaired listeners tested on synthetic /b d g/ identification (Turek et al., 1980) or vowel/consonant identification (Lyregaard, 1982).
In the present study, speech was synthesized as a sum of a small number (6–12) of sine waves.
Sinewave speech was advocated in our earlier work (Dorman et al., 1997) as an alternative [to noise-band simulations (Shannon et al., 1995)] acoustic model for cochlear implants. This acoustic model is not meant by any means to mimic the percept elicited by electrical stimulation, but rather to serve as a tool for assessing CI listeners' performance in the absence of the confounding factors (e.g., neuron survival, electrode insertion depth, etc.) associated with cochlear implants. We previously demonstrated on tests of reduced intensity resolution (Loizou et al., 2000b), reduced spectral contrast (Loizou and Poroy, 2001), and number of channels of stimulation (Dorman and Loizou, 1998) that the performance of normal-hearing subjects, when listening to speech processed through the sinewave acoustic model, is similar to that of, at least, the better-performing patients with cochlear implants. In that context, the results of the present study might provide valuable insights into bilateral cochlear implants. The performance obtained with the interleaved dichotic condition was not significantly different from the performance obtained monaurally when the speech materials were processed through 12 channels. For bilateral patients receiving a large number of channels (>12), dichotic stimulation might therefore provide an advantage over unilateral stimulation, in that it can potentially reduce possible channel interactions, since the electrodes can be stimulated in an interleaved fashion across the ears without sacrificing performance. For patients receiving only a small number of channels of information, dichotic (electric) stimulation might not produce the same level of performance as unilateral (monaural) stimulation, at least for sentences presented in the low–high dichotic condition. The large variability in performance among subjects should be noted, however (Fig. 4). Lastly, of the two methods that can be used to present spectral information dichotically, the interleaved method is recommended since it consistently outperformed the low–high method in sentence recognition.

III. SUMMARY AND CONCLUSIONS
This study investigated the hypothesis that spectral resolution might be a factor, in addition to relative onset time, influencing spectral fusion when the spectral information is split and presented dichotically to the two ears. Two different methods of splitting the spectral information were investigated. In the first method, which we refer to as the odd–even method, the odd-index channels were presented to one ear and the even-index channels to the other ear. In the second method, which we refer to as the low–high method, the lower-frequency channels were presented to one ear and the higher-frequency channels to the other ear. The results of the present study suggest the following.
(i) Spectral resolution did affect spectral fusion, and the effect differed across speech materials and the two dichotic presentation modes considered. The recognition of sentences presented in the low–high dichotic condition was affected the most. Sentences processed through six and eight channels and presented dichotically in the low–high condition were not fused as accurately as when presented monaurally. Vowels processed through eight channels and presented dichotically (in either the interleaved or low–high condition) were not fused as accurately as when presented monaurally. In contrast, vowels and sentences processed through 12 channels were fused very accurately.

(ii) Dichotically presented consonants were fused as well as monaurally presented consonants, independent of spectral resolution (number of channels) and the way spectral information was split and presented to the two ears. That is, the performance obtained monaurally was the same as the performance obtained dichotically in all conditions. The difference in outcomes between the consonants and the other speech materials suggests that perhaps the vowel and sentence materials are better (more sensitive) materials to be used in studies of dichotic speech recognition.

(iii) A significant difference in performance was found in vowel and sentence recognition between the two methods used to split the spectral information for dichotic presentation. For sentence recognition, the scores obtained with the odd–even (interleaved) dichotic method were significantly higher than the scores obtained with the low–high dichotic method. One possible explanation for this difference is that the odd–even method provides a greater dichotic release from masking compared to the low–high method.
ACKNOWLEDGMENTS
The authors would like to thank Dr. Richard Tyler and Dr. Ken Grant for providing valuable comments on this paper. This research was supported by Grant No. R01 DC03421 from the National Institute on Deafness and Other Communication Disorders, NIH. This project was the basis for the Master's thesis of the second author (AM) in the Department of Electrical Engineering at the University of Texas at Dallas.

Broadbent, D. E., and Ladefoged, P. (1957). "On the fusion of sounds reaching different sense organs," J. Acoust. Soc. Am. 29, 708-710.
Cherry, E. C. (1953). "Some experiments on the recognition of speech, with one and with two ears," J. Acoust. Soc. Am. 25, 975-979.
Cutting, J. E. (1976). "Auditory and linguistic process in speech perception: Inferences from six fusions in dichotic listening," Psychol. Rev. 83, 114-140.
Darwin, C. J. (1981). "Perceptual grouping of speech components differing in fundamental frequency and onset-time," Q. J. Exp. Psychol. A 33A, 185-207.
Dorman, M., and Loizou, P. (1998). "The identification of consonants and vowels by cochlear implant patients using a 6-channel CIS processor and by normal hearing listeners using simulations of processors with two to nine channels," Ear Hear. 19, 162-166.
Dorman, M., Loizou, P., and Rainey, D. (1997). "Speech intelligibility as a function of the number of channels of stimulation for signal processors using sine-wave and noise-band outputs," J. Acoust. Soc. Am. 102, 2403-2411.
Dorman, M., Loizou, P., Spahr, A. J., Maloff, E. S., and Wie, S. V. (2001). "Speech understanding with dichotic presentation of channels: Results from acoustic models of bilateral cochlear implants," 2001 Conference on Implantable Auditory Prostheses, Asilomar, Monterey, CA.
Franklin, B. (1975). "The effect of combining low and high-frequency bands on consonant recognition in the hearing impaired," J. Speech Hear. Res. 18, 719-727.
Halwes, T. G. (1970). "Effects of dichotic fusion on the perception of speech," Doctoral dissertation, University of Minnesota, Dissertation Abstracts International, 31, p. 1565B (University Microfilms No. 70-15, 736).
Hillenbrand, J., Getty, L., Clark, M., and Wheeler, K. (1995). "Acoustic characteristics of American English vowels," J. Acoust. Soc. Am. 97, 3099-3111.
Kasturi, K., Loizou, P., and Dorman, M. (2002). "The intelligibility of speech with holes in the spectrum," J. Acoust. Soc. Am. 112, 1102-1111.
Lawson, D., Wilson, B., Zerbi, M., and Finley, C. (1999). "Speech processors for auditory prostheses," Fourth Quarterly Progress Report, Center for Auditory Prosthesis Research, Research Triangle Institute.
Lippmann, R. P. (1996). "Accurate consonant perception without mid-frequency speech energy," IEEE Trans. Speech Audio Process. 4(1), 66-69.
Loizou, P., and Poroy, O. (2001). "Minimum spectral contrast needed for vowel identification by normal-hearing and cochlear implant listeners," J. Acoust. Soc. Am. 110, 1619-1627.
Loizou, P., Dorman, M., and Powell, V. (1998). "The recognition of vowels produced by men, women, boys and girls by cochlear implant patients using a six-channel CIS processor," J. Acoust. Soc. Am. 103, 1141-1149.
Loizou, P., Dorman, M., and Tu, Z. (1999). "On the number of channels needed to understand speech," J. Acoust. Soc. Am. 106, 2097-2103.
Loizou, P., Poroy, O., and Dorman, M. (2000a). "The effect of parametric variations of cochlear implant processors on speech understanding," J. Acoust. Soc. Am. 108, 790-802.
Loizou, P., Dorman, M., Poroy, O., and Spahr, T. (2000b). "Speech recognition by normal-hearing and cochlear implant listeners as a function of intensity resolution," J. Acoust. Soc. Am. 108, 2377-2388.
Lunner, T., Arlinger, S., and Hellgren, J. (1993). "8-channel digital filter bank for hearing aid use: Preliminary results in monaural, diotic and dichotic modes," Scand. Audiol. Suppl. 38, 75-81.
Lyregaard, P. E. (1982). "Frequency selectivity and speech intelligibility in noise," Scand. Audiol. Suppl. 15, 113-122.
Miller, G. A., and Nicely, P. E. (1955). "An analysis of perceptual confusions among some English consonants," J. Acoust. Soc. Am. 27, 338-352.
Nilsson, M., Soli, S., and Sullivan, J. (1994). "Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise," J. Acoust. Soc. Am. 95, 1085-1099.
Pandey, P. C., Jangamashetti, D. S., and Cheeran, A. N. (2001). "Binaural dichotic presentation to reduce the effect of temporal and spectral masking in sensorineural hearing impairment," 142nd Meeting of the Acoustical Society of America, Ft. Lauderdale, FL.
Rand, T. C. (1974). "Dichotic release from masking for speech," J. Acoust. Soc. Am. 55, 678-680.
Shannon, R., Zeng, F.-G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). "Speech recognition with primarily temporal cues," Science 270, 303-304.
Turek, S., Dorman, M., Franks, J., and Summerfield, Q. (1980). "Identification of synthetic /b d g/ by hearing-impaired listeners under monotic and dichotic formant presentation," J. Acoust. Soc. Am. 67, 1031-1040.
Tyler, R., Preece, J., and Lowder, M. (1987). "The Iowa audiovisual speech perception laser videodisc," Laser Videodisc and Laboratory Report, Dept. of Otolaryngology, Head and Neck Surgery, University of Iowa Hospital and Clinics, Iowa City.