Scand Audiol 1993; Suppl. 38: 145-153
Speech Discrimination via Cochlear Implants with Two Different Digital Speech Processing Strategies: Preliminary Results for 7 Patients

Norbert DILLIER1, Hans BÖGLI2 and Thomas SPILLMANN3

From the 1Department of Otorhinolaryngology, University Hospital, Zürich, 2Institute for Biomedical Engineering, University of Zürich, and 3Swiss Federal Institute of Technology, Zürich, Switzerland
ABSTRACT The following processing strategies have been implemented on an experimental laboratory system of a cochlear implant digital speech processor (CIDSP) for the Nucleus 22-channel cochlear prosthesis. The first approach (PES, Pitch Excited Sampler) is based on the classical channel vocoder concept whereby the time-averaged spectral energy of a number of logarithmically spaced frequency bands is transformed into appropriate electrical stimulation parameters for up to 22 electrodes. The pulse rate at any electrode is controlled by the voice pitch of the input speech signal. The pitch extraction algorithm calculates the autocorrelation function of a lowpass-filtered segment of the speech signal and searches for a peak within a specified time window. A random pulse rate of about 150 to 250 Hz is used for unvoiced speech portions. The second approach (CIS, Continuous Interleaved Sampler) uses a stimulation pulse rate which is independent of the input signal. The algorithm continuously scans all specified frequency bands (typically between 4 and 22) and samples their energy levels. Evaluation experiments with 7 experienced cochlear implant users showed significantly better performance in consonant identification tests with the new processing strategies than with the subjects' own wearable speech processors, whereas improvements in vowel identification tasks were rarely observed. Modifications of the basic PES- and CIS-strategies resulted in large variations of identification scores. Information transmission analysis of confusion matrices revealed a rather complex pattern across conditions and speech features. No final conclusions can yet be drawn. Optimization and fine-tuning of processing parameters for these coding strategies require more data both from speech identification and discrimination as well as psychophysical experiments. Key words: Auditory prosthesis, digital signal processing, cochlear implants.
INTRODUCTION Cochlear implants have been quite successful in recent years in providing partial restoration of auditory sensations and limited speech recognition for profoundly deaf subjects despite severe technological and electrophysiological constraints imposed by the anatomical and physiological conditions of the human auditory system. Electrical stimulation via implanted electrodes allows only an extremely limited approximation of normal neural excitation patterns in the auditory nerve. Signal processing for cochlear implants, therefore, is confronted with the problem of a severely restricted channel capacity and the necessity to select and encode a subset of the information contained in the sound signal reaching the listener's ear. Several processing strategies have been designed and evaluated in the past varying the number of electrodes and the amount of specific speech feature extraction and mapping transformations (Clark et al., 1990). Recently, Wilson and coworkers (1991) reported astonishing improvements in speech performance when they provided their subjects with high-rate pulsatile stimulation patterns rather than analog broadband signals. They attributed this effect partly to the decreased current summation obtained by non-simultaneous stimulation of different electrodes (which might otherwise have partly stimulated the same nerve fibres and thus interacted in a nonlinear fashion) and partly to a
fundamentally different and possibly more natural firing pattern due to the high stimulation rate. Skinner et al. (1991) also found significantly higher scores on word and sentence tests in quiet and noise with a new multipeak speech coding strategy as compared to the formerly used F0F1F2-strategy of the Nucleus-WSP (wearable speech processor). These results indicate the potential gains which may be obtained by optimizing signal processing schemes for existing implanted devices. With single-chip digital signal processors (DSPs) different speech coding strategies can be evaluated in relatively short laboratory experiments. In addition to the well-known strategies realized with filters, amplifiers and logic circuits, a DSP approach allows the implementation of much more complex algorithms such as nonlinear multiband loudness correction, speech feature contrast enhancement, adaptive noise reduction and many more. It can also be anticipated that progress in electronics will allow further miniaturization and low-power operation of these processors in the near future. The present study was conducted in order to explore new ideas and concepts of multichannel pulsatile speech encoding for users of the Clark/Nucleus cochlear prosthesis.
MATERIALS AND METHODS As previously described (Dillier et al., 1990) a cochlear implant digital speech processor (CIDSP) for the Nucleus 22-channel cochlear prosthesis was implemented using a single-chip digital signal processor (TMS320C25, Texas Instruments). For laboratory experiments the CIDSP was incorporated in a general purpose computer (PDP11/73) which provided interactive parameter control, graphical display of input/output and intermediate buffers, and offline speech file processing facilities. In addition to the generation of stimulus parameters for the cochlear implant, an acoustic signal based on a perceptive model of auditory nerve stimulation was generated simultaneously. For field studies and as a take-home device for patients a wearable battery-operated unit has been built. Advantages of a DSP implementation of speech encoding algorithms as opposed to offline prepared test lists are increased flexibility; controlled, reproducible and fast modification of processing parameters; and the use of running speech for training and familiarization. Disadvantages are the more complex programming and numerical problems with integer arithmetic. Speech signals were processed as follows: after analog low-pass filtering at 5 kHz and analog-to-digital conversion (sampling rate 10 kHz), preemphasis and Hanning windowing (12.8 ms, shifted by 6.4 ms or less per analysis cycle), the power spectrum was calculated via fast Fourier transform (FFT); speech features such as formants and voice pitch were extracted and transformed according to the
Fig. 1. The amplitudes and frequencies of up to 6 spectral peaks (formants) are detected.

Fig. 2. Stimulation pulses are generated at electrodes corresponding to peak frequencies for every pitch period (Tp).
selected encoding strategy; finally, the stimulus parameters (electrode position, stimulation mode, pulse amplitude and duration) were generated and transmitted via inductive coupling to the implanted receiver. Two processing strategies were implemented on this system: The first approach (PES, Pitch Excited Sampler) is based on the classical channel vocoder concept whereby the time-averaged spectral energy of a number of logarithmically spaced frequency bands is transformed into appropriate electrical stimulation parameters for up to 22 electrodes (Fig. 1). The pulse rate at any electrode is controlled by the voice pitch of the input speech signal (Fig. 2). The pitch extraction algorithm calculates the autocorrelation function of a lowpass-filtered segment of the speech signal and searches for a peak within a specified time window. A random pulse rate of about 150 to 250 Hz is used for unvoiced speech portions. The second approach (CIS, Continuous Interleaved Sampler) uses a stimulation pulse rate which is independent of the input signal. The algorithm continuously scans all frequency bands and samples their energy levels (Fig. 3). As only one electrode can be stimulated at any instant (Fig. 4), the rate of stimulation is limited by the required stimulus pulse widths (as determined individually for each subject) and some additional constraints and parameters. This is illustrated in Fig. 5 where maximally achievable pulse rates for various stimulation parameters are displayed. As the information about the electrode number, the stimulation mode, the pulse amplitude and width is encoded by high-frequency (2.5 MHz) bursts of different durations, the total transmission time for a specific stimulus depends on all of these parameters. As Fig. 5 indicates, this transmission time can be minimized by choosing the shortest possible pulse width combined with the maximal amplitude.
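A pitch-excited sampler of this kind can be sketched in a few lines. The 128-sample frame matches the 12.8 ms analysis window at the 10 kHz sampling rate given above; the 80-400 Hz search window and the 0.3 voicing threshold are illustrative assumptions, not parameters reported for the CIDSP:

```python
import numpy as np

FS = 10_000  # CIDSP sampling rate (Hz)

def detect_pitch(frame, fmin=80.0, fmax=400.0, voicing_threshold=0.3):
    """Autocorrelation pitch detector in the spirit of the PES strategy:
    look for an autocorrelation peak within the lag window corresponding
    to plausible voice pitch; a weak peak means the frame is unvoiced."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return None                       # silent frame -> unvoiced
    ac = ac / ac[0]                       # normalize so ac[0] == 1
    lo, hi = int(FS / fmax), int(FS / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))  # strongest peak in search window
    if ac[lag] < voicing_threshold:
        return None                       # unvoiced
    return FS / lag                       # pitch estimate (Hz)

def stimulation_rate(pitch_hz, rng):
    """Voiced frames drive the electrodes at the pitch rate; unvoiced
    frames use a random rate of about 150-250 Hz, as in PES."""
    return pitch_hz if pitch_hz is not None else rng.uniform(150.0, 250.0)
```

For a 200 Hz vowel-like frame the detector returns an estimate near 200 Hz, and the electrodes corresponding to the spectral peaks would each receive one pulse per pitch period.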
For very short pulse durations the overhead imposed by the transmission of the fixed stimulus parameters can become rather large. Consider, for example, the stimulation of an electrode pair at 50 µs. The maximally achievable rate varies from about 3600 Hz for high amplitudes to about 2700 Hz for low amplitudes, whereas the theoretical limit would be close to 10000 Hz (biphasic pulses with minimal interpulse interval). In cases with higher pulse width requirements (which may be due to poor nerve survival, unfavourable electrode position or other unknown factors) the overhead becomes smaller. The temporal sequence of stimulated electrodes is probably an important parameter for high-rate stimulation algorithms. A number of different rules have therefore been implemented which specify minimal temporal and spatial distances. Other modifications and extensions of the basic PES- and CIS-strategies include enhancement of speech feature contrasts such as peak-to-valley relations for vowel formants and spectral gravity in high-frequency consonants.
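The rate ceiling described above follows from simple bookkeeping: each biphasic stimulus occupies two phase durations plus the time needed to transmit its parameters as 2.5 MHz bursts. A back-of-envelope sketch, where the per-stimulus overhead value is an assumed figure chosen to be consistent with the rates quoted above, not a documented constant of the Nucleus transmission protocol:

```python
def max_pulse_rate_hz(phase_width_us, overhead_us):
    """Upper bound on stimulation rate for non-overlapping biphasic
    pulses: two phases of phase_width_us each, plus a fixed per-stimulus
    transmission overhead (electrode number, mode, amplitude and width
    must be encoded before the pulse can be delivered)."""
    period_us = 2.0 * phase_width_us + overhead_us
    return 1e6 / period_us

# With zero overhead, 50 us phases allow the ~10 kHz theoretical limit;
# an assumed ~180 us overhead gives roughly the 3600 Hz quoted for
# high-amplitude stimulation of one electrode pair.
```

Longer pulse widths (the 100-200 µs needed for the bipolar-mode subjects) dominate the period, which is why the relative overhead shrinks in those cases.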
Fig. 3. The spectral energies of all specified frequency bands are mapped to stimulus amplitudes (CIS-NA, power spectrum of /a/).

Fig. 4. Stimulation pulses are continuously generated for all selected electrodes at maximum spatial distance (stimulation period Ts).
Fig. 5. Examples of maximally achievable pulse rates (per electrode pair) for the continuous interleaved sampler: phase durations of 50, 100 and 200 µs at high and low amplitudes; electrode pairs (21,22), (1,2) and (10,11); bipolar mode, 16 µs inter-burst interval. See text for details.

Fig. 6. Variations of the CIS-strategy: narrow-band analysis mapped either to N peak channels (CIS-NP) or to all channels above NCL (CIS-NA, fixed tonotopy); wide-band analysis mapped either with fixed tonotopy (CIS-WF) or with variable tonotopy (CIS-WV).
In order to achieve maximally high stimulation rates for those portions of the speech signal which are assumed to be most important for intelligibility, several modifications of the basic CIS-strategy were designed as indicated in Fig. 6. Analysis of the short-term spectra was performed either for a large number of narrow frequency bands (corresponding directly to the number of available electrodes) or for a small number (typically 6) of wide frequency bands, analogous to the approach suggested by Wilson et al. (1991). The frequency bands were logarithmically spaced from 200 to 5000 Hz. The mapping of spectral energy within any of these frequency bands to stimulus amplitude at a selected electrode was done according to several rules. From the narrow-band spectra either a preselected number of peaks (typically again 6) or all channels whose values exceeded a noise threshold level were used. The first scheme (CIS-NP) relies on spectral feature extraction and thus closely resembles the PES-strategy (see Fig. 1), whereas the second (CIS-NA, see Figs. 3 and 4) is unique. Two variations of the wide-band analysis scheme were implemented, which mapped the spectral energy of each band either to the same preselected electrode (fixed tonotopical mapping, CIS-WF) or to those electrodes within the wide bands which corresponded to local peaks (CIS-WV). The CIS-WF scheme was intended to minimize electrode interactions by preserving maximal distances between stimulated electrodes while the CIS-WV would make optimal use of tonotopical frequency selectivity. In both the PES- and the CIS-strategies a high-frequency preemphasis was applied whenever a spectral gravity measure exceeded a preset threshold.
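The four channel-selection rules can be summarized in one sketch. Function and parameter names are illustrative, not from the original CIDSP code; `band_energies` stands for one short-term spectrum sampled at the narrow analysis bands (one per electrode), and a wide band is modelled as a slice of adjacent narrow bands:

```python
import numpy as np

def select_channels(band_energies, variant, n_peaks=6, noise_floor=0.01,
                    wide_band_edges=None):
    """Channel selection for the four CIS variants described in the text.
    Returns the electrode indices to stimulate in the current cycle."""
    e = np.asarray(band_energies, dtype=float)
    if variant == "CIS-NP":            # narrow bands, N spectral peaks
        idx = np.argsort(e)[-n_peaks:]
        return sorted(int(i) for i in idx if e[i] > noise_floor)
    if variant == "CIS-NA":            # narrow bands, all above noise level
        return [int(i) for i in np.flatnonzero(e > noise_floor)]
    if variant in ("CIS-WF", "CIS-WV"):  # wide bands (typically 6)
        chans = []
        for lo, hi in wide_band_edges:
            seg = e[lo:hi]
            if seg.max() <= noise_floor:
                continue
            if variant == "CIS-WF":    # fixed tonotopy: centre electrode
                chans.append((lo + hi) // 2)
            else:                      # variable tonotopy: local peak
                chans.append(lo + int(np.argmax(seg)))
        return chans
    raise ValueError(variant)
```

CIS-WF always returns the same (centre) electrode per wide band, preserving maximal spatial separation, while CIS-WV follows the local spectral peak within each band.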
RESULTS Evaluation experiments have so far been conducted with 7 cochlear implant users. All subjects were experienced users of their speech processors (time since implantation ranged from 5 months (KW) to nearly 10 years (UT)) with minor open speech discrimination in monosyllabic word tests (scores from 5 to 25%) and limited use of the telephone. One subject (UT) still used the old wearable speech processor (WSP) which extracts only the first and second formant and thus stimulates only two electrodes per pitch period. The other 6 subjects used the new miniature speech processor (MSP) with the so-called multipeak strategy whereby, in addition to first and second formant information, three fixed electrodes may be stimulated to convey information contained in three higher frequency bands. The measurement procedure to determine thresholds of hearing (T-levels) and comfortable listening (C-levels) for fitting the WSP or MSP was also used for the CIDSP-strategies. As most subjects used fixed amplitudes and varying pulse widths (so-called stimulus levels) with their MSPs and the CIDSP-algorithms required fixed pulse widths and varying amplitudes, all T- and C-levels were remeasured prior to speech tests. Overall loudness of processed signals was adjusted by proportional factors (T- and C-modifiers) if necessary following short listening sessions with ongoing speech and environmental sounds played from a tape recorder. Only minimal familiarization with the new processing strategies was possible due to time restrictions. After about 5 to 10 minutes of listening to ongoing speech one or two blocks of a 20-item 2-digit numbers test with feedback was performed. There was no feedback given during the actual test trials. All test items were presented by a second computer which also recorded the subject's responses entered via touch screen (for multiple choice tests) or keyboard (numbers tests and monosyllable word tests). Speech signals were either presented via loudspeaker in a sound treated room (when patients were tested with their wearable speech processors) or processed by the CIDSP in real time and fed directly to the transmitting coil at the subject's head. Different speakers were used for the ongoing speech, the numbers test and the actual speech tests, respectively. The first two subjects tested with the CIS-strategy (HM and KK) were selected because of their low stimulation thresholds which were achieved by monopolar stimulation of the intrascalar electrodes against an external reference electrode. The pulse widths for subjects HM and KK were typically fixed at 20 µs whereas pulse widths for the other subjects (stimulated with bipolar mode) had to be set from 100 to 200 µs. Results for a 4AFC 100-item 2-syllable consonant rhyme test are shown in Figs. 7 and 8.
Total scores (first group of bars) have been corrected for chance level whereas the other values represent percent information transmitted according to the method described by Miller and Nicely (1955) for the phonological features voicing, sonorance, frication and place of articulation. Both subjects showed rather similar performance in these tests and improved their scores by more than 20% with both CIDSP-strategies compared to their own speech processor. Feature analysis showed that most of the improvement was obtained by better transmission of sonorant (voicing) information whereas recognition of fricatives was not significantly changed. There did not seem to be a difference between the PES- and CIS-strategy for these subjects. Note, however, that only the first version (CIS-NP) of the CIS-strategies described above could be tested. Apparently, the lack of explicit voice pitch information did not result in lower voicing scores although the subjects' qualitative impressions favoured the PES-strategy. After these promising initial results a series of systematic investigations with five additional subjects was started which is still ongoing. Complete data of subject UT with all four CIS-variations are shown in Fig. 9. Information transmission analysis of the confusion matrices (12 consonants in /aCa/ format, at least 144 trials per condition) revealed a rather complex pattern across conditions and speech features. As for HM and KK the scores with the subject's own wearable speech processor were
Fig. 7. Two-syllable words, medial consonants minimal pair test with subject HM: percent transmitted information (overall, voicing, sonorance, frication, place) for the subject's own processor, PES and CIS-NP.

Fig. 8. Two-syllable words, medial consonants minimal pair test with subject KK.
Fig. 9. Comparison of six different speech encoding strategies (WSP, PES, CIS-NP, CIS-NA, CIS-WF, CIS-WV) with subject UT: percent transmitted information.
significantly lower than with all tested CIDSP-strategies. The total scores for PES, CIS-NP and CIS-WV were about equal whereas the scores for CIS-NA and CIS-WF were again higher by 20 to 30%. Inspection of the phonological feature transmission percentages either in Fig. 9 or in the more detailed sequential analysis (Wang and Bilger, 1973) of Table I indicates that this improvement is mostly caused by better discrimination of fricative and sibilant information and partly by an increase in place and voicing information. It is interesting to see that again voicing information transmission is greatly improved although the CIS-strategies do not explicitly encode the voice pitch while the wearable speech processor and the PES-strategy do. The low frication scores with the WSP may be due to this particular processor with the selected F0F1F2-strategy and might have been higher with the new MSP (further experiments at a later date have indeed shown an increase in the frication scores for this subject after she had switched to the new processor).

Table I. Sequential information transmission analysis (SINFA) of consonant confusion matrices (subject UT). Values are percent transmitted information.

              WSP    PES    CIS-NP   CIS-NA   CIS-WF   CIS-WV
Rel. trans.   41.3   65.3   67.2     80.9     80.6     67.1
Total         68.7   92.1   80.7     96.9     95.1     86.7

Per-feature values (voicing, nasality, sonorance, sibilance, frication, place, manner), by condition: WSP 10.6, 4.3, 40.6, 11.5; PES 18.5, 4.9, 38.1, 5.4, 14.1; CIS-NP 10.1, 32.6, 13.3; CIS-NA 17.4, 24.0, 13.8; CIS-WF 31.7, 12.5, 11.9, 26.7; CIS-WV 17.2, 2.2, 31.8, 12.5; 39.2, 31.4, 16.8, 27.2.
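The transmission percentages in Table I and Figs. 7-9 derive from the mutual information of stimulus/response confusion matrices (Miller and Nicely, 1955). A minimal restatement in present-day notation (the feature analyses apply the same formula after grouping the consonants by phonological feature):

```python
import numpy as np

def transmitted_information(confusions):
    """Mutual information T(x;y) in bits of a confusion matrix whose
    rows are presented stimuli and columns are responses."""
    p = np.asarray(confusions, dtype=float)
    p = p / p.sum()
    px = p.sum(axis=1, keepdims=True)    # stimulus probabilities
    py = p.sum(axis=0, keepdims=True)    # response probabilities
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p * np.log2(p / (px * py))
    return float(np.nansum(terms))       # 0*log(0) cells drop out

def relative_transmission(confusions):
    """Transmitted information as a percentage of the stimulus entropy,
    comparable to the 'Rel. trans.' row of Table I."""
    p = np.asarray(confusions, dtype=float)
    px = p.sum(axis=1) / p.sum()
    h_x = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    return 100.0 * transmitted_information(confusions) / h_x
```

A perfectly identified 12-consonant set gives log2(12) ≈ 3.58 bits transmitted and 100% relative transmission; a flat confusion matrix gives 0.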
Fig. 10. Total scores (chance level corrected, %) for four subjects (UT, TH, KW, HS) and three processing schemes (MSP, PES, CIS-NA; 12-consonant test).

Fig. 11. Total scores (chance level corrected, %) for four subjects and three processing schemes (8-vowel test).
Figs. 10 and 11 summarize the results for 4 subjects who had completed a full set of consonant and vowel tests with the same three processing conditions. It can be seen in Fig. 10 that three of these subjects (UT, TH, HS) performed significantly better in consonant identification tests with the new CIS-NA strategy than with their own wearable speech processors. Scores for the PES-processing were about equal to the MSP-scores for HS and between the MSP- and CIS-scores for UT and TH. No improvement with either PES or CIS-NA was noted for subject KW. SINFA for the pooled consonant confusion matrices of these subjects revealed again that information about sibilance but also place of articulation was probably the main cause for better overall scores. Vowel identification scores, on the other hand, were rarely improved by modifications of the signal processing strategy. As Fig. 11 indicates, only one of the four subjects (HS) showed significant improvement in total scores whereas another subject (TH) had markedly deteriorated performance with both the PES- and CIS-strategies. One potential area for further optimization of digital speech processing strategies could be the fine-tuning of input signal energy to stimulation amplitude mapping parameters. In order to investigate the influence of this function a simple piecewise linear mapping function was used in a pilot experiment with two subjects. As schematically shown in Fig. 12 this function depends on only one parameter (the kneepoint factor KP which specifies the percentage of the output dynamic range reached at half the input range).
Fig. 12. Mapping of input signal energy (log scale) to stimulation amplitude between T- and C-level by piecewise linear functions with kneepoint factors of 30%, 50% and 70%.

Fig. 13. 12-consonant test scores (% correct, PES-strategy) for subjects TH and SA with three different kneepoint factors.
A low KP expands the loud sounds whereas a high KP compresses loud sounds and emphasizes soft sounds. The results for subject SA indicate that a higher KP would be preferable whereas TH's results are ambiguous (Fig. 13). Optimization and fine-tuning of processing parameters for these mapping functions therefore requires more data from psychophysical measurements such as loudness growth estimations for individual electrodes.
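The kneepoint mapping of Fig. 12 reduces to one linear branch per half of the input range. A sketch under the stated definition of KP (fraction of the output dynamic range reached at half the input range); the normalization of the log-scaled input to [0, 1] is an assumption about the CIDSP implementation:

```python
def map_amplitude(x, t_level, c_level, kp=0.5):
    """Piecewise linear map from normalized log-input x in [0, 1] to the
    electrical dynamic range [t_level, c_level]. kp is the kneepoint
    factor: the output fraction reached at x = 0.5 (0.3, 0.5 and 0.7
    correspond to the 30/50/70% curves of Fig. 12)."""
    x = min(max(x, 0.0), 1.0)
    if x <= 0.5:
        frac = 2.0 * x * kp                       # lower segment: 0 -> kp
    else:
        frac = kp + 2.0 * (x - 0.5) * (1.0 - kp)  # upper segment: kp -> 1
    return t_level + frac * (c_level - t_level)
```

With kp = 0.3 the upper half of the input range is spread over 70% of the output range (loud sounds expanded); with kp = 0.7 it is squeezed into the top 30% (loud sounds compressed, soft sounds emphasized).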
DISCUSSION AND CONCLUSIONS The above speech test results should be interpreted with caution at this time. The number of subjects is still very small and data collection has not yet been completed for all of them in every processing condition. The experiments with HM and KK (using monopolar stimulation and very small pulse widths) were done at an early stage of algorithm development and will be continued in the near future. It is hoped that other variations of the CIS-strategy will lead to even larger improvements in recognition scores for these optimal cases. Inspection of the SINFA-results for UT seems to indicate a strong preference for a non-feature-extraction approach such as maximal pulse rates on either all electrodes (narrow-band analysis with wide spatial separation of sequentially stimulated electrodes) or a limited fixed set of electrodes (wide-band analysis with preservation of fine temporal envelope information). These findings need to be confirmed with more subjects. There are still many variables whose influence has not yet been elucidated. There may be other options still to be explored for increasing pulse rates even in subjects with higher thresholds. The relationship between maximal stimulation rate and speech recognition remains to be further investigated. Whether the auditory nerve is reacting in a fundamentally different manner at these continuous high excitation rates and thus would generate a more natural auditory percept could be a very interesting basic research question. It is, however, very promising at this point that new signal processing strategies can considerably improve speech discrimination. Consonant identification apparently may be enhanced by more detailed temporal information and specific speech feature transformations. Whether these improvements will pertain in the presence of interfering noise also remains to be verified.
Further optimization of these processing strategies should preferably be based on more data about loudness growth functions for individual electrodes or additional psychophysical measurements. Although many aspects of speech encoding can be efficiently studied using a laboratory digital signal processor it would be desirable to allow subjects more time for adjustment to a new coding strategy. Several days or weeks of habituation are sometimes required until a new mapping can be fully exploited. Thus, for scientific as well as for practical purposes the miniaturization of wearable DSP's will be of great importance.
ACKNOWLEDGEMENTS This work was supported by the Swiss National Research Foundation (grants no. 4018-10864 and 4018-10865). Implant surgery was performed by Prof. U. Fisch. Valuable help was also provided by Dr. E. von Wallenberg of Cochlear AG, Basel. We are also grateful to Prof. E. Lehnhardt of the Medizinische Hochschule Hannover for allowing us to conduct experiments with two of his patients.
REFERENCES
Clark GM, Tong YT, Patrick JF (1990). Cochlear Prostheses. Churchill Livingstone, Edinburgh, 1-264.
Dillier N, Senn C, Schlatter T, Stöckli M, Utzinger U (1990). Wearable digital speech processor for cochlear implants using a TMS320C25. Acta Otolaryngol (Stockh) Suppl 469, 120-127.
Miller GA, Nicely PE (1955). An analysis of perceptual confusions among some English consonants. J Acoust Soc Am 27, 338-352.
Skinner MW, Holden LK, Holden TA, Dowell RC (1991). Performance of postlingually deaf adults with the wearable speech processor (WSP III) and mini speech processor (MSP) of the Nucleus Multi-Electrode Cochlear Implant. Ear Hearing 12(1), 3-22.
Wang MD, Bilger RC (1973). Consonant confusions in noise: A study of perceptual features. J Acoust Soc Am 54, 1248-1266.
Wilson BS, Finley CC, Lawson DT, Wolford RD, Eddington DK, Rabinowitz WM (1991). Better speech recognition with cochlear implants. Nature 352, 236-238.