
Original Paper


Norbert Dillier, Hans Bögli, Thomas Spillmann
Department of Otorhinolaryngology, University Hospital, and Institute for Biomedical Engineering, University of Zürich and Swiss Federal Institute of Technology, Zürich, Switzerland

Key Words: Auditory prosthesis · Digital signal processing · Cochlear implants

ORL 1992;54:299-307

Digital Speech Processing for Cochlear Implants

Abstract
A rather general basic working hypothesis for cochlear implant research might be formulated as follows: signal processing for cochlear implants should carefully select a subset of the total information contained in the sound signal and transform these elements into those physical stimulation parameters which can generate distinctive perceptions for the listener. Several new digital processing strategies have accordingly been implemented on a laboratory cochlear implant speech processor for the Nucleus 22-electrode system. One approach (PES, pitch excited sampler) is based on the maximum peak channel vocoder concept, whereby the spectral energy of a number of frequency bands is transformed into appropriate electrical stimulation parameters for up to 22 electrodes, using a voice-pitch-synchronous pulse rate at each electrode. Another approach (CIS, continuous interleaved sampler) uses a maximally high, pitch-independent stimulation pulse rate on a selected number of electrodes. As only one electrode can be stimulated at any instant, the rate of stimulation is limited by the required stimulus pulse widths (determined individually for each subject) and by additional constraints and parameters which have to be optimized and fine-tuned by psychophysical measurements. Evaluation experiments with 5 cochlear implant users showed significantly improved performance in consonant identification tests with the new processing strategies as compared with the subjects' own wearable speech processors, whereas improvements in vowel identification tasks were rarely observed. The pitch-synchronous coding (PES) resulted in worse performance than the coding without explicit pitch extraction (CIS). A large portion of the improvement is probably due to better transmission of sibilance and frication (and, to a lesser extent, place of articulation) information.

Introduction
Research and development of optimized ways to restore auditory sensations and speech recognition for profoundly deaf subjects have concentrated in recent years very much on investigations of signal processing strategies. A number of technological and electrophysiological constraints imposed by the anatomical and physiological conditions of the human auditory system have to be considered. One basic working hypothesis for cochlear implants is the idea that the natural firing pattern of the auditory nerve should be approximated as closely as possible by electrical stimulation.

Correspondence: Dr. N. Dillier, Department of Otorhinolaryngology, University Hospital, CH-8091 Zürich, Switzerland


Fig. 1. Analog and digital signal processing for cochlear implants. In the example of analog signal processing, 4 band-pass filters are used to generate stimulation signals for 4 electrodes similar to the scheme employed in the Symbion/Ineraid system. Digital signal processing details are hidden in the structure of the program and the selected algorithms and stimulus parameters. ADC = Analog-digital converter; CPU = central processing unit; MEM = memory; MUX = multiplexer; CH = channel.

The central processor (the human brain) would then be able to utilize natural ('prewired' as well as learned) analysis modes for auditory perception. An alternative hypothesis is the Morse code idea, which is based on the assumption that the central processor would be flexible enough to interpret any transmitted stimulus sequence after proper training and habituation. Both hypotheses have never really been tested, for obvious reasons. On the one hand, it is not possible to reproduce the activity of 30,000 individual nerve fibers with current electrode technology; in fact, it is even questionable whether the detailed activity of a single auditory nerve fiber can be reproduced via artificial stimulation. There are a number of fundamental physiological differences in the firing patterns of acoustically versus electrically excited neurons which are hard to overcome. Spread of excitation within the cochlea and current summation are other major problems of most electrode configurations. On the other hand, the coding and transmission of spoken language requires a much larger communication channel bandwidth and more sophisticated processing than a Morse code for written text. Practical experience with cochlear implants in the past indicates that some natural relationships (such as growth of loudness and voice pitch variations) should be maintained in the encoding process. One might therefore conceive a third, more realistic, hypothesis described as follows:


Signal processing for cochlear implants should carefully select a subset of the total information contained in the sound signal and transform these elements into those physical stimulation parameters which can generate distinctive perceptions for the listener. Many researchers have designed and evaluated different systems, varying the number of electrodes and the amount of specific speech feature extraction and mapping transformations used [1]. Recently, Wilson et al. [2] reported astonishing improvements in speech test performance when they provided their subjects with high-rate pulsatile stimulation patterns rather than analog broadband signals. They attributed this effect partly to the decreased current summation obtained by nonsimultaneous stimulation of different electrodes (which might otherwise have stimulated partly the same nerve fibers and thus interacted in a nonlinear fashion) and partly to a fundamentally different, and maybe more natural, firing pattern due to the extremely high stimulation rate. Skinner et al. [3] also found significantly higher scores on word and sentence tests in quiet and noise with a new multipeak digital speech coding strategy as compared to the formerly used F0F1F2 strategy of the Nucleus WSP (wearable speech processor). These results indicate the potential gains which may be obtained by optimizing signal processing schemes for existing implanted devices.


Table 1. Subjects

Patient identification             U.T.             T.H.    H.S.             S.A.             K.W.
Sex                                Female           Male    Male             Female           Male
Date of birth, month/year          6/1941           2/1965  11/1944          7/1962           3/1947
Etiology                           Sudden deafness  Trauma  Sudden deafness  Sudden deafness  Meningitis
Duration, years                    15               3       14               1                28
Implantation date                  3/87             4/87    11/88            3/89             12/90
Side                               Left             Right   Right            Left             Left
Speech processor                   WSP              MSP     MSP              MSP              MSP
Strategy                           F0F1F2           F0F1F2  MPEAK            MPEAK            MPEAK
Electrodes                         16               20      19               20               18
Stimulus mode                      BP               BP+1    BP               BP               BP
T/C level (mean charge/phase), nC  73/137           79/157  38/84            37/62            74/130
Pulse width, µs                    150-204          204     100              100              204
Sentence test (4AFC), %            90               85      80               85               95
2-digit number test, %             55               95      85               40               80
Monosyllables test, %              5                20      15               20               10

MPEAK = Multipeak; 4AFC = four-alternative forced choice; BP = bipolar.

With single-chip programmable digital signal processors (DSPs), it has become possible to evaluate different speech coding strategies in relatively short laboratory experiments with the same subjects. Figure 1 shows the basic differences between analog and digital signal processing. In addition to the well-known strategies realized with analog filters, amplifiers and logic circuits, a DSP approach allows the implementation of much more complex algorithms. Changes in DSP algorithms require only software or parameter changes, in contrast to the modifications of electronic hardware which are necessary with analog devices. Further miniaturization and low-power operation of these processors will be possible in the near future. The present study was conducted in order to explore new ideas and concepts of multichannel pulsatile speech encoding for users of the Clark/Nucleus cochlear prosthesis. Similar methods and tools can, however, be utilized equally well to investigate alternative coding schemes for other implant systems.

Subjects and Test Procedures
Evaluation experiments were conducted with 5 postlingually deaf adults (age 26-50 years) who are cochlear implant users. As can be seen from table 1, all subjects were experienced users of their speech processors. The time since implantation ranged from 5 months (K.W.) to nearly 10 years (U.T., single-channel extracochlear implantation in 1980, reimplanted after device failure in 1987), with good sentence identification (80-95% correct responses) and number recognition (40-95% correct responses) performance, minor open speech discrimination in monosyllabic word tests (5-20% correct responses; all tests presented via computer, hearing alone) and limited use of the telephone. One subject (U.T.) still used the old wearable speech processor (WSP), which extracts only the first and second formant and thus stimulates only two electrodes per pitch period. The other 4 subjects used the new miniature speech processor (MSP) with the so-called multipeak strategy, whereby, in addition to first and second formant information, three fixed electrodes may be stimulated to convey information contained in three higher frequency bands.
The same measurement procedure to determine thresholds of hearing (T levels) and comfortable listening levels (C levels) was used for the cochlear implant digital speech processor (CIDSP) strategies as for fitting the WSP or MSP. Figure 2a shows one example of measured T and C levels for 21 bipolar electrode pairs (subject S.A.). There can be considerable variation in these values from electrode to electrode, which may reflect different electrode-to-neuron distances or varying neural excitability. Amplitude and pulse width are inversely related, as shown in figure 2b. As most subjects used fixed amplitudes and varying pulse widths (so-called stimulus levels) with their MSPs, whereas the CIDSP algorithms required fixed pulse widths and varying amplitudes, all T and C levels were remeasured prior to the speech tests. The overall loudness of processed signals was adjusted by proportional factors (T and C modifiers) if necessary, following short listening sessions with ongoing speech and environmental sounds played from a tape recorder.
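The inverse amplitude/pulse-width relation of figure 2b corresponds to approximately constant charge per phase, which is also how the T and C levels of table 1 are reported. The following minimal sketch illustrates this conversion; the constant-charge assumption and the function names are mine, not part of the original fitting software:

```python
def charge_per_phase_nc(amplitude_ua, pulse_width_us):
    """Charge per phase (nC) delivered by one stimulation phase."""
    return amplitude_ua * pulse_width_us * 1e-3

def equivalent_amplitude_ua(charge_nc, pulse_width_us):
    """Amplitude needed to deliver the same charge at another pulse width."""
    return charge_nc * 1e3 / pulse_width_us

# Example: subject S.A.'s mean C level of 62 nC (table 1) at 100 us/phase
print(equivalent_amplitude_ua(62, 100))   # 620 uA
```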


Loudness growth functions were measured using an automated randomized psychophysical test procedure to determine appropriate amplitude mapping functions (fig. 2c). Only minimal exposure to the new processing strategies was possible due to time restrictions. After about 5-10 min of listening to ongoing speech, one or two blocks of a 20-item 2-digit number test with feedback of correct or wrong responses were administered. No feedback was given during the actual test trials. All test items were presented by a second computer, which also recorded the subjects' responses entered via a touch screen terminal (for multiple-choice tests) or keyboard (number tests and monosyllabic word tests). Speech signals were either presented via loudspeaker in a sound-treated room (when patients were tested with their wearable speech processors) or processed by the CIDSP in real time and fed directly to the transmitting coil at the subject's head. Different speakers were used for the ongoing speech, the number test and the actual speech tests, respectively.
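Since the CIDSP drives fixed-pulse-width, variable-amplitude stimuli, each band energy must be compressed into the electrical dynamic range between the T and C levels of its electrode. The sketch below shows one plausible form of such an amplitude mapping; the power-law loudness growth shape, the dB range and all names are illustrative assumptions, as the paper states only that measured loudness growth functions were used to derive the mapping:

```python
import numpy as np

def map_energy_to_amplitude(energy_db, t_level_ua, c_level_ua,
                            floor_db=-40.0, ceil_db=0.0, exponent=0.6):
    """Map a band energy (dB re full scale) onto a stimulus current (uA)
    between the threshold (T) and comfortable (C) levels of one electrode,
    using an assumed compressive power-law loudness growth function."""
    # Normalize the acoustic dynamic range to 0..1.
    x = np.clip((energy_db - floor_db) / (ceil_db - floor_db), 0.0, 1.0)
    # Compressive mapping into the electrical dynamic range T..C.
    return t_level_ua + (c_level_ua - t_level_ua) * x ** exponent

# Example: hypothetical electrode with T = 120 uA and C = 480 uA
print(map_energy_to_amplitude(-12.0, 120.0, 480.0))
```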

Fig. 2. Psychophysical measurement data for subject S.A. a Thresholds of hearing (T levels) and comfortable listening levels (C levels), shown as charge per phase (nC) and resulting dynamic range (dB), for electrode pairs 1-21 (100 µs/phase). b Combinations of pulse width versus current amplitude for T and C levels measured with 4 different bipolar electrode pairs. c Loudness growth functions (subjective loudness versus current amplitude, 100 µs/phase) measured for 8 different bipolar electrode (EL.) pairs. The stimulation mode was bipolar (BP) for all measurements.

Signal Processing Strategies
A CIDSP for the Nucleus 22-channel cochlear prosthesis was designed using a single-chip digital signal processor (TMS320C25, Texas Instruments) [4]. For laboratory experiments, the CIDSP was incorporated in a general-purpose computer which provided interactive parameter control, graphical display of input and output buffers, and off-line speech file processing facilities. The experiments described in this paper were all conducted using the laboratory version of the CIDSP. Speech signals were processed as indicated in figure 3: after analog low-pass filtering (5 kHz) and analog-to-digital conversion (10 kHz), preemphasis and Hanning windowing (12.8 ms, shifted by 6.4 ms or less per analysis frame) were applied and the power spectrum was calculated via fast Fourier transform; specified speech features, such as formants and voice pitch, were extracted and transformed according to the selected encoding strategy; finally, the stimulus parameters (electrode position, stimulation mode, pulse amplitude and duration) were generated and transmitted via inductive coupling to the implanted receiver. In addition to the generation of stimulus parameters for the cochlear implant, an acoustic signal based on a perceptive model of auditory nerve stimulation was output simultaneously.
Two main processing strategies were implemented on this system. The first approach (PES, pitch excited sampler) is based on the maximum peak channel vocoder concept, whereby the time-averaged spectral energies of a number of frequency bands (approximately third-octave bands) are transformed into appropriate electrical stimulation parameters for up to 22 electrodes (fig. 4, left). The pulse rate at any electrode is controlled by the voice pitch of the input speech signal: a pitch extractor algorithm calculates the autocorrelation function of a low-pass-filtered segment of the speech signal and searches for a peak within a specified time lag interval. A random pulse rate of about 150-250 Hz is used for unvoiced speech portions.
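The analysis front end and the PES pitch extractor can be summarized in a short sketch, reconstructed from the parameters given in the text (10 kHz sampling, 12.8 ms Hanning window shifted by 6.4 ms, 64 frequency points, autocorrelation peak search). The lag search bounds and the voicing threshold are illustrative assumptions:

```python
import numpy as np

FS = 10_000        # sampling rate after 5 kHz analog low-pass filtering
N_WIN = 128        # 12.8 ms Hanning window
N_SHIFT = 64       # 6.4 ms frame shift

def power_spectra(signal, pre_emphasis=0.95):
    """Short-time power spectra as in the CIDSP front end: preemphasis,
    Hanning windowing, FFT; yields 64 frequency points per frame."""
    x = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    window = np.hanning(N_WIN)
    for start in range(0, len(x) - N_WIN + 1, N_SHIFT):
        frame = window * x[start:start + N_WIN]
        yield np.abs(np.fft.rfft(frame)[1:65]) ** 2

def pitch_period(segment, fmin=100.0, fmax=350.0, voicing_ratio=0.3):
    """Autocorrelation pitch estimate on a low-pass-filtered segment
    (which should span at least two pitch periods, e.g. 25.6 ms).
    Returns None for segments judged unvoiced, for which the PES
    strategy falls back to a random 150-250 Hz pulse rate.
    fmin, fmax and voicing_ratio are assumed values."""
    ac = np.correlate(segment, segment, mode="full")[len(segment) - 1:]
    lo, hi = int(FS / fmax), int(FS / fmin)   # lag search interval
    lag = lo + int(np.argmax(ac[lo:hi]))
    if ac[lag] < voicing_ratio * ac[0]:
        return None                  # treated as unvoiced
    return lag / FS                  # pitch period Tp in seconds
```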



Fig. 3. Digital signal processing steps: the digital speech signal is transformed into a power spectrum, speech features are extracted and encoded for transmission, and the encoded stimuli are sent via transmitter and receiver to the implanted electrodes; in parallel, an acoustic model signal is output to a loudspeaker.

Fig. 4. Schematic display of the PES and CIS-NA coding strategies for the vowel /a/. The power spectrum (64 frequency points, 0-5 kHz) is divided into 22 frequency bands. For the PES strategy, 6 spectral peaks are selected and mapped to 6 electrodes; for the CIS-NA strategy, all energy values above a preset noise cut level (NCL) are mapped to their corresponding electrodes. Tp = Pitch period; Ts = stimulus rate period.


Fig. 5. Summary of vowel (8 vowels) and consonant (12 consonants) identification test results: average total percentages of correct responses (chance-level corrected) with four different processing strategies (WSP/MSP, PES, CIS-NA, CIS-WF). Scores were corrected for chance level as follows: S = (R - CL)/(100 - CL), where R = raw score (%) and CL = chance level (%).

The second approach (CIS, continuous interleaved sampler) uses a stimulation pulse rate which is independent of the fundamental frequency of the input signal. The algorithm continuously scans all frequency bands and samples their energy levels (fig. 4, right). As only one electrode can be stimulated at any instant, the rate of stimulation is limited by the required stimulus pulse widths (determined individually for each subject) and by the time needed to transmit additional stimulus parameters. As the information about electrode number, stimulation mode, pulse amplitude and pulse width is encoded by high-frequency bursts (2.5 MHz) of different durations, the total transmission time for a specific stimulus depends on all of these parameters. This transmission time can be minimized by choosing the shortest possible pulse width combined with the maximal amplitude. For very short pulse durations, the overhead imposed by the transmission of the fixed stimulus parameters can become rather large. Consider, for example, the stimulation of electrode pair (21, 22) at 50 µs: the maximally achievable rate varies from about 3,600 Hz for high amplitudes to about 2,700 Hz for low amplitudes, whereas the theoretical limit would be close to 10,000 Hz (biphasic pulses with minimal interpulse interval). In cases with higher pulse width requirements (which may be due to poor nerve survival, unfavorable electrode position or other unknown factors), the overhead becomes smaller. In order to achieve maximally high stimulation rates for those portions of the speech input signal which are assumed to be most important for intelligibility, several modifications of the basic CIS strategy were designed, of which only the two most promising will be considered in the following. The analysis of the short-time spectra was performed either for a large number of narrow frequency bands (corresponding directly to the number of available electrodes) or for a small number (typically 6) of wide frequency bands, analogous to the approach suggested by Wilson et al. [2]. The frequency bands were logarithmically spaced from 200 to 5,000 Hz in both cases. Spectral energy within any of these frequency bands was mapped to stimulus amplitude at a selected electrode as follows: all narrow-band analysis channels whose values exceeded a noise cut level were used for CIS-NA, whereas all wide-band analysis channels, irrespective of the noise cut level, were mapped to preselected fixed electrodes for CIS-WF. Both schemes are supposed to minimize electrode interactions by preserving maximal spatial distance between subsequently stimulated electrodes. In both the PES and the CIS strategies, a high-frequency preemphasis was applied whenever a spectral gravity measure exceeded a preset threshold.
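The timing argument and the two band-to-electrode mappings can be made concrete in a small sketch. The overhead durations below are back-computed from the worked example above rather than documented protocol constants, and the noise cut level and fixed electrode positions are illustrative assumptions:

```python
def max_stimulation_rate(pulse_width_us, overhead_us):
    """Maximum pulse rate when each biphasic stimulus (two phases of
    pulse_width_us) also carries its parameter bursts (overhead_us,
    which varies with the amplitude coding)."""
    return 1e6 / (2 * pulse_width_us + overhead_us)

print(max_stimulation_rate(50, 178))   # ~3,600 Hz (high amplitudes)
print(max_stimulation_rate(50, 270))   # ~2,700 Hz (low amplitudes)
print(max_stimulation_rate(50, 0))     # ~10,000 Hz theoretical limit

def cis_na_channels(narrow_band_db, noise_cut_db=-35.0):
    """CIS-NA: one narrow band per electrode; keep every electrode
    whose band energy exceeds the noise cut level (NCL)."""
    return [(electrode + 1, energy)
            for electrode, energy in enumerate(narrow_band_db)
            if energy > noise_cut_db]

def cis_wf_channels(wide_band_db, electrodes=(2, 6, 10, 14, 18, 22)):
    """CIS-WF: typically 6 wide bands mapped to preselected fixed
    electrodes irrespective of the NCL."""
    return list(zip(electrodes, wide_band_db))
```

The interleaving itself then simply steps through the selected (electrode, amplitude) pairs one stimulus at a time, which is what keeps concurrent field interactions between electrodes low.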

Results
Systematic investigations of these processing strategies, which are still ongoing, were performed with 5 subjects. Figure 5 summarizes the results for consonant and vowel identification tests. The average scores for consonant tests with the subjects' own wearable speech processors were significantly lower than with the new CIDSP strategies. The pitch-synchronous coding (PES) resulted in worse performance than the coding without explicit pitch extraction (CIS-NA and CIS-WF). Vowel identification scores, on the other hand, were not improved by modifications of the signal processing strategy. A more detailed analysis of the consonant tests is shown in figure 6 for all subjects. The results (12 consonants in /aCa/ context, at least 144 trials per condition) are presented as percentages of information transmitted, according to the method described by Miller and Nicely [5], for the phonological features voicing, nasality, sonorance, sibilance, frication and place of articulation as listed in table 2. The analysis of the confusion matrices revealed a rather complex pattern across subjects, conditions and speech features. Overall information was best transmitted with the CIS-NA strategy (except for subject T.H., who scored slightly higher with CIS-WF).
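The feature-level scores in figure 6 follow from the Miller and Nicely [5] transmission analysis of confusion matrices. A minimal sketch of the computation (the function name is mine; the input is a stimulus-by-response count matrix):

```python
import numpy as np

def transmitted_information(confusion):
    """Relative transmitted information (%) of a confusion matrix,
    after Miller and Nicely (1955): mutual information between
    stimulus and response, normalized by the stimulus entropy."""
    p = confusion / confusion.sum()
    px = p.sum(axis=1, keepdims=True)      # stimulus probabilities
    py = p.sum(axis=0, keepdims=True)      # response probabilities
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log2(p / (px * py)), 0.0)
    h_x = -np.sum(px * np.log2(px))
    return 100.0 * terms.sum() / h_x
```

For a feature such as voicing, the 12 x 12 consonant matrix is first pooled into the feature classes of table 2 (e.g. voiced versus unvoiced) before this function is applied.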


Fig. 6. Information transmission analysis for the 12-consonant identification test confusion matrices, showing percent transmitted information for overall information (OVERALL) and for the features voicing (VOI), nasality (NAS), sonorance (SON), sibilance (SIB), frication (FRI) and place of articulation (PLC), with four different signal processing strategies (MSP, except in a, where the WSP was used; PES; CIS-NA; CIS-WF). a Subject U.T. b Subject T.H. c Subject H.S.


Fig. 6 (continued). d Subject S.A. e Subject K.W.

Improvements of 40% (U.T.) and 20% (T.H. and H.S.) could be observed relative to the subjects' own wearable speech processors. It can be seen in figure 6 that these subjects performed significantly better with at least some of the new CIDSP strategies than with their own wearable speech processors. No significant improvement with either PES or CIS was noted for subject K.W. The best-transmitted speech feature for most subjects and strategies was sonorance. The largest improvements with CIS for U.T. and T.H. were achieved for sibilance and frication, whereas the other subjects showed either moderate improvement or even worse performance for these high-frequency features with CIS compared to their own wearable MSP.

Table 2. Consonant phoneme features

Phoneme    p  t  k  b  d  g  m  n  l  r  f  s
Voicing    -  -  -  +  +  +  +  +  +  +  -  -
Nasality   -  -  -  -  -  -  +  +  -  -  -  -
Sonorance  -  -  -  -  -  -  +  +  +  +  -  -
Sibilance  -  -  -  -  -  -  -  -  -  -  -  +
Frication  -  -  -  -  -  -  -  -  -  -  +  +
Place      1  2  3  1  2  3  1  2  2  3  2  2
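For the feature-level pooling used in figure 6, table 2 can be expressed as a lookup; a sketch with the feature values transcribed from the table (the dict encoding itself is mine):

```python
# Feature values per consonant, transcribed from table 2
# (voicing, nasality, sonorance, sibilance, frication, place).
FEATURES = {
    "p": (0, 0, 0, 0, 0, 1), "t": (0, 0, 0, 0, 0, 2), "k": (0, 0, 0, 0, 0, 3),
    "b": (1, 0, 0, 0, 0, 1), "d": (1, 0, 0, 0, 0, 2), "g": (1, 0, 0, 0, 0, 3),
    "m": (1, 1, 1, 0, 0, 1), "n": (1, 1, 1, 0, 0, 2), "l": (1, 0, 1, 0, 0, 2),
    "r": (1, 0, 1, 0, 0, 3), "f": (0, 0, 0, 0, 1, 2), "s": (0, 0, 0, 1, 1, 2),
}

def feature_classes(feature_index):
    """Group the consonants by the value of one feature; e.g. index 0
    (voicing) yields the unvoiced and voiced classes used to pool a
    confusion matrix before the transmission analysis."""
    classes = {}
    for phoneme, values in FEATURES.items():
        classes.setdefault(values[feature_index], []).append(phoneme)
    return classes
```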


Considerable improvements in transmitted voicing information were observed with CIS for all subjects except K.W., although this processing mode does not explicitly encode this feature. Finally, the improvements in place-of-articulation transmission (U.T., T.H., S.A.) may indicate that increased stimulation rates are indeed more effective in signalling the formant transitions which distinguish phonemes articulated at different vocal tract positions.

Discussion and Conclusions
The above speech test results should be regarded as preliminary. The number of subjects is still very small, and data collection has not yet been completed for all of them in every processing condition. It is, however, very promising at this point that new signal processing strategies can improve speech discrimination considerably. Consonant identification apparently may be enhanced by more detailed temporal information and specific speech feature transformations. Whether these improvements will persist in the presence of interfering noise also remains to be verified. Further optimization of these processing strategies should preferably be based on more specific data about loudness growth functions for individual electrodes or on additional psychophysical measurements. Although many aspects of speech encoding can be efficiently studied using a laboratory DSP, it would be desirable to allow subjects more time for adjustment to a new coding strategy. Several days or weeks of habituation are sometimes required until a new mapping can be fully exploited. Thus, for scientific as well as practical purposes, the miniaturization of wearable DSPs will be of great importance.

Acknowledgements
This work was supported by the Swiss National Research Foundation (grants No. 4018-10864 and 4018-10865). Implant surgery was performed by Prof. U. Fisch. Valuable help was also provided by Dr. E. von Wallenberg of Cochlear AG, Basel, Switzerland.

References
1 Clark GM, Tong YC, Patrick JF: Cochlear Prostheses. Edinburgh, Churchill Livingstone, 1990.
2 Wilson BS, Lawson DT, Finley CC, Wolford RD: Coding strategies for multichannel cochlear prostheses. Am J Otol 1991;12(suppl 1):55-60.
3 Skinner MW, Holden LK, Holden TA, Dowell RC, et al: Performance of postlingually deaf adults with the wearable speech processor (WSP III) and mini speech processor (MSP) of the Nucleus multi-electrode cochlear implant. Ear Hear 1991;12:3-22.
4 Dillier N, Senn C, Schlatter T, Stöckli M, Utzinger U: Wearable digital speech processor for cochlear implants using a TMS320C25. Acta Otolaryngol Suppl (Stockh) 1990;469:120-127.
5 Miller GA, Nicely PE: An analysis of perceptual confusions among some English consonants. J Acoust Soc Am 1955;27:338-352.
