
Neural Coding of Phonemic Fricative Contrast With and Without Hearing Aid

Sharon Miller1 and Yang Zhang1,2

Objective: To determine whether auditory event-related potentials (ERPs) to a phonemic fricative contrast (“s” and “sh”) show significant differences in listening conditions with or without a hearing aid and whether the aided condition significantly alters a listener’s ERP responses to the fricative speech sounds.


Design: The raw EEG data were collected using a 64-channel system from 10 healthy adult subjects with normal hearing. The fricative stimuli were digitally edited versions of naturally produced syllables, /sa/ and /∫a/. The evoked responses were derived in unaided and aided conditions by using an alternating block design with a passive listening task. Peak latencies and amplitudes of the P1-N1-P2 components and the N1’ and P2’ peaks of the acoustic change complex (ACC) were analyzed.

Results: The evoked N1 and N1’ responses to the fricative sounds significantly differed in the unaided condition. The fricative contrast also elicited distinct N1-P2 responses in the aided condition. While the aided condition increased and delayed the N1 and ACC responses, significant differences in the P1-N1-P2 and ACC components were still observed, which would support fricative contrast perception at the cortical level.

Conclusion: Despite significant alterations in the ERP responses by the aided condition, normal-hearing adult listeners showed distinct neural coding patterns for the voiceless fricative contrast, “s” and “sh,” with or without a hearing aid.

Key words: Acoustic change complex, Event-related potential, Fricative, Hearing aid.

(Ear & Hearing 2014;35;e122–e133)

1Department of Speech-Language-Hearing Sciences and 2Center for Neurobehavioral Development, University of Minnesota, Minneapolis, Minnesota, USA.

*According to Näätänen’s definition, the long-latency cortical auditory evoked potential responses such as P1, N1, and P2 are referred to as ERPs in this article.

Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and text of this article on the journal’s Web site (www.ear-hearing.com).

INTRODUCTION

The perception of fricative speech sounds depends on time-varying spectral cues. For English voiceless fricatives, recognition primarily depends on the spectral shape of the frication noise and the dynamic formant transitions between the fricative and vowel (Hughes & Halle 1956; Harris 1958; Heinz & Stevens 1961; Zeng & Turner 1990; Pittman & Stelmachowicz 2000; Hedrick & Younger 2003). Listeners with normal hearing (NH) can not only achieve sufficient voiceless fricative recognition with only steady-state frication cues, but they can also use dynamic formant transition cues to enhance place of articulation information, especially at low presentation levels (Zeng & Turner 1990; Hedrick & Younger 2003). On the contrary, when controlling for audibility, hearing-impaired listeners rely mainly on frication spectrum cues to recognize voiceless fricatives and are relatively poor at using the dynamic formant transitions (Zeng & Turner 1990). Neural coding of the critical frication spectrum cue and the brain mechanisms that underlie fricative perception in various listening conditions remain poorly understood. The present EEG study examined NH listeners to determine whether the spectral frication cues of the English /s/-/∫/ contrast are differentially coded and whether the use of a hearing aid would alter neural coding of the voiceless fricatives.

Auditory evoked responses and event-related potentials (ERPs) have been widely adopted to study the neural processing of complex speech and nonspeech stimuli in the auditory system (Näätänen et al. 2004).* Auditory ERPs represent a noninvasive technique for measuring synchronous postsynaptic cortical activity and are obtained by averaging EEG epochs time-locked to repeated stimulus presentations (Näätänen & Winkler 1999; Key et al. 2005). The ERP technique features temporal resolution on the order of milliseconds, making it a useful tool for studying the time course of rapid neural processing of speech stimuli (Martin et al. 2008). Previous studies have indicated that the P1-N1-P2 components of the ERP response or their magnetic counterparts measured in magnetoencephalography (MEG) reflect neural encoding of the critical acoustic features that define various consonant and vowel categories such as voice onset time (Sharma et al. 2000; Zaehle et al. 2007; Digeser et al. 2009), place of articulation (Tavabi et al. 2007), and manner of articulation (Hari 1991; Zhang et al. 2005).

The first goal of this study was to determine whether the P1-N1-P2 complex for the frication portion of the consonant–vowel (CV) syllables /sa/ and /∫a/ differed in listeners with NH. The English voiceless sibilant contrast /s/-/∫/ differs in place of articulation and peak spectral energy; the alveolar /s/ usually has spectral peak energy around 4 to 8 kHz, and the palatoalveolar /∫/ tends to contain spectral peak energy around 2 to 5 kHz (Ladefoged 1962; Stevens 1998). Despite the existence of salient and distinct spectral cues, studies using the /s/-/∫/ contrast have not found definitive evidence in the P1-N1-P2 complex that could statistically differentiate the two sounds. For instance, Agung et al. (2006) recorded P1-N1-P2 responses to a number of speech sounds varying in spectral energy, which included /m/, /u/, /a/, /ɔ/, /i/, /∫/, and /s/. Relative to the sounds dominated by lower frequencies (/m, u, a, i, ɔ/), the fricatives /s/ and /∫/ elicited significantly smaller and later N1-P2 peak amplitudes. While the data showed evidence that auditory ERPs were sensitive to spectral differences in the speech stimuli, the P1-N1-P2 complex elicited by /s/ did not significantly differ from the P1-N1-P2 response elicited by /∫/. Tremblay and colleagues (2003) also investigated whether the naturally produced fricative-vowel syllables /si/ and /∫i/ could produce reliably different evoked potential responses in NH listeners. Although the N1 and P2 peaks for /s/ appeared larger than those elicited by /∫/, the /s/-/∫/ differences did not achieve statistical significance.
Their ERP data further indicated the presence of a pair of sequential N1-P2 complexes, reflecting the time lag differences between the onset of the consonant and the onset of the vowel for the fricative-to-vowel transition. Such double-peaked response patterns were shown in other EEG and MEG studies for speech as well as nonspeech involving a distinct acoustic change or transition within the stimulus (Kaukoranta et al. 1987; Hari 1991; Sharma et al. 2000; Zhang et al. 2005), and some EEG researchers named the phenomenon “the acoustic change complex” (ACC) (Ostroff et al. 1998; Martin & Boothroyd 1999). As the formant transition is an important cue for consonant identification, it remains to be investigated whether the ACC responses would reflect the /s/-/∫/ distinction when the duration of the fricative portion is controlled (Supplemental Digital Content 1, http://links.lww.com/EANDH/A134).

The second goal of this study was to introduce hearing aid signal enhancement in the stimuli and investigate whether the neural responses would faithfully reflect the perception of the altered stimuli. Hearing aids are commonly prescribed to improve speech audibility in persons with hearing loss. Given the importance of the frication spectrum in identifying voiceless fricatives in NH and hearing-impaired listeners (Zeng & Turner 1990), it is important to determine whether neural responses to the frication spectra of the /s/-/∫/ contrast also differ with the use of a hearing aid (Supplemental Digital Content 2, http://links.lww.com/EANDH/A135). To date, only a few studies have recorded ERPs in NH listeners wearing hearing aids (Tremblay et al. 2006; Billings et al. 2007, 2011). NH listeners were used in these studies to avoid confounding factors from various degrees and conditions of hearing loss. The data indicated that ERPs could be reliably recorded to the voiceless fricative-vowel stimuli when listeners wore hearing aids (Tremblay et al. 2006) and that the distinct ACC patterns elicited by /si/ and /∫i/ were preserved. However, the initial P1-N1-P2 responses for the /s/ and /∫/ phonemic contrast were not found to significantly differ. Furthermore, hearing aid amplification did not result in significant changes in the N1 response for /s/ or /∫/ at the group level. Billings et al. (2007) found consistent ERP results showing a lack of hearing aid amplification effects (approximately 20 dB of gain) in NH listeners at both low and high intensity levels. This finding was attributed to the signal processing of the hearing aid and how the central auditory system dealt with the altered input, including the amplification of low-level environmental or circuit noise by the hearing aid. However, the reported data also showed very large individual differences in terms of the effects of hearing aid amplification, and significant differences were observed in the neural responses for a 20 dB increase in stimulus intensity level without the hearing aid.

There are at least two questions that merit further study. First, why would the ERP responses not reflect the large spectral differences in /s/ and /∫/ that support clear behavioral discrimination of the sounds in either an aided or unaided condition? Second, why would the use of a hearing aid result in no changes in the neural responses, assuming there is an improved signal-to-noise ratio (SNR) for the clearly discriminable stimuli in the aided condition? One important issue here could be stimulus characteristics.
To focus more specifically on how frication cues affect the ERPs, we tightly controlled the physical duration of our speech stimuli by equating the fricative and vowel durations and using nonsense syllables, /sa/ and /∫a/. Another issue is that stimulus presentation strategies can significantly affect the neural responses (Martin et al. 2010). Recently, Zhang et al. (2011) successfully used a new alternating short block recording protocol to examine neural coding of speech sounds in infants. This presentation protocol was designed to highlight the spectral contrast in the speech stimuli and reduce the neural adaptation or habituation common to long presentation blocks that contain identical stimuli (Woods & Elmasian 1986; Dehaene-Lambertz & Dehaene 1994; May & Tiitinen 2010). For the present study, we were interested in testing whether this alternating block recording paradigm might be sensitive enough to show the /s/-/∫/ distinction as well as the effects of listening condition (unaided versus aided).

The basic assumption of our study was that neural representations of speech sounds would reflect the spectral cue differences in the /s/-/∫/ contrast. We hypothesized that ERP responses to the fricative speech contrast would show significant differences in both aided and unaided conditions. We also expected to see effects of listening condition in the N1-P2 and ACC responses.

SUBJECTS AND METHODS

Subjects

Ten (5 female, 5 male) right-handed adult listeners, ranging in age from 19 to 27 years, participated in the study. All participants were native speakers of American English, had NH, and reported no history of speech, language, or cognitive impairments. Informed consent was obtained in compliance with the institutional Human Research Protection Program at the University of Minnesota. All subjects passed a hearing screening with a 1000 Hz tone presented at 20 dB HL.

Stimuli

The speech stimuli were 350 msec nonsense CV syllables, “sa” [sa] and “sha” [∫a]. The present design strictly controlled the duration parameters for the fricative consonant and vowel portions because evoked response peak latencies are directly related to the duration of the critical acoustic cues in the stimuli (Sharma et al. 2000; Zhang et al. 2005). The fricative portion of each stimulus was 150 msec and the vowel portion was 200 msec. The stimuli were digitally edited in Sony Sound Forge 9.0 (Sony Creative Software, Middleton, WI) using naturally recorded speech. A native female speaker of American English produced the speech syllables three times each. The stimuli were spoken into a Sennheiser high-fidelity microphone in a sound booth (ETS-Lindgren Acoustic Systems, Cedar Park, TX), and the speech tokens were digitally recorded to disk (44.1 kHz). The stimulus tokens were then selected, and the fricative and vowel durations were equated separately by applying temporal stretching and shrinking using the pitch synchronous overlap-add technique (Moulines & Charpentier 1990). The stimuli were then equated for root mean square (RMS) intensity level. The digital editing did not affect the intelligibility of the speech sounds, as confirmed in behavioral tests (see behavioral data in Results).
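For illustration, the RMS level-matching step can be sketched as follows. This is a minimal Python sketch rather than the Sound Forge/MATLAB workflow used for the actual stimuli; the function names and toy signals are our own.

```python
import numpy as np

def rms(x):
    """Root-mean-square amplitude of a waveform array."""
    return np.sqrt(np.mean(np.square(x)))

def match_rms(x, reference):
    """Scale waveform x so that its RMS level equals that of the reference."""
    return x * (rms(reference) / rms(x))

# Toy demonstration with noise bursts standing in for the edited /sa/ and /sha/ tokens.
rng = np.random.default_rng(0)
sa = 0.10 * rng.standard_normal(44100)    # 1 s at 44.1 kHz
sha = 0.03 * rng.standard_normal(44100)
sha_matched = match_rms(sha, sa)
print(round(20 * np.log10(rms(sa) / rms(sha_matched)), 2))   # 0.0 dB level difference
```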

Fig. 1. Schematic illustration of stimulus presentation protocol with alternating blocks. ISI was randomized between 900 and 1000 msec. ISI indicates interstimulus interval.

ERP Stimulus Presentation Protocol

Participants were seated in a comfortable chair in an electrically and acoustically treated room (ETS-Lindgren Acoustic Systems). Stimuli were presented in a free sound field using EEVoke software (ANT Inc., Enschede, The Netherlands) via bilateral loudspeakers (M-Audio BX8a). The loudspeakers were placed at an approximately 60-degree azimuth angle relative to each participant.
The sound level of the stimuli was calibrated to 60 dB SPL at the subject’s head for the unaided condition. The stimuli were presented in both aided and unaided conditions with the order of presentation blocks counterbalanced among listeners. In the aided condition, a behind-the-ear hearing aid was coupled to each listener’s right ear by using a foam earpiece. A foam earplug was placed in the unaided ear during the aided recordings so that the ERP responses would properly reflect the auditory input from the hearing aid processor for the aided listening condition. The unaided condition did not implement the placement of an earplug so that both the aided and unaided listening conditions would simulate a more naturalistic experience. It is important to note that this experimental setup confines the focus of our study to determining whether the ERP responses to /sa/ and /∫a/ differ with and without a hearing aid. Any observed differences between the aided and unaided conditions would be a composite reflection of amplification/signal processing at the different frequency bands of the auditory input and the different setups for the EEG recordings regarding the use of the earplug.

The study used a passive listening task with an alternating short block design (Zhang et al. 2011) (Fig. 1). Each block contained 20 stimuli of one sound category followed by a block that contained stimuli of the other category. The blocks were sequentially alternated to collect sufficient trials for both stimuli with an equal stimulus ratio. During the experiment, participants watched a muted movie of their choice on a 20-in LCD TV located approximately 2.5 m from the listener. The entire EEG recording session lasted approximately 60 min. The interstimulus interval (offset to onset) was randomized between 900 and 1000 msec. The interblock silence period was 2 seconds, and there were at least 120 recorded trials of each stimulus. To exclude any mismatch negativity response arising from the alternating blocks, the first trial of every block was excluded from averaging.
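The presentation logic can be illustrated with the short sketch below. It is not the EEVoke configuration itself; the number of blocks per stimulus is a hypothetical choice (set here so that at least 120 usable trials per stimulus remain after dropping the first trial of each block), and the function and variable names are our own.

```python
import random

def build_alternating_blocks(stimuli=("sa", "sha"), n_blocks_per_stim=7,
                             trials_per_block=20, isi_range=(0.9, 1.0)):
    """Return (stimulus, isi_seconds, use_in_average) trials in alternating short blocks.

    Blocks of one category alternate with blocks of the other category so that the
    two stimuli occur in an equal ratio; the offset-to-onset ISI is randomized per
    trial, and the first trial of each block is flagged for exclusion from averaging.
    """
    trials = []
    for block_index in range(2 * n_blocks_per_stim):
        stim = stimuli[block_index % 2]              # alternate /sa/ and /sha/ blocks
        for trial_index in range(trials_per_block):
            isi = random.uniform(*isi_range)
            trials.append((stim, isi, trial_index > 0))
    return trials

trials = build_alternating_blocks()
usable_sa = sum(1 for stim, _, keep in trials if keep and stim == "sa")
print(len(trials), usable_sa)    # 280 presented trials; 133 usable /sa/ trials
```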

Hearing Aid Description
According to the manufacturer specifications, the 12-channel digital behind-the-ear hearing aid used had a frequency range of 200 to 6400 Hz, a peak full-on gain of 60 dB SPL, and a high-frequency average full-on gain of 54 dB SPL. The hearing aid was programmed to be omnidirectional with output limiting compression and roughly 20 dB of gain. The hearing aid used multichannel compression, and the average compression ratio was 1.125:1 across channels. As recommended by the manufacturer, the hearing aid was set to maximize speech intelligibility by using fast time constants (1 to 10 msec attack time) and higher knee points in the low-frequency compression channels (threshold kneepoint [TK] = 50 dB) relative to the higher-frequency channels (TK = 30 dB). The hearing aid had adaptive feedback and noise reduction mechanisms, but these were deactivated for the recordings. The hearing aid’s digital signal processing delay was measured to be 4.5 msec. The hearing aid’s real ear insertion gain (real ear unaided gain subtracted from the real ear aided gain) to a 60 dB SPL digital speech stimulus was verified using a Knowles Electronics Manikin for Acoustics Research (KEMAR) (G.R.A.S. Sound and Vibration, Holte, Denmark) (Fig. 2). The hearing aid settings were electroacoustically verified in a 2-cm3 coupler (Fonix 7000; Frye Electronics Inc., Tigard, OR) before each session.
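To show how the reported compression settings relate input level to output level, a simplified static input-output function is sketched below. This is a generic wide dynamic range compression rule, not the manufacturer's algorithm; it ignores attack/release dynamics and per-channel gain differences, so the numbers are purely illustrative.

```python
def static_compression_output(input_db, gain_db=20.0, knee_db=50.0, ratio=1.125):
    """Static input-output level for a single compression channel (illustrative only).

    Below the threshold kneepoint the channel applies linear gain; above it, level
    growth is reduced by the compression ratio.
    """
    if input_db <= knee_db:
        return input_db + gain_db
    return knee_db + gain_db + (input_db - knee_db) / ratio

# A 60 dB SPL input to a channel with a 50 dB kneepoint and a 1.125:1 ratio comes out
# about 1 dB below what purely linear 20 dB gain would give.
print(round(static_compression_output(60.0), 1))              # 78.9
print(round(static_compression_output(60.0, ratio=1.0), 1))   # 80.0 (linear reference)
```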

In-the-Canal Recordings

Sound field in-the-canal recordings of the /sa/ and /∫a/ stimuli were made using KEMAR (G.R.A.S. Sound and Vibration) in both unaided and aided conditions (Fig. 3). The outputs from the internal microphones in KEMAR were routed to the sound card (Gina; Echo Audio, Santa Barbara, CA) of a personal computer via an external audio interface and recorded in Audacity. Further analysis was done using Praat (Boersma & Weenink 1999) and MATLAB (MathWorks, Version 8.0). The aided and unaided SNRs were derived from the in-the-canal recordings (broadband A weighting was applied to the noise and speech waveforms). The background noise was characterized by windowing the 300 msec immediately preceding the onset of the speech stimulus. RMS amplitudes were computed for each segment. Table 1 describes the dB SPL intensity levels of the fricative and vowel portions of the stimuli measured in the canal in the aided and unaided conditions.
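A minimal sketch of the SNR computation from such a recording is given below. It assumes the A-weighting has already been applied to the waveform (the weighting filter itself is omitted), and the function names, onset markers, and toy signals are our own.

```python
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def segment_snr_db(recording, fs, speech_onset_s, speech_dur_s, noise_window_s=0.3):
    """SNR in dB for an in-the-canal recording (weighting assumed already applied).

    The noise floor is taken from the window immediately preceding speech onset,
    and the speech segment starts at the marked onset.
    """
    onset = int(round(speech_onset_s * fs))
    noise = recording[onset - int(round(noise_window_s * fs)):onset]
    speech = recording[onset:onset + int(round(speech_dur_s * fs))]
    return 20.0 * np.log10(rms(speech) / rms(noise))

# Toy check: a 0.35 s "speech" segment roughly 20 dB above the noise floor.
fs = 44100
rng = np.random.default_rng(1)
noise = 0.01 * rng.standard_normal(int(0.5 * fs))
speech = 0.1 * rng.standard_normal(int(0.35 * fs)) + 0.01 * rng.standard_normal(int(0.35 * fs))
recording = np.concatenate([noise, speech])
print(round(segment_snr_db(recording, fs, speech_onset_s=0.5, speech_dur_s=0.35), 1))  # ~20
```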

Fig. 2. Real ear insertion gain for the hearing aid in dB SPL. The overall real ear insertion gain (real ear aided gain minus real ear unaided gain) to a 60 dB SPL speech input was measured in Knowles Electronics Manikin for Acoustics Research (KEMAR).





Fig. 3. Unaided and aided waveforms of the /sa/ and /∫a/ in-the-canal acoustic recordings using KEMAR in relative amplitude. Corresponding spectrograms from the in-the-canal recordings are displayed below the waveforms.

TABLE 1. Mean intensity in dB SPL (A weighted) for the fricative and vowel portions of the /sa/ and /∫a/ stimuli measured in the canal with and without the hearing aid using KEMAR

              Unaided /sa/     Unaided /∫a/     Aided /sa/        Aided /∫a/
Fricative     69.9 (±3.1)      71.8 (±2.1)      69.13 (±0.91)     73.7 (±0.88)
Vowel         69.5 (±1.6)      69.0 (±1.9)      79.4 (±1.8)       79.9 (±1.5)

Standard deviations of the mean are in parentheses.

EEG Data Acquisition

Continuous EEG activity was recorded using the Advanced Neuro Technology EEG system and a 64-channel Waveguard cap (ANT, Inc., Enschede, The Netherlands) (Rao et al. 2010). The EEG data were band-pass filtered (0.016 to 200 Hz) and digitized using a sampling rate of 512 Hz. The Ag/AgCl electrodes on the cap were arranged in the standard 10–20 system with additional intermediate positions. The ground electrode was located at the AFz position. The average electrode impedance was below 5 kohm, and impedances were checked and adjusted with additional electrode gel before each recording condition.

ERP Waveform Analysis

ERP averaging was performed off-line with a linked-mastoid reference using the Advanced Neuro Technology EEG system (Advanced Source Analysis version 4.7) and further analyzed in MATLAB. The ERP epoch contained a 700 msec recording window and a 100 msec prestimulus baseline. Trials containing artifacts that exceeded ±50 μV were removed. After artifact rejection, data were band-pass filtered from 0.5 to 40 Hz for averaging. An average of 112 trials remained after artifact rejection across subjects in the aided and unaided conditions. Peak amplitudes and latencies for the P1, N1, and P2, elicited by the fricative, and the ACC, elicited by the CV transition to the following vowel, were extracted from the averaged waveforms of each subject. On the basis of the grand mean ERP waveforms, the following latency ranges were used in extracting the peaks elicited by the fricative consonant in the CV stimuli: P1, 35 to 80 msec; N1, 85 to 170 msec; P2, 165 to 245 msec. For the ACC peaks to the CV transition and vowel, the latency ranges were as follows: N1’, or the initial negative peak of the ACC, ranged from 240 to 310 msec, and P2’, the initial positive peak of the ACC, ranged from 300 to 380 msec.
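The waveform analysis steps just described (±50 μV artifact rejection, 0.5 to 40 Hz filtering, epoch averaging, and peak picking within the listed latency windows) can be illustrated with the simplified sketch below. It is not the Advanced Source Analysis/MATLAB pipeline used in the study; the function names and the assumed epoch array layout are our own.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 512  # Hz, EEG sampling rate reported in the acquisition section

def reject_artifacts(epochs_uv, threshold_uv=50.0):
    """Drop epochs (trials x samples) whose absolute amplitude exceeds the threshold."""
    keep = np.max(np.abs(epochs_uv), axis=1) <= threshold_uv
    return epochs_uv[keep]

def bandpass(x, low=0.5, high=40.0, fs=FS, order=2):
    """Zero-phase Butterworth band-pass filter applied before averaging."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x, axis=-1)

def peak_in_window(erp_uv, t_ms, window_ms, polarity):
    """Return (latency_ms, amplitude_uv) of the extreme point inside a latency window."""
    mask = (t_ms >= window_ms[0]) & (t_ms <= window_ms[1])
    segment = erp_uv[mask]
    idx = np.argmax(segment) if polarity == "+" else np.argmin(segment)
    return float(t_ms[mask][idx]), float(segment[idx])

# Hypothetical usage for one electrode, using the latency windows listed in the text.
# epochs_uv: (n_trials, n_samples) array spanning -100 to 600 msec around stimulus onset.
# t_ms = np.arange(-100, 600, 1000.0 / FS)
# erp = bandpass(reject_artifacts(epochs_uv)).mean(axis=0)
# n1 = peak_in_window(erp, t_ms, (85, 170), "-")
# p2 = peak_in_window(erp, t_ms, (165, 245), "+")
# n1_prime = peak_in_window(erp, t_ms, (240, 310), "-")   # ACC negative peak
# p2_prime = peak_in_window(erp, t_ms, (300, 380), "+")   # ACC positive peak
```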

Statistical Analysis

Effects of the phonetic identity (/s/ versus /∫/) and listening condition (unaided versus aided) on peak amplitudes and latencies from the individual data were assessed using repeated-measures analysis of variance (ANOVA). Post hoc repeated-measures univariate ANOVAs were performed on all significant main effects. To examine region effects (frontal, central, parietal, midline frontal, midline central, and midline parietal), the electrodes were grouped for the statistical analysis (Rao et al. 2010; Zhang et al. 2011) and electrode region was included as a between-subjects factor in the ANOVA (6 levels). The electrode groups used in the analysis were as follows: the frontal electrodes included F3, F5, F7, FC3, FC5, FT7, F4, F6, F8, FC4, FC6, and FT8; the central electrodes included T7, TP7, C3, C5, CP3, CP5, T8, TP8, C4, C6, CP4, and CP6; the parietal electrodes included P3, P5, P7, PO3, PO5, PO7, P4, P6, P8, PO4, PO6, and PO8; the midline frontal electrodes included F1, Fz, F2, FC1, FCz, and FC2; the midline central electrodes included C1, Cz, C2, CP1, CPz, and CP2; and the midline parietal electrodes included P1, Pz, P2, and POz. To be consistent with our previous publications that examined hemispheric effects (Rao et al. 2010; Zhang et al. 2011), an initial ANOVA was performed on the nonmidline frontal, central, and parietal electrodes across right and left electrode sites, and midline electrodes were excluded from that analysis. However, because no significant hemispheric effects or interactions were found, the hemisphere factor was excluded from our reported ANOVA model and we collapsed the nonmidline left and right electrodes for the frontal, central, and parietal sites, respectively. Our electrode grouping choice took into account recommendations for application of repeated-measures ANOVA to high-density ERP data (Dien & Santuzzi 2005; Luck 2005). To avoid complications for data interpretation due to the disparity in the number of electrodes for the grouped sites, significant electrode effects were further examined by performing a repeated-measures ANOVA with the listening condition and fricative identity factors on each individual electrode group separately. Where applicable, Bonferroni or Greenhouse-Geisser corrections were applied to the reported p values.

Global field power (GFP) differences to the fricative contrast were analyzed in separate unaided and aided analyses to validate the ERP waveform analysis. GFP is a measure of response strength, independent of electrode site, and quantifies the standard deviation of potential values across all electrodes at each sampling point in the recording epoch (Lehmann & Skrandies 1984). To obtain detailed information about the temporal evolution of significant differences between the /sa/ and /∫a/ stimuli, we assessed the GFP data using a point-to-point analysis. In this analysis, effects for the phonemic contrast were examined at every time point of the ERP data in the poststimulus time window at the same latency point across subjects. For this analysis, z scores were derived by comparing the poststimulus RMS differences between /sa/ and /∫a/ at each sampling point (600 msec poststimulus response from the grand mean waveforms) relative to the distribution of RMS amplitude differences to the fricative contrast in the 100 msec prestimulus baseline (Rao et al. 2010). Bonferroni-corrected p values for the point-by-point differences to the fricative contrast were calculated in the aided and unaided conditions. Only point-by-point differences (p < 0.01) lasting at least 10 consecutive samples (approximately 20 msec) were considered significant (Zhang et al. 2011).
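A compact illustration of the GFP and point-to-point statistics is sketched below. It follows the description above in simplified form (the published analysis followed Rao et al. 2010 and Zhang et al. 2011); the array layout, function names, baseline length, and the z criterion shown are our own illustrative assumptions.

```python
import numpy as np

def global_field_power(erp_uv):
    """GFP at each time sample: standard deviation of potentials across electrodes.

    erp_uv has shape (n_electrodes, n_samples)."""
    return erp_uv.std(axis=0)

def pointwise_z(diff_post, diff_baseline):
    """z score of each poststimulus difference sample against the distribution of
    difference values in the prestimulus baseline."""
    return (diff_post - diff_baseline.mean()) / diff_baseline.std()

def significant_runs(z, z_crit, min_run=10):
    """Keep only runs of at least min_run consecutive suprathreshold samples
    (about 20 msec at a 512 Hz sampling rate)."""
    above = np.abs(z) >= z_crit
    out = np.zeros_like(above)
    start = None
    for i, flag in enumerate(np.append(above, False)):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_run:
                out[start:i] = True
            start = None
    return out

# Hypothetical usage with grand mean waveforms (electrodes x samples) for each stimulus:
# gfp_diff = global_field_power(erp_sa) - global_field_power(erp_sha)
# n_base = 51                              # ~100 msec of baseline at 512 Hz
# z = pointwise_z(gfp_diff[n_base:], gfp_diff[:n_base])
# sig = significant_runs(z, z_crit=3.0)    # illustrative threshold; the paper used
#                                          # Bonferroni-corrected p < 0.01
```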

Behavioral Protocol

Half of the subjects who completed the ERP portion of the study also completed a behavioral identification test of the speech stimuli in a randomized block to ensure that they could hear the clear distinction between /sa/ and /∫a/. The behavioral identification test was completed with and without the hearing aid using the same stimulus presentation settings as the ERP experiment. Each stimulus was presented 40 times with an interstimulus interval of 1.5 seconds. The order of aided and unaided conditions was counterbalanced across the listeners.

RESULTS

Behavioral Data

The behavioral results indicated that listeners easily identified /sa/ and /∫a/ with and without the hearing aid. All subjects achieved 100% correct identification for /sa/ in the unaided and aided conditions and for /∫a/ in the unaided condition. For aided /∫a/ identification, all but 1 subject, who scored 97.5% correct, achieved 100% correct identification.

ERP Data

Clear P1-N1-P2 components and ACC responses for the speech stimuli were observed across all electrode regions in both unaided (Fig. 4) and aided (Fig. 5) listening conditions. The grand mean peak amplitudes, latencies, and standard deviations, averaged across the six electrode regions used in the statistical analysis, are summarized for each ERP component of interest in Table 2. Separate repeated-measures ANOVAs for P1, N1, P2, N1’, and P2’ peak latencies and amplitudes were performed; Table 3 summarizes the full model ANOVA results.

Fig. 4. Unaided grand mean waveforms for the selected electrode regions (negative plotted up for the event-related potential signal).

Fig. 5. Aided grand mean waveforms for the selected electrode regions (negative plotted up for the event-related potential signal).

N1-P2 Responses to the Fricative Onset

Repeated-measures ANOVA for N1 amplitudes indicated a significant interaction between fricative identity and listening condition (F(1,84) = 12.07, p < 0.01) and a significant main effect of listening condition (F(1,84) = 6.51, p < 0.01). The electrode region factor just failed to reach significance (F(5,84) = 2.81, p = 0.06). Post hoc tests indicated that unaided N1 peak amplitudes for /s/ were significantly greater than for /∫/ (F(1,84) = 6.87, p < 0.01). In the aided condition, /s/ and /∫/
elicited significantly different N1 amplitudes as well; however, with the hearing aid, N1 amplitudes were larger for /∫/ than /s/ (F(1,84) = 6.54, p < 0.01). Repeated-measures ANOVA for P2 amplitudes revealed significant main effects of fricative identity (F(1,84) = 9.509, p