Human Brain Responses to Speech Sounds
by
Steven James Aiken
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Institute of Medical Science
University of Toronto
© Copyright by Steven James Aiken 2008
Human Brain Responses to Speech Sounds
Steven James Aiken
Doctor of Philosophy
Institute of Medical Science
University of Toronto
2008
Abstract

Electrophysiologic responses are used to estimate hearing thresholds and fit hearing aids in young infants, but these estimates are not exact. An objective test of speech encoding could be used to validate infant fittings by showing that speech has been registered in the central auditory system. Such a test could also show the effects of auditory processing problems on the neural representation of speech. This thesis describes techniques for recording electrophysiologic responses to natural speech stimuli from the brainstem and auditory cortex. The first technique uses a Fourier analyzer to measure steady-state brainstem responses to periodicities and envelope changes in vowels, and the second uses a windowed cross-correlation procedure to measure cortical responses to the envelopes of sentences.

Two studies were conducted with the Fourier analyzer. The first measured responses to natural vowels with steady and changing fundamentals, and changing formants. Significant responses to the fundamental were detected for all of the vowels, in all of the subjects, in 19–73 s (on average). The second study recorded responses to a vowel fundamental and harmonics. Vowels were presented in opposite polarities to distinguish envelope responses from responses to the spectrum. Significant envelope responses were detected in all subjects at the fundamental. Significant spectral responses were detected in most subjects at harmonics near formant peaks.

The third study used cross-correlation to measure cortical responses to sentences. Significant envelope responses were detected to all sentences, at delays of roughly 180 ms. Responses were localized to the posterior auditory cortices. A model based on a series of overlapping transient responses to envelope changes could also account for the results, suggesting that the cortex either directly follows the speech envelope or consistently reacts to changes in this envelope.

The strengths and weaknesses of both techniques are discussed in relation to their potential clinical applications.
Acknowledgments

I would like to thank my supervisor, Dr. Terry Picton, for his encouragement, patience, and invaluable guidance. This thesis could not have existed without his tireless work. I would also like to thank the members of my supervisory committee, Dr. Bernhard Ross, and Dr. Robert Harrison, for their thoughtful questions and helpful advice.

I am extremely grateful for the help of Patricia Van Roon, who provided generous technical assistance and support. There are many others who have made significant contributions to this thesis, through helpful comments, advice, and technical assistance. They are, in no particular order, Sasha John, Anthony Shahin, David Purcell, Ali Mazaheri, Kelly McDonald, Claude Alain and Kelly Tremblay. There are others who have undoubtedly been missed, for which I apologize. I would also like to thank Alan Moore and Linda Berg for their support and encouragement.

And most importantly, I would like to thank my family. I thank my parents, Al and Lynda, who taught me to believe in myself. I thank my two beautiful daughters, Maya and Asia, who fill every moment of my life with smiles, and have patiently waited for their daddy to finish his “homework.” And I especially thank my beautiful wife and best friend, Jennifer, who has made my life richer than I could have ever imagined.
Table of Contents

Acknowledgments ..... iii
Table of Contents ..... iv
List of Tables ..... ix
List of Figures ..... x
List of Appendices ..... xii
1 Chapter One: Introduction ..... 1
1.1 Hearing Loss in Infancy ..... 1
1.1.1 Infant Hearing Screening ..... 1
1.1.2 Infant Hearing Assessment ..... 3
1.1.3 Infant Hearing Aid Fitting ..... 5
1.1.4 Validating Infant Hearing Aid Fittings ..... 6
1.2 Relating Speech Perception to Auditory Dysfunction ..... 7
1.3 Techniques for Studying Neural Representation of Speech ..... 8
1.4 Purpose of Thesis ..... 9
1.5 Differentiating Speech and Language ..... 10
1.5.1 Evoked Responses to Language ..... 11
1.5.2 Evoked Responses to Speech ..... 12
1.5.3 An Agnostic Approach ..... 13
1.6 A Closer Look at Speech ..... 14
1.6.1 Acoustic Representation of Speech ..... 14
1.6.2 Speech in the Auditory System ..... 15
1.7 Electrophysiologic Responses to Speech ..... 20
1.7.1 Brainstem Responses to Speech ..... 20
1.7.2 Cortical Responses to Speech ..... 22
1.7.3 Prelude ..... 25
2 Envelope Following Responses to Natural Vowels ..... 26
2.1 Abstract ..... 26
2.2 Introduction ..... 27
2.2.1 Acoustic Variability in Natural Speech ..... 28
2.2.2 Analysis of Natural Speech ..... 29
2.2.3 The Fourier Analyzer ..... 31
2.3 Methods ..... 33
2.3.1 Subjects ..... 33
2.3.2 Stimuli ..... 33
2.3.3 Creation of the Reference Sinusoids ..... 36
2.3.4 Creation of the Fourier Analyzer ..... 42
2.3.5 Procedure ..... 43
2.3.6 Analysis ..... 44
2.4 Results ..... 45
2.4.1 Experiment 1: Steady Fundamental Frequencies ..... 45
2.4.2 Experiment 2: Changing Fundamental Frequencies ..... 50
2.4.3 Experiment 3: Changing Vowels ..... 53
2.4.4 Experiment 4: Effects of Bandwidth and Vowel Identity ..... 55
2.4.5 Experiment 5: Envelope or Frequency Following? ..... 58
2.5 Discussion ..... 60
2.5.1 Effects of Vowel Identity ..... 60
2.5.2 Envelope or Frequency Following? ..... 62
2.5.3 Choice of Reference Frequency ..... 64
2.5.4 Limitations ..... 65
2.5.5 Conclusion ..... 66
3 Envelope and Spectral Frequency Following Responses to Vowel Fundamental Frequency, Harmonics and Formants ..... 67
3.1 Abstract ..... 67
3.2 Introduction ..... 68
3.2.1 Responses to the Fundamental ..... 70
3.2.2 Responses to Harmonics ..... 70
3.2.3 Relationship between Harmonics and Formants ..... 72
3.3 Methods ..... 74
3.3.1 Subjects ..... 74
3.3.2 Stimuli ..... 74
3.3.3 Procedure ..... 78
3.3.4 Recordings ..... 79
3.3.5 Analysis ..... 81
3.4 Results ..... 85
3.4.1 Experiment 1: Natural Vowels ..... 85
3.4.2 Experiment 2: Investigating the Sources of the Harmonic Responses ..... 90
3.5 Discussion ..... 93
3.5.1 Envelope FFR and Spectral FFR ..... 93
3.5.2 Contributions from the Cochlear Microphonic ..... 98
3.5.3 Stimulus-Response Relationships ..... 100
3.5.4 Clinical Implications ..... 101
4 Cortical Responses to the Speech Envelope ..... 103
4.1 Abstract ..... 103
4.1.1 Objective ..... 103
4.1.2 Design ..... 103
4.1.3 Results ..... 103
4.1.4 Conclusion ..... 104
4.2 Introduction ..... 105
4.2.1 Brainstem Responses to Speech ..... 105
4.2.2 Cortical Responses to Speech ..... 106
4.3 Methods ..... 110
4.3.1 Subjects ..... 110
4.3.2 Stimuli ..... 110
4.3.3 Speech Envelope ..... 111
4.3.4 Procedure ..... 113
4.3.5 Recordings ..... 113
4.3.6 Source Analysis ..... 114
4.3.7 Cross-Correlations ..... 117
4.3.8 Transient Response Model ..... 121
4.4 Results ..... 125
4.4.1 Scalp and Source Waveforms ..... 125
4.4.2 Cross-Correlations with Sentence Envelope ..... 127
4.4.3 Cross-Correlations with Transient Response Model ..... 132
4.4.4 Envelope and Transient Response Model Comparisons ..... 133
4.5 Discussion ..... 135
4.5.1 Evoked Potentials to Sentences ..... 135
4.5.2 Correlations with the Speech Envelope ..... 135
4.5.3 Correlations with the Transient Response Model ..... 136
4.5.4 The Nature of the Cortical Response to the Speech Envelope ..... 137
4.5.5 Efficiency of Response Detection ..... 139
4.5.6 Summary ..... 141
5 Conclusion ..... 142
5.1 Envelope and Spectral FFR ..... 142
5.2 Cortical Responses to the Speech Envelope ..... 144
5.3 Levels of Validation ..... 144
5.4 Future Directions ..... 145
References ..... 147
Appendix A: Effects of Response Latency ..... 178
Copyright Acknowledgements ..... 182
List of Tables

Table 2.1  Results of Experiment 1 ..... 47
Table 2.2  Results of Experiment 2 & 3 at f1 and f0 reference ..... 51
Table 2.3  Results of Experiment 2 & 3 at f2, f3 ..... 51
Table 2.4  Results of Experiment 4 at f1 reference ..... 57
Table 2.5  Results of Experiment 5 ..... 57
Table 3.1  Frequencies of formants and harmonics ..... 77
Table 3.2  Average response nomenclature ..... 80
Table 4.1  Amplitudes and latencies of onset response ..... 123
Table 4.2  Mean peak correlations and latencies ..... 130
List of Figures

Figure 2.1  LPC spectra of vowel stimuli ..... 35
Figure 2.2  Creation of f1 reference sinusoids ..... 37
Figure 2.3  Creation of f0 reference sinusoids ..... 39
Figure 2.4  Frequency tracks of f1 reference and adjacent frequencies ..... 41
Figure 2.5  Amplitude and phase of single subject response ..... 46
Figure 2.6  Polar plots of subject responses ..... 49
Figure 2.7  Response amplitude vs. frequency vs. time (/ʌui/) ..... 54
Figure 2.8  Response amplitude vs. frequency vs. time (/ʌ/) ..... 59
Figure 2.9  Simulated response amplitude (Appendix A) ..... 181
Figure 3.1  Spectra of vowels ..... 76
Figure 3.2  FFT and Fourier analysis of /a/ ..... 82
Figure 3.3  Grand average responses to /a/ (Exp. 1) ..... 87
Figure 3.4  Percentage of subject responses significant ..... 88
Figure 3.5  Grand average responses to /i/ (Exp. 1) ..... 89
Figure 3.6  Grand average responses to /a/ (Exp. 2) ..... 92
Figure 3.7  Model of envelope and spectral FFR ..... 94
Figure 4.1  Calculation of the log envelope ..... 112
Figure 4.2  Responses and source waveforms ..... 116
Figure 4.3  Cross-correlation procedure ..... 118
Figure 4.4  Average correlogram and transient P1-N1-P2 ..... 120
Figure 4.5  Transient response model ..... 124
Figure 4.6  Average response spectra and spectrum of average ..... 126
Figure 4.7  Average envelope-source correlograms ..... 128
Figure 4.8  Average envelope-source correlations ..... 129
Figure 4.9  Average model-source correlations ..... 134
List of Appendices

Appendix A  Effects of Response Latency ..... 178
1 Chapter One: Introduction
1.1 Hearing Loss in Infancy

Approximately 1-2 in 1000 children are born with a permanent hearing loss that is at least moderate in severity (Davis et al., 1997; Fortnum et al., 2001; National Health and Medical Research Council, 2002). Such hearing losses are associated with poor academic achievement, increased stress, behavioral and social problems, and low self-esteem (Bess et al., 1998; Chia et al., 2007; Järvelin et al., 1997; Lieu, 2004; Priwin et al., 2007; Teasdale and Sorensen, 2007; Wake et al., 2004). However, children with hearing loss who are identified and treated early (< 6 months) tend to have speech and language skills that are similar to those of their hearing peers, as well as improved social and emotional development (Downs and Yoshinaga-Itano, 1999; Markides, 1986; Moeller, 2000; Watkin et al., 2007; Ramkalawan and Davis, 1991; Yoshinaga-Itano, 2003; Yoshinaga-Itano et al., 1998).

This may be partly due to the importance of early hearing for the acquisition of spoken language. Six- to twelve-month-old infants are sensitive to the statistical distribution of speech sounds (phones) in the language(s) that they hear regularly, and they use this information to develop language-specific phonemic inventories and phonotactic expectations (Aslin et al., 1998; Anderson et al., 2003; Jusczyk et al., 1994; Kuhl, 2004; Saffran et al., 1996). They also take advantage of acoustic stress patterns that provide cues for word segmentation (Johnson and Jusczyk, 2001; Saffran and Thiessen, 2003). The development of the ability to understand spoken language is thus rooted in an early exposure to audible speech (Jusczyk, 1997).
1.1.1 Infant Hearing Screening

The importance of hearing in infancy has motivated the development of neonatal hearing screening programs (Hyde, 2005; National Health and Medical Research Council, 2002). These were originally limited to infants who were considered to have a high risk of developing hearing loss (e.g. due to maternal rubella, peri-natal hypoxia, or time spent in a neonatal intensive care unit), but this approach likely missed a large number of infants who had permanent hearing loss (Cone-Wesson et al., 2000; Mauk et al., 1991). It has thus been generally accepted that neonatal
hearing screening programs should include all children (Joint Committee on Infant Hearing, 2007; National Health and Medical Research Council, 2002).

Hearing screening is usually accomplished by recording otoacoustic emissions – tiny sounds produced by the healthy cochlea in response to an auditory stimulus – or the automated auditory brainstem response – an auditory evoked electrophysiologic response generated in the auditory nerve and brainstem (Joint Committee on Infant Hearing, 2007). Otoacoustic emissions arise from cochlear nonlinearities associated with the outer hair cells, and are generally absent in the presence of a moderate cochlear hearing loss (Bray and Kemp, 1987). Since the emissions must pass through the middle ear before they can be measured in the ear canal, they also tend to be absent in the presence of a conductive hearing loss (Owens et al., 1993). However, they are not affected by auditory deficits beyond the cochlea, such as auditory neuropathy (Starr et al., 1996), which may account for a tenth of all permanent childhood hearing losses (Cone-Wesson et al., 2000; Rance et al., 1999). In contrast, the auditory brainstem response is sensitive to both peripheral deficits and deficits in the auditory nerve and brainstem.

The auditory brainstem response is a scalp-recorded transient response to the onset of an auditory stimulus, such as a 100 µs click or 5-cycle tone burst. It is characterized by 7 vertex-positive waves occurring in the first 10-20 ms after stimulus onset, which are labeled with their corresponding Roman numerals (Jewett and Williston, 1971). The first two peaks (waves I and II) are generated in the distal and proximal portions of the auditory nerve, wave III likely arises from sources in the cochlear nucleus (although it may also have contributions from the auditory nerve and the superior olivary complex; Møller and Jannetta, 1983), wave IV is likely generated by sources in the superior olivary complex (Møller et al., 1995), and wave V is likely generated by the lateral lemniscus, as it terminates in the inferior colliculus (review: Møller, 2007). The 6th and 7th peaks likely arise from sources in the inferior colliculus (Møller, 2007), but they are rarely used clinically, since they are more variable in presentation.

For hearing screening, the ABR is usually conducted at a single stimulus level (e.g. 30-35 dB) with an automatic response detection algorithm to assess whether the sound has been registered in the brainstem (e.g. Kileny, 1988). The earliest automatic ABR detection algorithm compared the recorded response to a template, using a statistic weighted heavily for the most stable parts of the ABR (i.e. wave V and the SN10; Kileny, 1987). A more recent approach (Sininger et al.,
2000) is based on the ratio of the variance of the averaged response to the estimated variance of the background noise (i.e. the trial-to-trial variance of the signal at a single point in time), which follows the F distribution (Don et al., 1984). The presence of a valid ABR is established when the ratio (the “Fsp”) is greater than that which would likely have occurred by chance (Elberling and Don, 1984). A variant of this approach estimates the background noise by computing trial-to-trial variance at multiple points (e.g. the “Fmp”; Ozdamar and Delgado, 1996).

Although otoacoustic emission screening protocols have been associated with higher false-positive rates (e.g. 35%; Barker et al., 2000) than automated ABR screening protocols (e.g. 2%; Stewart et al., 2000), a large multi-centre study found that the tests had similar receiver-operating characteristics (i.e. hit rates versus false-positive rates) for detecting hearing loss at 2 and 4 kHz, but that the automated ABR protocol was better at detecting hearing loss at 1 kHz (Norton et al., 2000). Since the automated ABR can detect a wider range of hearing deficits than otoacoustic emissions tests (i.e. conductive, sensory and neural), and has better receiver-operating characteristics than otoacoustic emission tests at some frequencies, it is the preferred screening test for infants who are more likely to have hearing loss (Joint Committee on Infant Hearing, 2007).
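As a rough illustration of the Fsp and Fmp statistics described above, the following Python sketch computes both ratios from a matrix of single-trial sweeps. The function names, the epoch layout, and the use of a simple sample variance are assumptions made for illustration; the published procedures (Elberling and Don, 1984; Ozdamar and Delgado, 1996) differ in details such as how sweeps are blocked and how degrees of freedom are assigned.

```python
import numpy as np

def fsp(epochs, noise_point):
    """Fsp-style detection ratio (in the spirit of Elberling and Don, 1984).

    epochs      : (n_sweeps, n_samples) array of single-trial recordings
    noise_point : index of the single time point used to estimate noise
    """
    n_sweeps = epochs.shape[0]
    average = epochs.mean(axis=0)
    signal_var = np.var(average)  # variance of the averaged waveform over time
    # Trial-to-trial variance at one time point, divided by the number of
    # sweeps, because averaging reduces the noise variance by 1/N.
    noise_var = np.var(epochs[:, noise_point], ddof=1) / n_sweeps
    return signal_var / noise_var

def fmp(epochs, noise_points):
    """Fmp-style variant: noise variance estimated at several time points."""
    n_sweeps = epochs.shape[0]
    signal_var = np.var(epochs.mean(axis=0))
    noise_var = np.var(epochs[:, noise_points], axis=0, ddof=1).mean() / n_sweeps
    return signal_var / noise_var
```

A response would be accepted when the ratio exceeds the critical value of the appropriate F distribution, so that the detection decision carries a known false-positive probability.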
1.1.2 Infant Hearing Assessment

After a hearing loss is detected, hearing thresholds can be determined behaviorally or by recording the ABR to soft sounds. Older infants (6-36 months) can be assessed by visual-reinforcement audiometry, where a loud sound is paired with a visual stimulus to condition a head-turn response. The conditioned response is then used to assess the audibility of softer sounds. However, young infants (< 4-6 months) cannot be easily conditioned. In lieu of a conditioned response, it is possible to assess thresholds by observing an infant’s behavior in response to sound (“behavioral observation audiometry”), but this approach is less reliable and accurate than the ABR (Ruth et al., 1982).

The most recent position statement of the Joint Committee on Infant Hearing (2007) calls for screening to take place within 1 month after birth, followed by a full audiometric assessment (when indicated) within 3 months after birth. The conductive, sensory and neural pathways are
assessed by 1 kHz tympanometry, otoacoustic emissions (transient or distortion-product) and click-evoked ABR, respectively. For establishing hearing thresholds, an ABR is recorded in response to a series of tone bursts decreasing in intensity. Tone bursts are typically 5 cycles in length (2 cycles of onset and offset ramp with a 1-cycle plateau), in order to be abrupt enough to elicit a robust ABR, with reasonable frequency (and place) specificity. Audiometric threshold is established on the basis of the lowest stimulus level at which it is possible to detect a response (Hecox and Galambos, 1974; Stapells, 2000, 2002) – generally wave V, since it has a well-known latency-intensity function that facilitates its detection at levels approximating behavioral threshold (Picton and Durieux-Smith, 1978).

A more recently developed procedure involves recording steady-state responses to amplitude- or frequency-modulated tones (ASSR; Picton et al., 2003; Stueve and O’Rourke, 2003; Luts et al., 2004, 2005). A modulated tone stimulates the basilar membrane at the region most sensitive to its carrier frequency (at low to moderate stimulus levels), since there is no spectral energy at its modulation frequency (Herdman et al., 2002b). However, its modulation frequency is introduced as a distortion component (created in the process of neural transduction), and a scalp-recorded response can be measured at this frequency (Galambos et al., 1981). At high modulation rates (ca. 70-90 Hz), the response is likely generated primarily in the brainstem, and is little affected by sleep or subject state, while at lower modulation rates (e.g. 40 Hz), the response probably has significant contributions from the cortex (Herdman et al., 2002a; Picton et al., 2003).

The ASSR and the ABR produce very similar threshold estimates for stimuli with equal peak-to-peak-equivalent or peak-equivalent sound pressure levels (Stapells et al., 2005; Rance et al., 2006), even though ASSR stimuli can be behaviorally detected at lower levels than the shorter ABR stimuli (Rance et al., 2006). However, both ABR and ASSR threshold estimates predict behavioral thresholds with standard deviations that range from about 5 dB to 15 dB across various studies (see Tables 1 and 2 in Tlumak et al., 2007, and Table 4 in Herdman and Stapells, 2003; Stapells, 1990). Actual behavioral thresholds may thus differ from estimated thresholds by as much as 30 dB (i.e. ±2 standard deviations).
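The rationale for the ASSR stimulus described above can be demonstrated numerically: a sinusoidally amplitude-modulated tone has energy only at the carrier and its two sidebands, so a response at the modulation frequency must be a distortion product introduced by rectifying neural transduction. The sketch below uses assumed values (1000 Hz carrier, 85 Hz modulation) and a crude half-wave rectifier as a stand-in for transduction; it is an illustration, not a model of cochlear processing.

```python
import numpy as np

fs = 32000                      # sample rate (Hz); 1 s of signal gives 1 Hz bins
fc, fm = 1000.0, 85.0           # carrier and modulation frequencies (Hz)
t = np.arange(fs) / fs

# 100% amplitude-modulated tone: components at fc and fc +/- fm, none at fm.
am_tone = (1.0 + np.cos(2 * np.pi * fm * t)) * np.sin(2 * np.pi * fc * t)

freqs = np.fft.rfftfreq(len(t), 1.0 / fs)
spectrum = np.abs(np.fft.rfft(am_tone)) / len(t)
for f in (fm, fc - fm, fc, fc + fm):
    idx = np.argmin(np.abs(freqs - f))
    print(f"stimulus  {f:6.0f} Hz: {spectrum[idx]:.4f}")    # essentially 0 at 85 Hz

# Half-wave rectification introduces a component at the modulation frequency,
# which is what the scalp-recorded ASSR measures.
rectified = np.clip(am_tone, 0.0, None)
rect_spectrum = np.abs(np.fft.rfft(rectified)) / len(t)
idx_fm = np.argmin(np.abs(freqs - fm))
print(f"rectified {fm:6.0f} Hz: {rect_spectrum[idx_fm]:.4f}")  # clearly non-zero
```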
1.1.3 Infant Hearing Aid Fitting

The imprecise relationship between electrophysiologic threshold estimates and behavioral thresholds is problematic when an infant needs to be fitted with hearing aids. Methods for fitting hearing aids on infants (e.g. the Desired Sensation Level Method, Scollie et al., 2005; the NAL-NL1 procedure, Byrne et al., 2001) prescribe gain or output levels on the basis of behavioral thresholds (corrected for individual differences in ear canal acoustics). When estimates of behavioral threshold are incorrect, these methods may not provide optimal – or even audible – levels of speech. Real-ear probe-microphone measurements can be used to verify that hearing aids are set appropriately for estimated hearing losses, but their value is similarly limited by the accuracy of the threshold estimates. Accordingly, hearing aids fitted on the basis of electrophysiologic threshold estimates could be providing too much or too little amplification for many infants.

For adults and older children, it is possible to estimate the impact of a hearing aid fitting on the intelligibility of speech using real-ear measures and the Speech Intelligibility Index (SII; ANSI S3.5 1997). The SII is based on the aided audibility of speech in a number of weighted frequency-specific bands. Although a similar formula could be derived for infants (to estimate the adequacy of the input for the development of speech perception), the variability associated with the threshold estimation would limit its value. A 5 dB shift in hearing thresholds can change the SII by more than 15 points, which could correspond to a 45% change in intelligibility for some speech materials (Sherbecoe and Studebaker, 2003).

This problem is compounded by the lack of any means to validate infant hearing aid fittings. In adults and older children, fittings can be validated by measuring performance on speech tests, or by collecting self-report measures such as the Abbreviated Profile of Hearing Aid Benefit (APHAB; Cox and Alexander, 1995) or the Client Oriented Scale of Improvement (COSI; Dillon et al., 1997). These approaches are obviously inappropriate for young infants, who cannot perform speech tests, or report on whether they are able to hear and understand speech with a hearing aid. What is needed is a means of measuring whether the peripheral representation of aided sound is adequate for the perception of phonetic information.
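The band-audibility logic behind the SII, and its sensitivity to threshold error, can be sketched with a toy calculation. The band levels, thresholds, and importance weights below are invented for illustration, and the formula is deliberately simplified; the ANSI S3.5 (1997) procedure uses standardized bands and weights and also accounts for masking and level distortion.

```python
def simple_sii(speech_peaks_db, thresholds_db, band_importance):
    """Toy SII-like index: importance-weighted audibility across bands (0-1).

    Audibility in each band is the part of an assumed 30 dB speech dynamic
    range that lies above the listener's threshold, scaled to 0-1.
    """
    index = 0.0
    for speech, threshold, weight in zip(speech_peaks_db, thresholds_db,
                                         band_importance):
        audible_db = min(max(speech - threshold, 0.0), 30.0)
        index += weight * (audible_db / 30.0)
    return index

speech_peaks = [55, 60, 58, 50, 45]       # aided speech peaks per band (dB)
weights = [0.15, 0.25, 0.30, 0.20, 0.10]  # invented importance weights (sum to 1)

print(simple_sii(speech_peaks, [40] * 5, weights))  # estimated thresholds: ~0.51
print(simple_sii(speech_peaks, [45] * 5, weights))  # thresholds 5 dB worse: ~0.34
```

In this invented example a uniform 5 dB error in the estimated thresholds changes the index by about 17 points, which is why threshold-estimation variability would limit the value of such a calculation for infants.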
1.1.4 Validating Infant Hearing Aid Fittings

The aided ABR has been proposed as a suitable tool for hearing aid validation (Kileny, 1982; Hecox, 1983; Beauchaine and Gorga, 1988). When comparing aided with unaided conditions, wave V tends to have a shorter latency and its latency-intensity function tends to be more shallow (Hecox, 1983). If an aided ABR were conducted with a speech-level (e.g. 65 dB) stimulus, a normal wave V latency and normal latency-intensity function might suggest that the hearing aid had restored normal loudness for speech sounds. However, wave V latency is only predictive of loudness for relatively flat hearing loss configurations (Serpanos et al., 1997). Moreover, hearing aids respond differently to ongoing sounds (e.g. speech) than they do to the short transient stimuli used to elicit ABRs (Brown et al., 1999; Gorga et al., 1987), so aided ABRs do not provide clear information about speech audibility.

Since the ASSR is recorded to ongoing sounds, it has been proposed as a more appropriate evoked response for assessing hearing aids (Dimitrijevic et al., 2004; Kießling, 1982; Picton et al., 1998, 2001; Stroebel et al., 2007). Aided ASSR threshold estimates provide reliable objective estimates of functional gain (Picton et al., 1998; Stroebel et al., 2007), which could be helpful in lieu of, or as a complement to, other verification techniques (e.g. real-ear probe-tube microphone measures). However, the relationship between aided ASSR threshold estimates and aided behavioral thresholds appears to be just as variable as the relationship between unaided estimated and behavioral thresholds (Picton et al., 1998; Stroebel et al., 2007). Also, aided thresholds only provide information about the audibility of soft sounds, and are not particularly helpful for estimating speech audibility. Hearing aids can employ wide dynamic range compression algorithms to increase gain for softer inputs, or expansion algorithms to decrease gain for softer inputs, so the difference between soft sound audibility and speech audibility will vary across different hearing aids and hearing aid settings.

Given these issues, it might be more expedient to record aided steady-state responses to speech-level inputs. For instance, Dimitrijevic et al. (2004) recorded the ASSR to amplitude and frequency modulated tones (designed to resemble natural speech), from normal hearing and hearing-impaired adults with and without amplification. Responses at the various frequency and amplitude modulation rates were moderately correlated with word-recognition scores, suggesting that this approach might be useful for hearing aid validation.
The validity of the approach is nevertheless limited by differences in the way that hearing aids might respond to speech-like modulated tones and natural speech. Speech audibility is a function of the relationship between hearing aid processing characteristics (e.g. the number of compression channels, the gain function in each channel, compression time constants, and noise-reduction algorithms) and the spectrotemporal distribution of speech sounds (Henning and Bentler, 2005). A stimulus that resembles speech imperfectly will likely be processed differently than natural speech (Stelmachowicz et al., 1996). Moreover, modern hearing aids often incorporate processing algorithms designed to explicitly attenuate non-speech sounds (e.g. Alcántara et al., 2003). Using non-speech stimuli to assess the neural representation of aided speech might therefore produce inaccurate results.

These problems could be avoided by recording steady-state responses directly to natural speech stimuli. An evoked response to an element of natural speech (e.g. the glottal source frequency) would indicate that the element had been neurally encoded, thereby validating the hearing aid fitting for that sound. Also, since speech and non-speech sounds appear to be represented differently in the auditory system (Benson et al., 2006; Galbraith et al., 2004; Shtyrov et al., 2005; Tiitinen et al., 1999), responses to speech stimuli may relate more closely to the neural representations required for spoken language acquisition.
1.2 Relating Speech Perception to Auditory Dysfunction

Objective techniques for assessing the auditory encoding of speech may also help elucidate the relationship between speech perception and specific auditory dysfunctions, including sensory loss, auditory neuropathy (AN), and auditory processing disorder (APD). The key presentation of AN and APD is a deficit in auditory perception that is disproportionate to the degree of peripheral hearing loss, or a perceptual deficit that exists in spite of normal peripheral hearing.

It is estimated that roughly 1 in 10 infants with non-conductive hearing loss have some form of AN (Rance et al., 1999). The diagnosis is made when otoacoustic emissions and the cochlear microphonic are present (indicating healthy outer hair cells), but subcortical neural responses such as the ABR and acoustic reflex are absent or abnormal (Berlin et al., 1993, 2005; Starr et al., 1996). It likely reflects impaired synchrony (“dys-synchrony”) in the auditory nerve –
perhaps due to insufficient myelination (Rance et al., 1999; Starr et al., 1998), or a selective loss of the inner hair cells (Harrison, 1998). The effects of AN on speech perception are not fully understood, but Zeng (1999) has successfully modeled the perceptual deficits in normal hearing individuals by temporally smearing the speech amplitude envelope. Cortical speech-evoked responses might clarify the effects of AN on the neural encoding of speech (Kraus et al., 2000; Rance et al., 2002).

In contrast to AN, the utility of APD as a diagnostic entity is controversial, since the symptoms are poorly defined, and the physiologic underpinnings are unknown (Cacace and McFarland, 2005). Assessments of auditory processing often involve difficult speech processing tasks, such as listening to competing sentences and filtered words (e.g. SCAN-C; Keith, 2000), which appear to be sensitive to language impairments and deficits in attention (Cacace and McFarland, 2005; Rosen, 2005). Individuals with APD often perform poorly on psychophysical tests that do not use speech stimuli (Musiek et al., 2005), so their problems cannot be fully explained by impairments in language processing. Nevertheless, it can be difficult to establish that perceptual deficits are specific to the auditory modality, since language impairments often coexist with auditory processing impairments (Johnson et al., 2005; Rosen, 2005). An objective measure of speech encoding might help to determine the locus of the deficit, relating function to physiology.
1.3 Techniques for Studying Neural Representation of Speech

There are two types of approaches for studying the human nervous system in vivo, where invasive techniques would be inappropriate. The first is to image brain activity by means of changes in the brain's blood flow or metabolism (e.g. functional magnetic resonance imaging or positron emission tomography). This type of approach can provide accurate and detailed information about the localization of neural activity, but cannot precisely specify its timing. The second is to record electrical potentials or magnetic fields that directly reflect neural activity. It is more difficult to localize the sources of potentials and fields recorded at the scalp, but the recorded responses provide detailed information about the timing of the underlying activity.

The latter approach is apposite for studying the neural representation of speech, since speech information is inherently dynamic. Neurons tend to produce large responses synchronized to
sound onsets, offsets and changes, giving rise to electrical potentials and magnetic fields that reflect the encoding of speech sounds as they occur.

Speech sounds are also neurally represented in a more direct way. The auditory system is specialized for the rapid processing of complex temporal information, with many neurons in the auditory nerve and brainstem synchronizing their firing rate to individual cycles of sound waves, as well as to amplitude and frequency modulations. At higher centers in the auditory system (e.g. the auditory cortex), synchronized activity occurs only at slower rates, but responses to stimulus onsets and changes are still precisely timed. These temporally locked responses may be especially useful as indices of neural speech encoding, since they could help to diagnose specific defects in auditory speech encoding at the physiologic level, and not just at the functional level.

There are also several practical advantages to the use of electrical potentials. Equipment and operating costs for recording electrical potentials are much lower than the costs involved in recording magnetic fields or imaging hemodynamic or other metabolic activity in the brain. Electrophysiologic recording equipment is also currently used for neonatal hearing assessment as well as clinical assessment of the auditory system. Consequently, the equipment is widely available, and the recording techniques are familiar to clinicians.
1.4 Purpose of Thesis

The present thesis will provide an overview of techniques used to assess the encoding of speech in the auditory system, and will introduce two new techniques for recording electrophysiologic responses to speech. Both techniques involve recording neural responses that relate to temporal characteristics of speech. The first concerns brainstem responses to the fundamental frequency and harmonics of vowels, and the second concerns cortical responses to the slow temporal amplitude envelope of sentences.

There are several purposes behind this work. The first is to establish techniques that are appropriate for recording responses to natural speech in the brainstem and cortex. Although synthetic speech allows for experimental control of stimulus variation, natural and synthetic speech are not perceptually equivalent (Blomert and Mitterer, 2004; Coady et al., 2007; Schouten and van Hessen, 1992), and there may be differences in the way that these stimuli are processed in hearing aids and other prosthetic devices. The second is to determine the characteristics of the
responses. In particular, the goal is to determine whether responses can be reliably recorded in normal hearing individuals, and to specify the relationship between the stimuli and the responses.
1.5 Differentiating Speech and Language

It is important to note that the focus of this thesis is evoked responses to speech, and not evoked responses to language. Language is a collection of meaningful symbols, coupled with rules for combining those symbols, that provides a rich medium for communication and thought. Speech is simply an acoustic signal that carries linguistic information.

The meaningful symbols of a language are its lexemes, and the rules for combining those symbols are its grammar. Lexemes roughly correspond to individual words, although lexemes (but not words) are independent of inflectional changes. The smallest semantically meaningful units of a language are its morphemes, which can be combined according to morphological rules (part of the grammar) to form words. For instance, the word "balls" is a particular inflectional form of the lexeme "ball", which includes the free morpheme "ball" (meaning "a small spherical object"), and the bound morpheme "s" (meaning "more than one"). Other grammatical rules (e.g. syntactic) specify the phrase-structure of the language.

Morphemes can be mapped onto sets of perceptual units such as graphemes or phonemes, which relate the morphemes to various physical language media, such as text, gestures or speech. For instance, the morpheme "s", meaning "more than one", is related to the printed letter "s" by the grapheme ⟨s⟩. It is also related to a short burst of noise occurring above 4 kHz by the phoneme /s/. The acoustic stimulus that leads to the perception of the /s/ phoneme (i.e. the short burst of noise above 4 kHz) is called a phone, and is written as [s].

Morphemes are only arbitrarily related to particular phonemic or graphemic representations, and do not depend on those representations for their existence. A profoundly deaf person who is literate in English can perceive the written word "laugh" as the grapheme ⟨laugh⟩, and map this to the "laugh" morpheme, without any knowledge of the corresponding phonemic representation /læf/. Similarly, an illiterate hearing person who speaks English can perceive the phones [læf] as the phonemes /læf/, and map this to the "laugh" morpheme, without any knowledge of the corresponding graphemic representation ⟨laugh⟩. There is some evidence that morphemes are encoded with reference to at least one perceptual representation, since deaf and hearing people
(who have different perceptual abilities) show different patterns of brain activation when presented with language in the same medium (e.g. when reading words, hearing individuals show activation of the left angular gyrus and deaf individuals show activation of the right angular gyrus; Neville et al., 1998). Nevertheless, morphemes do not bear any resemblance to their perceptual counterparts (apart from rare exceptions such as onomatopoeic words and iconic signs and pictographs), so the mapping between morphemes and perceptual units is arbitrary.

The arbitrary relationship between phonemes and morphemes appears to place a simple division between speech perception and all other language processing. Speech perception involves the mapping of linguistic (i.e. phonetic) information in the acoustic speech signal onto phonemic representations. The mapping of phonemic representations onto their corresponding morphemes is not part of speech perception, because this mapping operates on already existing perceptual representations. However, morpho-syntactic and semantic constraints guide expectations concerning possible morphemes – and thus phonemes – that might occur at any given time (Hagoort, 2003; Marslen-Wilson and Tyler, 1980). This facilitates perception and helps to resolve ambiguities when the acoustic signal is impoverished (e.g. the phonemic restoration effect; Bashford et al., 1992; Sivonen et al., 2006). Morphemes can therefore also map onto phonemes, eliminating the simple division between speech perception and other language processes (also see Davis and Johnsrude, 2007). Speech perception can only be fully separated from other language processes when phoneme-morpheme mappings are unavailable, such as when listening to an unfamiliar language, or when morpho-syntactic and semantic constraints are unavailable, such as when listening to unrelated words.
1.5.1 Evoked Responses to Language

There are a number of evoked responses that correspond to the processing of language. The N400 is a negative deflection (when recorded with a non-inverting electrode on the central-parietal region of the scalp) that occurs approximately 400 ms after the onset of a word that is not congruent with its immediate semantic context (Kutas and Hillyard, 1980). The elicitation of this potential demonstrates understanding of the semantic context as well as the incongruent word, and thus reflects linguistic processing. There are similar potentials which appear to relate to syntactic processing, such as the P600, and the late anterior negativity, which can be elicited by violations of syntactic rules (Friederici and Kotz, 2003; Osterhout et al., 1994; Osterhout, 1997).
These responses are all dependent on the successful mapping of auditory or visual stimuli onto their corresponding perceptual units and the mapping of perceptual units onto their corresponding morphemes.
1.5.2 Evoked Responses to Speech

Evoked responses to speech correspond to responses to the acoustic speech signal in the auditory system or associated cortex, or to the perceptual representation of speech. Distinguishing acoustic-phonetic representations from phonemic-perceptual representations is not trivial, since the exact boundaries between pre-perceptual and post-perceptual processing in the auditory system are unknown (review: Eggermont, 2001).

The distinguishing feature of perceptual representations is that they are categorical with respect to pre-perceptual information. For instance, the perceptual distinction between the phonemes /p/ and /b/ is that the former is unvoiced (i.e. the vocal cords do not vibrate during consonant production) while the latter is voiced. However, consonants typically precede vowels, which are always voiced (i.e. in non-whispered speech). Thus, vocal cord vibration typically begins shortly after the consonant is released, even for unvoiced sounds such as /p/. The delay between the release of a consonant and the initiation of voicing, called voice-onset time, can vary over a wide range without any change in the perceptual categorization of the sound. However, there tends to be a sharp boundary in the distribution of possible voice-onset times that determines whether a /p/ or /b/ will be perceived. In English, this boundary is approximately 30 ms (Abramson, 1977).

Accordingly, phonemic representations can be distinguished from phonetic representations by comparing responses recorded to stimuli that differ in phonemically relevant and phonemically irrelevant ways (e.g. Horev et al., 2007; Maiste et al., 1995; Phillips et al., 2000). Responses related to phonemic representations should not vary with phonemically irrelevant stimulus changes, but should vary with phonemically relevant stimulus changes. For example, the N1 is likely not related to phonemic representations, since its amplitude and latency can vary with phonemically irrelevant voice-onset time changes (Sharma et al., 2000; Sharma and Dorman, 2000, but see Obleser et al., 2004; Sanders et al., 2002). Apart from a demonstration of sensitivity to only phonemically relevant stimulus changes, it is difficult to relate a speech-evoked response to its perceptual representation.
The mismatch negativity (MMN), N2 and P3 are transient responses elicited by rare stimuli (“deviants”) which differ from more frequently presented “standards” along one or more dimensions. The MMN is largely preattentive, though attention can modulate the amplitude of the response (review: Picton et al., 2000). In contrast, the N2 and P3 are typically absent in non-attentive subjects, unless a salient task-relevant or novel stimulus succeeds in capturing their attention (review: Picton and Hillyard, 1988).

A number of studies have attempted to relate the MMN, N2 and P3 to perceptual representations of speech by comparing responses to deviants differing in phonemically relevant and irrelevant ways (e.g. Maiste et al., 1995; Phillips et al., 2000; Tampas et al., 2005). The N2 and P3 responses are clearly related to perceptual-phonemic representations of speech, since they are larger for phonemically relevant than for phonemically irrelevant deviants (Maiste et al., 1995; Tampas et al., 2005). However, the relationship between the MMN and perception is more controversial. The MMN is not larger for phonemically relevant deviants than for phonemically irrelevant deviants (Maiste et al., 1995; Sharma et al., 1993), but it can be elicited by phonemically relevant deviants which are not deviant in any non-phonemic way (Phillips et al., 2000). It may therefore relate to both acoustic-phonetic and perceptual-phonemic representations (Tampas et al., 2005).
1.5.3 An Agnostic Approach

Any response to speech that cannot be conclusively related to phonemic encoding could be related to an aspect of auditory or cortical processing that is not directly involved in the mapping between speech sounds and phonemic representations. Acoustic speech signals carry both linguistic and extra-linguistic information, such as speaker voice characteristics, sound source location, and characteristics of the acoustic environment. An evoked response to speech could thus be a response to phonetic information in the acoustic signal, the phonological representation of that information, or to non-phonetic aspects of the signal.

It is difficult to make the strong claim that a particular response directly reflects processing involved in the mapping of acoustic speech information onto phonemic representations. However, any response to speech indicates that at least some speech information was encoded in the auditory system, and that some detail is likely available for processing. The value of this agnostic approach can be greatly increased with an understanding of the role of various acoustic details in speech perception, and knowledge about how various acoustic details are encoded in the auditory system. It could be further
enhanced by relating variation in speech-evoked responses to variation in measured speech perception abilities. If a response is reliably related to speech perception performance, its diagnostic value is not reduced by the possibility that the response reflects processes that are not directly involved in phonemic encoding.
1.6 A Closer Look at Speech

The following sections will summarize the ways in which speech information is encoded in the acoustic signal, and the ways in which this information is represented in the auditory system and cortex. These summaries will facilitate the review of current approaches to recording speech-evoked responses.
1.6.1 Acoustic Representation of Speech

Speech is a complex signal comprised of spectrally shaped and temporally modulated harmonic and inharmonic sounds. The harmonic sounds characterize voiced speech, which is produced by the quasi-periodic vibration of the glottal folds. The folds come together much more quickly than they open, producing a sawtooth-shaped waveform with energy at both even and odd harmonics of the glottal pulse rate (Fant, 1970). Variations in the glottal pulse rate and its associated harmonics are used to express linguistic intonation (and lexical identity in tonal languages). For instance, a rising fundamental frequency often denotes a question, whereas a falling fundamental usually denotes a statement (Eady and Cooper, 1986).

The periodicity introduced by the vocal folds, called voicing, is a distinctive feature of “voiced” phonemes (e.g. /d/, /z/), so it is not universally present. A sentence may contain only voiced consonants (“Where were you a year ago?”; /werwəryuæyirægo/), or only unvoiced consonants (“Catch it!”; /kaʧɪt/). However, since (non-whispered) vowels are always voiced, the harmonic source is never completely absent.

The spectrum of the harmonic source decreases by about 12 dB/octave, but as the sound is radiated from the lips, the spectrum is raised by 6 dB/octave. The acoustic spectrum of the voice thus has a 6 dB/octave low-pass characteristic. However, it is rarely static. In ongoing speech, the spectrum of the voice is continuously modified as the primary articulators – the tongue, jaw and lips – alter the resonance characteristics of the vocal tract. The broad vocal tract resonances
are called formants. The positions and trajectories of the lowest two-to-three formants convey phonemically relevant information concerning vowel identity (Hillenbrand et al., 1995; Peterson and Barney, 1952) and consonantal place-of-articulation (Fruchter and Sussman, 1997; Sussman, 1991).

The spectrally shaped harmonic signal is also regularly augmented by turbulent noise, and interrupted by closures of the vocal tract. These noise bursts and closures are called fricatives and plosives, respectively. Fricatives occur in both voiced speech (e.g. /z/, /v/) and voiceless speech (e.g. /s/, /f/). Their acoustic energy derives from the turbulence associated with a narrow constriction of the vocal tract, which does not preclude (or require) glottal source activity. Plosives involve complete closure of the vocal tract, interrupting all air flow – including energy from the glottal source. However, plosives with short interruptions of voicing are nonetheless considered to be voiced (e.g. /b/, /d/, /g/), as discussed in section 1.5.2.

The speech signal is therefore dynamic on multiple levels. The frequencies in the glottal source and its associated harmonics vary simultaneously with changes in the overall spectral shape, and these sounds are regularly interrupted by silent intervals and bursts of noisy energy. The intensity of the glottal source also varies across time, with maxima corresponding to syllabic nuclei for both stressed and unstressed syllables. This produces an overall amplitude envelope modulation at approximately 2-8 Hz (Drullman et al., 1994). For voiced speech, the opening and closing of the glottis also naturally modulates the harmonic energy associated with the voice at the glottal pulse rate, producing a higher-frequency glottal pitch envelope (Aiken and Picton, 2006).
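These two modulation rates can be separated by filtering the amplitude envelope of the waveform. The sketch below is one simple way to do so, using the magnitude of the Hilbert transform as a broadband envelope; the cutoff frequencies and filter orders are assumptions chosen to bracket typical syllable rates (about 2-8 Hz) and glottal pulse rates (roughly 75-300 Hz), and this is not the specific procedure used in the studies reported later in this thesis.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def speech_envelopes(speech, fs):
    """Return (syllabic_envelope, pitch_envelope) for a speech waveform."""
    magnitude = np.abs(hilbert(speech))            # broadband amplitude envelope

    # Syllable-rate envelope: keep modulations below about 8 Hz.
    b, a = butter(2, 8.0 / (fs / 2.0), btype="lowpass")
    syllabic_envelope = filtfilt(b, a, magnitude)

    # Glottal pitch envelope: modulations around typical fundamental frequencies.
    b, a = butter(2, [75.0 / (fs / 2.0), 300.0 / (fs / 2.0)], btype="bandpass")
    pitch_envelope = filtfilt(b, a, magnitude)

    return syllabic_envelope, pitch_envelope

# Example with a synthetic vowel-like signal: a 500 Hz carrier whose amplitude
# pulses at a 100 Hz "glottal" rate and is modulated slowly at 4 Hz.
fs = 8000
t = np.arange(2 * fs) / fs
glottal = (1 + np.cos(2 * np.pi * 100 * t)) / 2
syllable = (1 + np.cos(2 * np.pi * 4 * t)) / 2
vowel_like = syllable * glottal * np.sin(2 * np.pi * 500 * t)
slow_env, pitch_env = speech_envelopes(vowel_like, fs)
```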
1.6.2 Speech in the Auditory System
The low-pass characteristic of the acoustic speech signal is substantially modified by the transmission characteristics of the outer and middle ear. The transfer function of the outer ear peaks at 2-3 kHz, where it provides 15-20 dB gain (relative to sound measured in the same location in a free field), with a secondary peak circa 5 kHz (Shaw, 1966; Wiener and Ross, 1946). The 2-3 kHz peak arises from the resonance of the external auditory canal, and the 5 kHz peak is due to the resonance characteristics of the concha. The middle ear transfer function has a broad maximum from 0.5 to 2 kHz, with about 10 dB less energy transmitted at 0.3-0.4 kHz, and
almost 20 dB less energy at 0.1 kHz (Puria et al., 1997). The combined effect of the outer and middle ear transmission characteristics is to reduce auditory sensitivity below 1 kHz (Harrison, 2007), effectively eliminating the low-pass characteristic of the acoustic speech signal at the level of the cochlea. The encoding of speech in the auditory nerve has been thoroughly studied with steady-state vowels (e.g. Delgutte and Kiang, 1984; Miller et al., 1997; Sachs and Young, 1979; Young and Sachs, 1979). The cochlea is arranged tonotopically, such that apical regions of the basilar membrane respond preferentially to low frequencies, with sensitivity shifting to higher frequencies toward the base. This is partly related to a stiffness gradient that decreases toward the apex, which gives rise to a travelling wave with a frequency-dependent displacement maximum. The speech signal is neurally transduced by inner hair cells distributed throughout the cochlea, which tend to fire (i.e. generate action potentials) at a rate proportional to the displacement of the basilar membrane. The speech spectrum is therefore reflected in a spatial distribution of neural firing rates (Kiang, 1980; Sachs and Young, 1979; Sachs and Young, 1980). The tonotopic arrangement of the cochlea is preserved in the auditory nerve and throughout the auditory system (Harrison et al., 1996, 1998), so this rate-place code is a plausible means of representing spectral details (e.g. formant frequencies) required for speech perception. Early research suggested that vowel spectra were well represented by a rate-place code in the auditory nerve at low and moderate stimulus levels, but not at high levels (e.g. above 70 dB; Sachs and Young, 1979). Since vowel perception is robust at high levels, this suggested that a rate-place code was not sufficient to support speech perception. However, there are several reasons to reject this conclusion (review: Palmer and Shamma, 2004). First, the recordings of Sachs and Young (1979) were made with fibers with low discharge-rate thresholds (high spontaneous firing rates). Recordings with fibers with higher thresholds do encode vowel spectra at high levels (Silkes and Geisler, 1991). Second, the recordings were made with anesthetized cats, which could have reduced efferent activity (e.g. the stapedial reflex) that might otherwise have improved the rate-place representation. Third, the recordings were made with steady vowels, even though vowels are rarely static in real speech (Hillenbrand et al., 1995). Auditory nerve fibers respond to onsets over a wider dynamic range, so a changing vowel might be represented more clearly in the rate-place profile. Fourth, the rate-place response may have been compromised by the characteristics of the cat cochlea, which represents a wider range of
frequencies in a smaller area. Recio and associates (2002) modified a vowel token (/ɛ/) so that its spatial distribution in the cat cochlea would be similar to the distribution of a natural /ɛ/ in a human cochlea. The resulting rate-place profile provided a clear representation of the vowel spectrum at the highest level tested (80 dB SPL). Speech information is also represented temporally. Inner hair cells generate action potentials preferentially during the rarefaction phase of the stimulus (i.e. when the basilar membrane is displaced upwards toward the scala vestibuli; Brugge et al., 1969). Frequencies in the stimulus are thus reflected in the temporal pattern of the neural response. This synchronization, called phase-locking, is another means by which speech details may be encoded (Delgutte, 1980; Kiang, 1980; Young and Sachs, 1979). Inner hair cell receptor potentials do not vary in phase with frequencies above 3-5 kHz (Russell and Sellick, 1977), so phase-locked neural responses are limited to frequencies lower than this (Eggermont, 2001; Harrison, 2007). Nevertheless, this range is sufficient to encode the harmonics of the voice and the first two formants (Peterson and Barney, 1952). The spectral characteristics of a neuron‟s temporal response can be determined by computing the Fourier transform of its period or interval histogram (Delgutte and Kiang, 1984; Sachs and Young, 1980). For the auditory nerve, response spectra show temporal responses at formant peaks (when the formants overlap with harmonics; Young and Sachs, 1979), or at harmonics near formant peaks (Delgutte and Kiang, 1984), with neurons responding best to harmonics close to their characteristic frequencies. Vowel spectra are thus represented in the temporal firing patterns of the auditory nerve. This encoding is robust with respect to level, with increasing synchrony to formant-related harmonics at higher levels (see Figure 9, Sachs and Young, 1980). The synchronized response to upward deflections of the basilar membrane creates a half-wave rectified analogue of the stimulus (Brugge et al., 1969). This introduces energy corresponding to the stimulus amplitude envelope into the temporal pattern of the neural response (it is this explicit encoding of the stimulus envelope that is measured as the auditory steady-state response – see Picton et al., 2003). For a voiced speech signal, this rectification introduces energy corresponding to the glottal pitch envelope (Aiken and Picton, 2006; Dajani et al., 2005a; Delgutte and Kiang, 1984). Thus, although energy at the glottal pulse rate is attenuated by the
transmission characteristics of the outer and middle ear, it is reintroduced by rectification of the stimulus in the process of inner hair cell transduction. Other non-linear distortion products are introduced into the speech signal prior to inner hair cell transduction. Active processes associated with outer hair cell motility (Brownell, 1990) and/or non-linearities related to stereocilia mechanics (Liberman et al., 2004) create intermodulation distortion in response to a stimulus. These distortion products can be measured in the ear canal as otoacoustic emissions, or in the auditory system using near-field (Miller et al., 1997; Sachs and Young, 1980) or far-field electrophysiologic recordings (Purcell et al., 2007). Intermodulation distortion products occur at many difference frequencies (e.g. fb-fa, 2fb-fa, 3fa-fb), but are largest at the 2fa – fb cubic difference frequency. Neural responses at distortion products associated with prominent speech harmonics have been detected in the phase-locked responses of the auditory nerve (Miller et al., 1997; Sachs and Young, 1980) and also in the human frequency-following response (e.g. see Chertoff et al., 1992; Krishnan, 2002; Pandya and Krishnan, 2004; Rickman et al., 1991). The temporal response patterns of auditory nerve fibers thus encode the harmonics of voiced speech, the glottal pitch envelope, and non-linear distortion products arising from interactions of speech harmonics. This representation occurs alongside a rate-place representation of the vowel spectrum. These simultaneous representations of the speech spectrum could support an integrated temporal-place code, where the temporal characteristics of a neuron‟s response are considered in relation to the neuron‟s characteristic frequency (e.g. the average localized synchronized rate; Delgutte, 1984; Young and Sachs, 1979). The plausibility of an exclusively temporal encoding of vowel spectra is limited by several factors. First, temporal response spectra (i.e. Fourier transforms of period and interval histograms) show clear responses at speech harmonics, but not at formant peaks (Delgutte and Kiang, 1984), except when speech harmonics are not present (e.g. in whispered speech; see Voigt et al. 1982). Harmonics are not formants (review: Rosner and Pickering, 1994). Formants reflect the resonant properties of the vocal tract, which are rarely static in natural speech, even for monophthongs (Hillenbrand et al., 1995). Harmonics are distortion products arising from the sawtooth-shaped glottal source wave form, at integer multiples of the glottal pulse rate. Formant resonances affect the amplitude of speech harmonics, but their frequency trajectories are largely
independent. For instance, the sentence “Did you see the cow?” is likely to be spoken with the fundamental frequency and harmonics rising on the final word. However, the word “cow” ends with the diphthong /æu/, which has falling formant frequencies. It is the harmonics (along with the glottal pitch envelope) – characteristics of the voice – which are explicitly represented in the temporal response patterns of the auditory nerve. Nevertheless, harmonics may be used in the estimation of formant positions (e.g. by calculating the average amplitude profile of several harmonics), so temporal representations of formant-related harmonics could play some role in formant perception. Second, temporal response spectra are restricted to lower frequencies at higher levels of the auditory system (Palmer and Shamma, 2004). Phase locking is limited to frequencies below 1 kHz in the inferior colliculus of the guinea pig (Liu et al., 2006) and cat (Langner and Schreiner, 1988) and to 250 Hz in the auditory cortex of the guinea pig (Wallace et al., 2002). Human phase-locking abilities appear to be similar, since sound localization tasks (presumably related to brainstem processing) which depend on temporal encoding are limited to about 1300 Hz (Zwislocki and Feldman, 1956). This is insufficient to represent the second formant of many vowels. Therefore, although formant information is indirectly represented by temporal responses to speech harmonics in the auditory nerve, formant information must ultimately be encoded nontemporally (e.g. via a rate-place code) in the brainstem. A plausible neural substrate for a rate-place encoding of vowel spectra has been found in the stellate cells of the posteroventral division of the cochlear nucleus. They provide a robust leveltolerant rate-place representation of vowel spectra (Blackburn and Sachs, 1990; May et al., 1998), with a temporal firing pattern synchronized to the glottal pitch envelope (Kielson et al., 1997). Temporal responses may contribute to vowel perception at the level of the auditory nerve, but this is less likely in the brainstem. Nevertheless, the temporal encoding of voicing-related information persists in the brainstem, up to approximately 1500 Hz. Since much of the processing in the brainstem is involved in sound-localization, this temporal representation might be important for distinguishing and localizing voices in auditory space (Eggermont, 2001). Phase-locking in the cortex is limited to a few hundred Hz, so the cortex does not have sufficient bandwidth to temporally encode most speech sounds. Spectral details, such as voice pitch,
formant frequencies, and formant frequency trajectory, appear to be coded in topographic maps (Steinschneider et al., 1990, 1995). However, the cortex synchronizes its response to stimulus onsets, changes and low frequency amplitude modulation (Eggermont, 1995; Steinschneider et al., 1994). Dynamic speech cues are thus likely represented by synchronous activity in multiple topographic maps (review: Eggermont, 2001).
1.7 Electrophysiologic Responses to Speech
The profusion of neural responses synchronized to dynamic speech characteristics bodes well for the development of electrophysiologic indices of neural speech encoding. A number of these temporally locked responses have already been explored, and will be reviewed in the present section. Responses can be broadly categorized as either transient or steady-state. Transient speech-evoked responses reflect neural activity that occurs in response to speech events (e.g. sound onsets, offsets and changes). However, these responses do not bear any resemblance to the stimulus. For instance, the transient brainstem response (ABR) to a 4-6 kHz burst of frication noise (/s/) would likely have most of its energy near 500 Hz (reflecting the approximate 2 ms spacing between the three most prominent peaks of the ABR). In contrast, steady-state responses reflect activity that is time-locked to periodic stimulus components or modulations, and tend to resemble the stimulus elements to which they are locked.
1.7.1 Brainstem Responses to Speech
1.7.1.1 Transient Responses
Transient brainstem responses (ABRs) to speech are usually recorded with a consonant-vowel diphone stimulus (e.g. /dɑ/; Banai et al., 2007; Cunningham et al., 2001). The short-latency activity following the onset of the /dɑ/ begins with the standard peaks of the ABR (I-V; Jewett and Williston, 1971), followed by a sharp negativity at 7-8 ms (wave 'A'). A steady-state response can be measured from the onset of the voicing (wave 'C'), through three negative troughs (waves 'D-F'), followed by an offset response (wave 'O'; see Figure 1, Banai et al.,
2007). These responses have been used to investigate the brainstem encoding of speech in children with language-based learning problems (Banai et al., 2005; Cunningham et al., 2001; King et al., 2002; Johnson et al., 2007; Wible et al., 2004). A subset of these individuals shows abnormal responses to this stimulus, with delayed waves V, A, C, and F, and shallower V-A slopes. In contrast, click-evoked responses tend to be normal or only very slightly delayed in this population (McAnally and Stein, 1997; Purdy et al., 2002; Song et al., 2006). Auditory processing problems have been cited as possible causes of language disorders such as dyslexia and specific language impairment (e.g. Tallal, 1980; Tallal et al., 1993), but this is a controversial claim (review: Ramus, 2003). It is possible that auditory processing problems lead to language problems in some individuals, or it may be that people with language problems cannot compensate for auditory difficulties which would otherwise go unnoticed. An auditory processing problem that makes speech perception difficult is nonetheless an important audiologic concern. Unfortunately, the most popular test of auditory processing uses only word and sentence stimuli (e.g. the SCAN-C; Keith, 2000), thereby confounding linguistic and auditory processing (Rosen, 2005). An abnormal response to speech from the brainstem could unequivocally establish the auditory nature of the deficit. Transient brainstem responses to speech sounds have not been evaluated in children or adults wearing hearing aids, although it would be possible to record such responses to speech components with relatively rapid onsets. For instance, one might calculate an average response to an /s/ sound embedded in words or sentences. A clear /s/-evoked ABR would indicate that the sound had been registered in the nervous system, and that the hearing aid was providing sufficient gain in the corresponding spectral region.
1.7.1.2 Steady-State Responses
The stimulus-response resemblance associated with steady-state responses could be exploited to gain information about the neural encoding of speech sounds that is not provided by transient responses. While a transient response could show that a stimulus component was registered in the neural system, a steady state response could provide information about how the stimulus component was encoded. For example, models of auditory nerve discharge patterns (i.e. Fourier
transforms of simulated post-stimulus time histograms) have been used to evaluate the effects of hearing aid processing schemes on the neural representation of sound (Bruce, 2004). Since the auditory nerve produces potentials that can easily be recorded at the scalp (see section 1.1.1), it may be possible to measure these discharge patterns directly (e.g. Galbraith et al., 2000). Steady-state responses can also provide information about the temporal encoding of speech sounds in the higher brainstem (Herdman et al., 2002a; Smith et al., 1975). Chapter 2 will review studies which have recorded steady-state responses to speech, and will introduce an approach for recording steady-state responses to natural vowels with a Fourier analyzer. The experiments that are described in Chapter 2 tested the ability of the Fourier analyzer to measure responses to the speech fundamental (i.e. the glottal pitch envelope) in vowels with steady and changing pitch, as well as with a multi-vowel stimulus. Chapter 3 will describe a series of experiments that applied the Fourier analyzer technique to analyze speech responses to stimulus components across a wider range of frequencies (e.g. the fundamental and higher speech harmonics). The experiments in Chapter 3 also attempted to determine whether the responses were related to the speech envelope, or to spectral components in the speech signal.
1.7.2 Cortical Responses to Speech
Steady-state responses from the brainstem can be used to show that speech sounds are registered in the auditory system, and may be informative with respect to brainstem auditory processing, but this approach has several limitations. Steady-state responses reflect neural activity that is synchronized (i.e. phase-locked) to the speech stimulus or a related distortion product (e.g. the envelope or the cubic distortion product). As discussed in section 1.6.2, phase-locking is limited to approximately 1500 Hz in the brainstem. However, much of the information in speech (e.g. the second formant) is at frequencies above 1500 Hz (Studebaker and Sherbecoe, 2002). Brainstem steady-state responses thus cannot be recorded to many of the important components of speech. Hearing loss tends to correlate positively with frequency, so a means of assessing the neural encoding of high-frequency speech information would be particularly valuable. This problem could be addressed by recording cortical responses to speech. The cortex produces responses that are synchronous with onsets, offsets and other acoustic changes (see section
1.6.2). Transient cortical responses could thus be measured to changes in both high and low frequency speech components, and could be used to infer their audibility. It may also be possible to record steady-state responses to the natural low-frequency (3-7 Hz) speech envelope. These approaches will be discussed in the following section.
1.7.2.1 Transient Responses
The mismatch negativity is a transient response that is directly related to auditory discrimination. Although it would be possible to use this response to validate hearing aid fittings, or to assess the effects of auditory processing problems on speech encoding, the response is often difficult to elicit in individual subjects. This is likely because it is a difference measure based on the response to a low-probability “deviant” stimulus, which is noisier since it includes fewer averages (review: Picton, 2000). Attention has therefore been focused on the P1-N1-P2 complex, which can be reliably elicited in individual subjects (Tremblay et al., 2003). The P1-N1-P2 complex is an obligatory response that is sensitive to the detection of onsets, offsets and acoustic changes in speech (Agung et al., 2006; Kelly et al., 2005; Martin et al., 2007; Ostroff et al., 1998; Tremblay et al., 2006a, 2006b). It is at least partly generated in the superior temporal lobe, in the vicinity of the auditory cortex. The P1 likely reflects post-synaptic activity in the primary auditory cortex (Wood and Woolpaw, 1982), although it may also have contributions from adjacent areas (e.g. the planum temporale; Liégeois-Chauvel et al., 1999). It tends to be smaller than the N1 and P2 in adults, although it is often the largest peak in young children (Näätänen and Picton, 1987). The N1 has multiple generators, including an N1b component that is at least partly generated in the superior temporal lobe near the primary auditory cortex, and a T-complex that originates in secondary auditory cortex (review: Näätänen and Picton, 1987). The P2 also likely has multiple generators, including one in the primary auditory cortex (Scherg et al., 1989) and one in the secondary auditory cortex (Hari et al., 1987). Given its sensitivity to the detection of acoustic changes, the speech-evoked P1-N1-P2 might be suitable for the validation of hearing aids in infants (Golding et al., 2007; Korczak et al., 2005; Rance et al., 2002; Tremblay et al., 2006ab; but see Billings et al., 2007). An interesting feature of the P1-N1-P2 complex is that it appears to be sensitive to both auditory processing difficulties
(Hayes et al., 2003), and changes in speech sound processing after training. For instance, when subjects were presented with a string of nonsense words, the N1 to word onset was larger after the nonsense words were learned (Sanders et al., 2002). Similarly, after training subjects to distinguish a phonemically irrelevant phonetic difference (i.e. pre-voiced vs. voiced sounds for speakers of English), the N1-P2 response amplitude to that difference was enhanced (largely because of an increased P2; Tremblay and Kraus, 2002). The response might therefore be useful for assessing the development of auditory processing skills after provision of a hearing aid (Cone-Wesson and Wunderlich, 2003). Whereas a brainstem response likely reflects the simple encoding of acoustic speech information in the neural system, a cortical response might reflect the brain's ability to process the information.
1.7.2.2 Steady-State Responses
Aided cortical transient responses to acoustic changes may be difficult to record to natural speech, since speech is simultaneously dynamic on multiple levels. The shape of the vocal tract is not held constant while the glottal pulse rate changes – it varies together with changes in the frequency of the voice, and even changes in the presence of voicing (see sections 1.1.6; 1.6.2). The relationship between a particular change in the speech signal and a cortical transient response might be obscured or distorted by responses to other changes occurring at the same (or nearly the same) time. This ambiguity could be resolved by recording steady-state responses to patterns of acoustic change in speech. A correspondence between a steady-state response and a pattern in the stimulus would indicate that the pattern had been neurally encoded. Chapter 4 will report on a study that recorded evoked potentials to the slow temporal amplitude envelopes of sentences. This technique used a windowed cross-correlation procedure to relate cortical responses to the speech envelope as it changed in time. The purposes of this study were to determine whether cortical envelope-following responses could be reliably recorded to a number of different sentences in individual subjects, and to relate these responses to the transient P1-N1-P2 responses that are more commonly recorded.
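The logic of this windowed cross-correlation approach can be sketched with a simple simulation. The following MATLAB fragment is only an illustration with made-up signals and parameter values (the envelope, sampling rate and 180 ms delay are assumptions for the example, not the Chapter 4 implementation): it cross-correlates a simulated response (a delayed copy of a slow envelope plus noise) with the envelope itself and recovers the delay.

    fs   = 250;                                  % assumed analysis sampling rate (Hz)
    t    = (0:5*fs-1)'/fs;                       % five seconds of simulated "sentence"
    env  = 0.5 + 0.5*sin(2*pi*4*t);              % stand-in for a 4 Hz speech envelope
    d    = round(0.180*fs);                      % simulated 180 ms cortical delay
    resp = [zeros(d,1); env(1:end-d)] + 0.5*randn(size(env));  % delayed envelope plus noise
    [r, lags] = xcorr(resp - mean(resp), env - mean(env), fs, 'coeff');  % lags up to +/- 1 s
    [~, imax] = max(r);
    fprintf('Peak correlation at a lag of %.0f ms\n', 1000*lags(imax)/fs);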
1.7.3 Prelude
The following three chapters will describe electrophysiologic methods for objectively assessing the neural encoding of speech in the brainstem (chapters 2 and 3) and the cortex (chapter 4). All of the methods are based on steady-state responses to speech components (e.g. the glottal pulse envelope and the slow temporal modulation envelope) and use analysis techniques that are optimized to detect responses to those speech components. The goal of these studies was not to provide a full, unbiased description of the characteristics of the auditory electrophysiologic response to speech, but to see if components of the acoustic speech signal were reflected in the evoked response, and therefore registered in the auditory nervous system.
2 Envelope Following Responses to Natural Vowels
2.1 Abstract
Envelope following responses to natural vowels were recorded in ten normal hearing people. Responses were recorded to individual vowels (/ɑ/, /i/, /u/) with a relatively steady pitch, to /ʌ/ with a variable and steady pitch, and to a multi-vowel stimulus (/ʌui/) with steady pitch. Responses were analyzed using a Fourier analyzer, so that recorded responses could follow the changes in the pitch. Significant responses were detected for all subjects to /ɑ/, /i/ and /u/ with the time required to detect a significant response ranging from 6 seconds to 66 seconds (average time: 19 seconds). Responses to /ʌ/ and /ʌui/ were detected in all subjects, but took longer to demonstrate (average time: 73 seconds). These results support the use of a Fourier analyzer to measure envelope following responses to natural speech.
2.2 Introduction
Since they require subjective judgment and response, most audiologic tests cannot be used in infants. This is unfortunate since we would like to ensure that hearing is normal in the first few years of life so that speech and language can develop properly. However, infant hearing can be assessed using physiological measurements, which are considered objective since they do not require any subjective response from the subjects. Pure tone thresholds can be estimated by recording the auditory brainstem response (ABR) to brief tones (Stapells, 2000, 2002) or the more recently developed auditory steady state responses (ASSR; Picton et al., 2003; Stueve and O'Rourke, 2003; Luts et al., 2004, 2005). When a hearing loss is diagnosed, these threshold estimates can be used in conjunction with real-ear probe measurements (Byrne et al., 2001; Seewald and Scollie, 2003) to select and fit hearing aids. By making speech audible, such early intervention can significantly reduce the impact of hearing loss on speech and language development. Despite its importance, providing audible speech information does not guarantee its successful use. The amplified speech must also be intelligible, and the impaired auditory system must be capable of processing the information. Probe microphone measures can verify speech audibility (Scollie and Seewald, 2002) provided that the estimated thresholds accurately reflect the underlying hearing deficit, but these measurements do not evaluate the effects of the hearing aid on speech intelligibility or the effects of suprathreshold deficits in the auditory system. Distortion can occur in the hearing aid or in the pathological auditory system. An objective test of speech discrimination would be very helpful in fitting aids to young infants and monitoring their effect. Brainstem evoked potentials – either transient or steady-state responses – may provide a means of objectively assessing that speech is being discriminated. Although much of the neural processing underlying speech perception takes place more rostrally, brainstem measurements could verify that speech has been adequately registered in the nervous system. Moreover, there is evidence that some suprathreshold deficits (e.g. difficulty hearing in noise) may relate to abnormalities of speech-evoked transient ABRs (Cunningham et al., 2000; King et al., 2002; Russo et al., 2004). Steady-state brainstem potentials can be reliably measured from very young children, and can be reliably recorded during sleep (Cohen et al., 1991; Lins and Picton, 1995). Several studies have demonstrated that these can be recorded using speech and speech-like
stimuli (e.g. Krishnan, 2002; Cunningham et al., 2001; Dajani et al., 2005a; Dimitrijevic et al., 2004; Galbraith et al., 1997, 1998, 2004; King et al., 2002; Krishnan et al., 2004).
2.2.1 Acoustic Variability in Natural Speech
Speech is far more complex than the simple stimuli normally used to evoke brainstem responses – brief clicks or tones for the transient ABR and modulated tones for the ASSR. In voiced speech, a periodic source signal is generated by the vocal folds. The harmonics of this fundamental signal are continuously modified in intensity as the primary articulators – the tongue, jaw and lips – alter the resonance characteristics of the vocal tract. The harmonics are also regularly interrupted during unvoiced phonetic segments. The ebb and flow between intense harmonic portions and silent or noisy intervals determines the syllabic rate. This translates into an overall amplitude modulation rate of roughly 2-8 Hz (Drullman et al., 1994). This is further compounded by variations in the frequency of the fundamental itself, which are the acoustic manifestations of linguistic intonation (and of lexical identity in tonal languages). The spectral and temporal variability of natural speech makes it an awkward stimulus for eliciting predictable electrophysiologic responses. Nevertheless, speech is similar to the stimuli used in ASSR response measurement. These stimuli are usually amplitude or frequency modulated carrier tones. The carrier frequencies determine the regions of the basilar membrane that mediate the evoked response, and the evoked response is measured at the frequency of the modulation envelope. Speech harmonics are similar in that they are naturally amplitude modulated at a fundamental frequency by the opening and closing of the glottis. Additionally, ASSR carrier frequencies are usually selected to be in the range of the harmonics of speech (500-4000 Hz). Although the amplitude of any particular speech harmonic may vary a great deal due to formant movement (e.g. when a harmonic is not part of a formant resonance, it may be inaudible), all of the harmonics in the speech signal are amplitude-modulated at the fundamental frequency, and the fundamental is therefore consistently present in the voiced speech envelope. Neurophysiologic studies have found a robust representation of the fundamental frequency envelope at the level of the auditory nerve (Cariani and Delgutte, 1996; Schilling et al., 1998),
that appears to be enhanced at the level of the cochlear nucleus (Frisina et al., 1990; Kim et al., 1990; Rhode and Greenberg, 1994; Rhode, 1994, 1998) and inferior colliculus, the likely source for envelope following responses measured at the vertex (Smith et al., 1975). Thus, while it might be difficult to measure frequency following responses to individual speech harmonics, it should be relatively easy to measure an envelope following response at the fundamental frequency.
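This point can be illustrated with a small simulation (a minimal MATLAB sketch with arbitrary values, not taken from the thesis analyses): half-wave rectifying two adjacent harmonics reintroduces energy at the fundamental frequency, even though the input contains no energy at that frequency.

    fs = 32000; t = (0:fs-1)'/fs;               % one second at an assumed 32000 Hz sampling rate
    f0 = 100;                                   % assumed glottal pulse rate (Hz)
    x  = sin(2*pi*9*f0*t) + sin(2*pi*10*f0*t);  % two adjacent harmonics (900 and 1000 Hz); no energy at f0
    y  = max(x, 0);                             % half-wave rectification (crude stand-in for hair cell transduction)
    Y  = abs(fft(y))/numel(y);                  % amplitude spectrum with 1 Hz bins
    fprintf('Amplitude at %d Hz after rectification: %.3f\n', f0, 2*Y(f0+1));
    % A clear component appears at f0, the rate at which the two harmonics beat
    % against each other, mirroring the envelope energy described above.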
2.2.2 Analysis of Natural Speech
An important difference between the modulations used in traditional ASSR measurement and voice pitch modulations in the speech envelope is that the latter vary continuously in frequency. Steady state responses are typically transformed into the frequency domain via a Fast Fourier transform (FFT), such that the response energy at the modulation frequency can be precisely determined (see Picton et al, 2003), but these transforms are designed to detect responses at a stable frequency and do not work efficiently for responses that vary in frequency (Dajani, 2005a). The fundamental frequency of the voice can vary at rates of up to several hundred Hz per second. If a one second window is used to measure the response to a voice frequency that is increasing at 10 Hz per second, the resolution of the spectrum is 1 Hz (the reciprocal of the duration of the time signal submitted to the Fourier transform). Since the voice frequency is increasing at a rate of 10 Hz per second, the response will register energy in 10 analysis bins (the original frequency and the frequencies that it passes through during the one-second recording window) but the response will only have been in each bin for 0.1 second, and the energy in each bin will therefore be one tenth the total response energy. Even if the 10 bins are summed to calculate the total response energy, the electrophysiologic noise in each of the bins will also be summed, thereby greatly decreasing the ratio of response energy to noise energy. One way to eliminate the problem is to use synthetic speech, where the fundamental frequency can be held completely constant. Krishnan (2002) recorded frequency following responses to synthetic vowels with steady fundamental frequencies, and analyzed the result using an FFT. Responses were detected at the fundamental and at various harmonics, with larger response amplitude at harmonics near formant peaks. Synthetic speech has also been used to evoke transient ABRs. In a series of studies concerning children with learning problems (Cunningham
et al., 2000; King et al., 2002), ABRs were recorded to a synthetic /da/ stimulus. Cunningham et al. (2001) also recorded frequency following responses (FFRs), which were then analyzed using an FFT. Another option is to create modulated tones that closely resemble certain speech parameters. Dimitrijevic et al. (2004) created stimuli with independent amplitude and frequency modulation (IAFM) that were similar to speech with respect to spectral shape, carrier frequencies, modulation frequencies, and the depths of amplitude- and frequency-modulation. They found a significant correlation between the number of these steady state responses that were detected and word recognition scores. In spite of the benefits associated with using synthetic speech and speech-like modulated tones, natural speech would be preferable for a test designed to assess the transmission of speech information to the brainstem during a hearing aid fitting. Hearing aids respond differently to speech than they do to other sounds (Scollie and Seewald, 2002), and many are designed to specifically amplify speech and attenuate non-speech sounds. The auditory system is also highly non-linear, and may respond differently to speech than to non speech sounds (possibly even in the auditory brainstem – see Galbraith et al., 2004). The FFT has been used successfully to analyze electrophysiologic responses evoked by natural speech. Galbraith and colleagues (1997; 1998) used FFTs to analyze FFRs evoked by naturally produced vowels. These studies were conducted to determine whether various parameters, such as attention, could affect the amplitude of the FFR. The amplitude of the FFR significantly varied as a function of the experimental treatments. While this provides some support for the use of the FFT with natural speech, these studies were only concerned with relative response amplitudes. Although the FFT does not provide an optimal estimate of response energy for changing frequency components, these relative changes would still be valid. Studies concerned with absolute response amplitudes have generally avoided using the FFT to analyze brainstem responses evoked with natural speech. Krishnan and colleagues (2004) recorded frequency following responses to speech sounds with varying fundamental frequencies, by using a short-term autocorrelation algorithm. This provided estimates of primary pitch period and pitch strength for the speech stimulus and the electrophysiologic response. Results showed that the frequency following response closely tracked the pitch of the speech stimulus. Dajani
and colleagues (2005a) analyzed pitch-variant speech evoked responses using a novel filterbank-based approach, inspired by the physiology of the cochlea (Dajani et al., 2005b). This involved filtering the evoked response into a large number of overlapping bands, and then determining the band with the highest response energy on an ongoing basis. These results also supported a robust representation of stimulus pitch period in the evoked response. Although these techniques have been used successfully to measure non-stationary responses, they are not without limitations. Short-term autocorrelation techniques assume the signal to be stationary over the analysis window, which must be at least twice as long as the period of the lowest frequency to be detected (e.g. in order to measure a 100 Hz tone, the analysis window must be 20 ms long) and may be inaccurate if the signal varies within this window. While the filterbank approach of Dajani and colleagues (2005b) allows for precise measurement of non-stationary signals in frequency and time, it is computationally very demanding.
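Returning to the bin-smearing problem raised at the beginning of this section, the effect is easy to demonstrate with a short simulation (a minimal MATLAB sketch with arbitrary values, not part of the original analyses): a fundamental that rises by 10 Hz across a one-second window spreads its energy over roughly ten 1 Hz bins, so no single FFT bin captures more than a fraction of it.

    fs = 1000; t = (0:fs-1)'/fs;                % a one-second analysis window
    steady = sin(2*pi*120*t);                   % steady 120 Hz "fundamental"
    rising = sin(2*pi*(120*t + 5*t.^2));        % instantaneous frequency 120 + 10t Hz (rising 10 Hz/s)
    S = abs(fft(steady))/numel(t);              % spectra with 1 Hz resolution
    R = abs(fft(rising))/numel(t);
    fprintf('Peak bin amplitude: steady %.3f, rising %.3f\n', max(S), max(R));
    % The rising tone's energy is smeared across about ten adjacent bins, while
    % the steady tone's energy falls almost entirely into a single bin.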
2.2.3 The Fourier Analyzer
If the frequency trajectory of the response can be predicted in advance (e.g. if the response is expected to follow the pitch of the voice), a better alternative is to use a Fourier analyzer (see Regan, 1989). In contrast to the DFT and FFT, which analyze the signal over a spectrum of stationary frequencies, a Fourier analyzer relates the recorded response to a reference frequency that need not be stationary. The analyzer can therefore follow frequency variations that occur within a single analysis window. If the response follows the reference frequency, the relationship between the response and the reference does not change, so the measured response amplitude is not affected by the frequency variation. This technique thus provides an accurate measure of non-stationary responses. The present study investigated the possibility of measuring the brain's response to vowel sounds with a digitally implemented Fourier analyzer. Responses were recorded to vowels in five separate experiments. In the first experiment, responses were recorded to three vowels (/i/ as in 'beet', /ɑ/ as in 'bought', and /u/ as in 'boot') which were produced naturally with no intentional variation of fundamental frequency. The goal of this experiment was to test the analyzer in a relatively simple condition, and ensure that the responses could be consistently recorded to the
different vowels. In the second experiment, responses were recorded to an /ʌ/ (as in „but‟) vowel with a changing fundamental frequency, and to two stable portions of the same vowel, in order to verify the ability of the analyzer to measure non-stationary responses. In the third experiment, responses were recorded to a stimulus composed of three contiguous vowels (/ʌ/, /u/, and /i/), at approximately the same fundamental frequency. This assessed the performance of the analyzer with a relatively steady voice pitch but varying formant resonances. A fourth experiment investigated the effects of stimulus bandwidth and vowel identity on response amplitude, in order to understand the differences in response amplitude found in the first three experiments. Since the responses are expected to vary in frequency, they are not actually “steady state” and will be referred to as “envelope following responses” (Dolphin, 1997; Purcell et al, 2004), instead of “auditory steady state responses.” Electrophysiologic responses following the voice pitch have often been called “frequency following responses” (e.g. Krishnan et al, 2004), but this implies that the responses follow the fundamental component (i.e. the first harmonic). It is more likely that the responses follow the envelope of the speech (in which the fundamental is well represented) rather than the first harmonic. The envelope modulates the acoustic energy at the formant frequencies (which are much more easily heard than the first harmonic). However, since the first harmonic was present in all stimuli used in experiments 1-4, a fifth experiment was conducted to determine the effect of removing the first harmonic. In this experiment, the changing fundamental /ʌ/ was presented either with no first harmonic (i.e. no frequency to follow) or with only the first harmonic (i.e. no envelope to follow).
2.3 Methods
2.3.1 Subjects
Five women (ages 21-30) and five men (ages 27-38) were recruited internally at the Rotman Research Institute for the first three experiments. Seven women (ages 21-29) and one man (age 39) were similarly recruited for the fourth and fifth experiments. Three subjects participated in all five experiments. Hearing thresholds were obtained for each subject using a GSI 16 audiometer and EAR-Tone 3A insert earphones (re ANSI S3.6 1989). All subjects had hearing thresholds that were 15 dB HL or better at octave frequencies from 250 to 8000 Hz.
2.3.2 Stimuli
For the first experiment, three vowels were recorded in a double-walled sound-attenuating chamber, using a resolution of 16 bits and a sampling rate of 32000 Hz. These vowels were /ɑ/, /i/, and /u/, as produced naturally by the first author, over durations of approximately 4 s. A 1.5-second sample of each vowel was extracted from these recordings using MATLAB (Mathworks, Natick, MA). The sample had to be exactly 1.5 s in length, with no large variations in amplitude, so that the end of each sample could blend seamlessly with the beginning of each subsequent sample. Zero crossings at the beginning of the primary pitch period were identified at two locations separated by approximately 1.5 s. Because the fundamental frequency was always about 115 Hz, a pitch period zero-crossing would always occur within 8.7 ms (278 samples) of 1.5 s. The signal was cut at this point, and then resampled to give exactly 48000 samples for that period of time. These stimuli were then treated as if they had been sampled at 32000 Hz. This resulted in a negligible shift in the average pitch of the voicing of less than ±1 Hz. Stimuli were bandpass filtered between 10 and 4000 Hz using a 1000-point finite impulse response filter with zero phase distortion. Note that all filter cutoff frequencies are listed at the 6-dB down point. Each stimulus was scaled to have equal RMS amplitude. The stimuli differed with respect to peak-to-peak amplitude, which was +4.1 dB (relative to the RMS level) for /ɑ/, +7.5 dB for /i/, and +2.6 dB for /u/.
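The filtering and scaling steps might be implemented roughly as follows (a minimal MATLAB sketch under the stated assumptions; filtfilt is used here to obtain zero phase distortion, and a noise placeholder stands in for the recorded vowel).

    fs    = 32000;
    vowel = randn(48000, 1);                        % placeholder for a 1.5 s vowel recording
    b     = fir1(1000, [10 4000]/(fs/2));           % FIR bandpass approximating the 1000-point filter described above
    vowel = filtfilt(b, 1, vowel);                  % forward-backward filtering: zero phase distortion
    vowel = vowel / sqrt(mean(vowel.^2));           % scale to a common (here, unit) RMS amplitude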
For the second experiment, we recorded an /ʌ/ where the fundamental frequency was periodically lowered (and then re-raised) by 20 Hz. The frequency thus varied between about 120 and 140 Hz. These changes were rapid (approximately 200 Hz/s) and were interposed by 0.5- to 1-second portions of relatively little change in fundamental frequency. A 1.5-second sample was extracted by finding zero crossings corresponding to the primary pitch period separated by approximately 1.5 seconds. The chosen sample included one negative and one positive 20 Hz shift in fundamental frequency. Two additional samples of approximately 750 ms were created by extracting a relatively steady high frequency portion and a relatively steady low frequency portion of the same recording. Each of these samples was then concatenated to be approximately 1.5 s in length. There were thus three versions: changing fundamental ('high-low'), high steady fundamental ('high') and low steady fundamental ('low'). Each of the three was then resampled to be exactly 1.5 s in length, bandpass filtered (10-6000 Hz), and scaled to have equal RMS amplitude. Maximum peak-to-peak amplitudes (relative to RMS levels) were +5.2 dB for the changing fundamental /ʌ/, +2.3 dB for the low fundamental /ʌ/, and +3.8 dB for the high fundamental /ʌ/. The changing f0 /ʌ/ was also refiltered using a 10-4000 Hz bandpass filter, for the fourth experiment. This filtering did not affect the RMS amplitude of the stimulus, but increased its peak-to-peak amplitude by 0.3 dB. Twelfth-order linear predictive coding (LPC) spectra were computed for the four vowels in Experiments 1 and 2, after downsampling each to 8000 Hz. Spectra were computed over 50-ms windows beginning 500 and 1000 ms after stimulus onset, using the Colea software tool for MATLAB (Loizou, 1998). These spectra are displayed in Figure 2.1. For the third experiment, we recorded three vowels (/ʌ/, /u/, and /i/), spoken in succession without any pause at a fundamental frequency near 104 Hz. Each vowel lasted for approximately 500 ms, and then blended seamlessly into the next vowel. Vocal level and fundamental frequency were held relatively constant. A sample of approximately 1.5 s was extracted and resampled to be exactly 1.5 s in length. This sample included the three vowels, beginning with /ʌ/, and ending with /i/. The sample was then bandpass filtered (10-6000 Hz) and scaled to have an RMS amplitude equal to the stimuli used in the first two experiments (peak-to-peak amplitude was 6.4 dB greater than the RMS level).
Figure 2.1. LPC spectra of /ɑ/, /i/, /u/, and /ʌ/. The LPC analyses (12-pole) were conducted on 50 ms windows starting 500 ms (solid line) and 1000 ms (dashed line) into each stimulus, after down-sampling each to an effective sampling rate of 8000 Hz. The first three vowels were used in Experiment 1 and the fourth vowel was used in Experiments 2, 4 and 5.
For the fifth experiment, two new stimuli were created by refiltering the changing f0 /ʌ/ stimulus used in the second experiment. One stimulus was highpass filtered at 200 Hz (using a 1000 point finite impulse response filter) to remove the first harmonic. The other stimulus was bandpass filtered between 50 and 200 Hz, leaving only the first harmonic. This was perceived as a lowpitched hum without any phonemic identity.
2.3.3 Creation of the Reference Sinusoids
The Fourier analyzer relates the recorded response to a reference frequency. In the present study, we created references that followed the fundamental frequency of the voice. The energy in the vowel sounds is concentrated at this fundamental frequency (f0) and at integer multiples of this frequency called the harmonics and numbered by the multiple as f2, f3, f4, etc. When the first harmonic (f1) is present, it is identical to the fundamental frequency (f0). However, it is possible to have sounds where f1 is not present and f0 is perceptually inferred, as in the fifth experiment. The first harmonic, initially present in all of the stimuli (although it was later removed for Experiment 5), was used to create a reference track for the Fourier analyzer. The initial step in creating this f1 reference was to isolate the first harmonic. Visual inspection of the vowel spectra indicated that this was generally located between 50 and 200 Hz. In order to ensure that only the first harmonic was present, all of the samples were bandpass filtered between 50 and 200 Hz (except the multi-vowel sample used for the third experiment - this was bandpass filtered between 50 and 180 Hz since the mean frequency was lower for this stimulus). Filtering was accomplished with a 1000-point finite impulse response filter with zero phase distortion. The first two rows in Figure 2.2 show the first and last 30 ms of the steady /ɑ/ stimulus before and after this filtering (the bottom three rows are discussed below).
Figure 2.2. Creation of “f1” reference sinusoids. The first and last 30 ms of the steady fundamental /ɑ/ stimulus are displayed (top). This stimulus was filtered between 50 and 200 Hz to isolate the first harmonic (second row). The instantaneous phase of this harmonic was calculated as the angle of the Hilbert transform. The instantaneous frequency was then calculated (the first derivative of the unwrapped phase with respect to time), and smoothed to remove any sharp changes resulting from approximate differentiation (third row). Adjacent frequency tracks could then be created by simply transposing the frequency track. The cosine of the cumulative sum of the starting phase and the derivative of instantaneous frequency (with respect to time) produced a normalized reference sinusoid in phase with the stimulus (fourth row). Real (cosine) and imaginary (sine) reference sinusoids were used for the Fourier analyzer, after delaying each by 10 ms (fifth row).
The response likely follows f0 (as conveyed by the speech envelope) rather than f1, but the parameters of the envelope (e.g. phase, etc.) may not be exactly the same as the signal at the f1 frequency. This could cause some distortion of the response if it is measured with an f1 reference, particularly when the formants of the stimulus are changing. Thus, an envelope-based f0 reference was also created. This reference was intended to match the envelope as it would be represented in the auditory brainstem. Signals in the brainstem must first pass through the cochlea, which introduces a frequency-dependent delay that affects the envelope. Estimates of the latency of the traveling wave suggest that a signal requires approximately 3 ms to travel from the 10 kHz region of the cochlea to the 1 kHz region and a further 5 ms to reach the 250 Hz region (Eggermont, 1979; Kimberley et al., 1993; Schoonhoven et al., 2001). In the present study, each stimulus was filtered into 20 third-octave bands, with center frequencies ranging from 105 to 8000 Hz. Each band was then delayed by the estimated cochlear delay at the center frequency of the band, determined according to the function derived by Schoonhoven and colleagues (2001) for a 70 dB click:

d = 3.25 (c/1000)^(-0.69)

where c is the center frequency in Hz, and d is the estimated cochlear delay, in milliseconds. The bands were then summed, and the complete signal was half-wave rectified and filtered at 50-200 Hz (or 50 to 180 Hz for the multi-vowel stimulus). This process is illustrated in Figure 2.3. For both the f1 and f0 references, all subsequent steps were the same. The frequency of the fundamental was derived in the following manner. The Hilbert transform provided the complex representation of the filtered vowel samples, and the instantaneous phase was determined by finding the four quadrant inverse tangent. The instantaneous frequency was then calculated at each point by finding the derivative of the unwrapped phase with respect to time. The result was smoothed to remove any sharp changes introduced by the process of approximate differentiation, using a 500-point boxcar moving average applied three times. The third row in Figure 2.2 shows the first and last 30 ms of the f1 frequency track of /ɑ/.
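The frequency-derivation step might look like the following (a minimal MATLAB sketch using a synthetic test signal in place of the filtered vowel; the variable names, test signal and smoothing details are assumptions rather than the original code).

    fs   = 32000;
    t    = (0:1.5*fs-1)'/fs;                        % 1.5 s, matching the stimulus duration
    f_in = 120 + 20*t/1.5;                          % stand-in fundamental rising from 120 to 140 Hz
    x_f1 = sin(2*pi*cumsum(f_in)/fs);               % stand-in for the 50-200 Hz filtered first harmonic
    phi    = unwrap(angle(hilbert(x_f1)));          % instantaneous phase from the Hilbert transform
    f_inst = gradient(phi) * fs/(2*pi);             % approximate instantaneous frequency (Hz)
    for k = 1:3                                     % 500-point boxcar moving average, applied three times
        f_inst = conv(f_inst, ones(500,1)/500, 'same');
    end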
Figure 2.3. Creation of “f0” reference sinusoids. The first and last 30 ms of the steady fundamental /ɑ/ stimulus are displayed (top). This stimulus was bandpass filtered into 3rd octave bands, with each band subjected to frequency-dependent delay to simulate the cochlear traveling wave. The bands were then summed (second row). The envelope was calculated by half-wave rectification (third row), and then filtered between 50 and 200 Hz (fourth row). The normalized signal was then calculated in the same manner as the f1 reference sinusoid (i.e. via the Hilbert transform). This is shown in the fifth row. Real (cosine) and imaginary (sine) reference sinusoids were further delayed by 5 ms to simulate neurophysiologic delay (not shown).
Frequency tracks were then created at the second (f2) and third harmonics (f3) by, respectively, doubling and tripling the f1 track. Frequency tracks were also created at 2f0 and 3f0 by doubling and tripling the f0 track. Since recorded electrophysiologic responses were low-pass filtered at 300 Hz (and digitized at a rate of 1000 Hz), higher harmonics were not tested. The frequency of the third harmonic often exceeded 300 Hz, so third harmonic responses were partly attenuated by the filter. Adjacent frequency tracks were created to provide a means of measuring the energy at non-stimulus frequencies (i.e. to assess the electrical noise levels). Eight non-stimulus tracks above and below the fundamental and each of the harmonics were created by adding and subtracting a fixed number of cycles per second. These were separated from the fundamental (or harmonic) and from each other by 2 Hz. This spacing was determined in accordance with the resolution of the analyzer (2 Hz, discussed below). Figure 2.4 presents all of the frequency tracks for the steady /ɑ/ sample used in Experiment 1 (left) and the variable /ʌ/ sample used in Experiment 2 (right). In both cases, the f1 reference tracks are shown. Reference sinusoids were created for each frequency track (i.e. each fundamental or harmonic and adjacent frequency) by calculating the sine and cosine of the instantaneous phase angle. This produced orthogonal sinusoids with unity amplitude, as required by the Fourier analyzer. The last row of Figure 2.2 shows the first and last 30 ms of the f1 reference sinusoids for the steady f0 /ɑ/.
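A sketch of this step is given below (assumed variable names; the fundamental track f_inst would come from the frequency derivation above, here replaced by a steady placeholder at the 1000 Hz analysis rate).

    fs      = 1000;                                 % EEG analysis rate
    f_inst  = 115*ones(1500,1);                     % placeholder: steady 115 Hz fundamental track, 1.5 s
    offsets = (-8:8)*2;                             % the track itself plus eight 2 Hz steps on either side
    ref_cos = zeros(numel(f_inst), numel(offsets));
    ref_sin = zeros(numel(f_inst), numel(offsets));
    for k = 1:numel(offsets)
        f_adj = f_inst + offsets(k);                % one frequency track (Hz)
        phi   = 2*pi*cumsum(f_adj)/fs;              % integrate frequency to obtain phase
        ref_cos(:,k) = cos(phi);                    % real (cosine) reference, unity amplitude
        ref_sin(:,k) = sin(phi);                    % imaginary (sine) reference
    end
    % Harmonic tracks would be built in the same way from 2*f_inst and 3*f_inst.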
Figure 2.4. Frequency track of the f1 reference frequency (thick line) and of adjacent frequencies (thin gray lines), for the steady fundamental /ɑ/ stimulus (top left) and for the changing fundamental /ʌ/ stimulus (top right). Rate of change of the instantaneous f1 reference frequency for each stimulus (bottom).
2.3.4 Creation of the Fourier Analyzer
The response amplitude along a frequency trajectory can be calculated by multiplying the response by the real and imaginary components of a reference frequency and then integrating the products over time (see Regan, 1989). The integration smoothes out the high-frequency distortion-products from the multiplication. Given that the imaginary component lags the real component by 90˚, the integrated products are effectively sine and cosine projections (x and y) of the measured response. The amplitude (a) and phase (φ) can then be calculated using the following equations:

a = √(x² + y²)

φ = tan⁻¹(y/x)

The length of the integration is inversely related to the frequency resolution of the analysis, such that the maximum resolution is equal to 1/T, where T represents the integration time, in seconds. For instance, an integration period of 1 s can provide a maximum resolution of 1 Hz, whereas an integration period of 0.5 s can provide a maximum resolution of 2 Hz. Integration was accomplished with a 500-point (i.e. 500 ms) boxcar moving average, providing a maximum frequency resolution of 2 Hz. Despite the smoothing of the integration filter, the response is related instantaneously to the reference signal. The Fourier analyzer calculates the response locked to the reference signal as both the reference and the response change from moment to moment, and this calculation precedes integration. This is problematic, because the physiological response is delayed by the time required for the transduction of the signal and the transmission of the signal information to the place where the response is generated. While a delay shorter than the signal period would merely produce a phase shift in the response, a longer delay would attenuate the response. We estimated the probable delay of the response as approximately 10 ms (see Table 1 in Picton et al., 2003). The f1 reference was therefore delayed by 10 ms prior to multiplication to compensate for estimated neurophysiologic delay. For the f0 reference, since the estimated cochlear delay had already been taken into account (with the average cochlear delay being about 5 ms), the reference was only delayed by a further 5 ms. The appendix provides a more detailed discussion
of this issue. In the present study, the Fourier analyzer was implemented digitally in MATLAB, and the analysis was conducted offline. Although the Fourier analyzer is able to measure a response to a changing stimulus within a single integration period, it cannot provide any temporal information within this period, since the response is integrated with respect to time. However, in the present study, this integration was accomplished via a moving average, which provided temporal information corresponding to the sample by sample updates. The result of each average was plotted at the center of the average, so as not to introduce any lag in the response.
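Putting these pieces together, the core of the analyzer might look like the following (a minimal MATLAB sketch with a simulated response, not the original implementation; the 10 ms delay, 500-point integration and factor of two that recovers peak amplitude follow the description above).

    fs    = 1000;                                   % EEG analysis rate
    t     = (0:1499)'/fs;                           % one 1.5 s epoch
    fref  = 115*ones(size(t));                      % reference frequency track (steady in this example)
    phi   = 2*pi*cumsum(fref)/fs;                   % reference phase
    eeg   = 0.5*sin(phi - 2*pi*115*0.010) + 0.5*randn(size(t));  % simulated response delayed ~10 ms, plus noise
    d     = round(0.010*fs);                        % compensate the assumed 10 ms neurophysiologic delay
    refC  = [zeros(d,1); cos(phi(1:end-d))];        % delayed real reference
    refS  = [zeros(d,1); sin(phi(1:end-d))];        % delayed imaginary reference
    x     = 2*conv(eeg.*refC, ones(500,1)/500, 'same');  % cosine projection (500-point boxcar integration)
    y     = 2*conv(eeg.*refS, ones(500,1)/500, 'same');  % sine projection
    amp   = sqrt(x.^2 + y.^2);                      % response amplitude along the reference track
    ph    = atan2(y, x);                            % response phase along the reference track
    % Away from the epoch edges, amp should hover near the simulated 0.5 amplitude
    % and ph should remain roughly constant.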
2.3.5 Procedure
During the recordings, participants were seated in a comfortable reclining chair in a double-walled, sound-attenuating chamber. They were encouraged to sleep during the recording. Stimulus presentation and data recording were controlled using a modified version of the MASTER system (John and Picton, 2000). Stimuli were DA converted at a rate of 32000 Hz, routed through a GSI 16 audiometer, and presented at a level of 50 dB HL (i.e. 66 dB SPL in a 2cc coupler) through an EAR-Tone 3A insert earphone in the right ear. An EAR earplug was inserted in the left ear. Electroencephalographic activity was recorded between the vertex and the mid posterior neck using a Grass P55 preamplifier with a bandpass of 1-300 Hz. Interelectrode impedances were maintained below 5 kΩ. Responses to one hundred 6-second sweeps were recorded continuously at an AD rate of 1000 Hz for a total recording time of 10 minutes per condition. Each 6-second sweep of the sound was comprised of four identical 1.5-second epochs, which were weighted prior to averaging based on response variance between 70 and 500 Hz (John et al., 2001). In order to ensure that electrically recorded responses were not artifactual, a further condition was added in which data were collected while no sound was presented to participants. In order to avoid waking or disturbing the participants, this was accomplished by switching the audiometer routing to the left channel, which delivered the sound to a plugged EAR-Tone 3A insert earphone lying on the subject's chest. During this recording, the EAR earplug remained in the
subject's left ear, and the EAR-Tone 3A insert earphone (connected to the inactive right channel) remained in the subject's right ear.
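The noise-weighted averaging of epochs mentioned above can be sketched as follows. The exact weighting scheme is the one described by John et al. (2001); the code below only illustrates the general inverse-variance idea, with assumed variable names.

% Illustrative inverse-variance weighting of epochs (general idea only).
% epochs : samples x nEpochs matrix of 1.5-s epochs recorded at fs = 1000 Hz
fs       = 1000;
[b, a]   = butter(4, [70 500]/(fs/2));             % band-pass covering the 70-500 Hz noise band
noiseVar = var(filtfilt(b, a, epochs));            % noise variance of each epoch
w        = 1 ./ noiseVar;                          % quieter epochs receive larger weights
avgSweep = epochs * w(:) / sum(w);                 % weighted average across epochs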
2.3.6 Analysis
Once the response was recorded, it was necessary to determine whether it was significantly different from the residual background EEG noise. One way to do this is to compare the power of the response at the vowel frequency to the power of the response at adjacent frequencies where no response is expected, using an F statistic (Zurek, 1992; Lins et al., 1996). A second way is based on the variance of the response over successive iterations. If the response recorded at the expected frequency reflects a neural response to the stimulus, the amplitude and phase of that response should remain relatively consistent from trial to trial. This can be assessed using Hotelling's T2 (see Picton et al., 2003) or the more powerful circular T2 (assuming equivalent variance in the real and imaginary dimensions; Victor and Mast, 1991). Note that the significance of the T2 statistic does not depend on the response power measured at adjacent frequencies. This is a desirable characteristic, since with a spectrally complex stimulus such as speech, it might not be safe to assume that all adjacent frequency responses are noise. Both approaches were used in the present study. Results were analyzed using a circular T2 test, and a second analysis compared the amplitudes at the voice frequency to the amplitudes at the 16 adjacent frequency tracks (all analyzed using the Fourier Analyzer) by means of an F test with degrees of freedom 2 and 32. The F test in the FFT is essentially the same as the T2 test when the number of adjacent frequency bins in the F-test is equal to one more than the number of response measurements in the T2 test (Dobie and Wilson, 1993), although this may not be true for a Fourier Analyzer with a moving reference frequency. Since the F test was conducted with 16 adjacent frequency bins, while the circular T2 test was conducted with 12 measurements, the F test should have had more power to detect a response in the present analysis. An alpha criterion of 0.05 was selected for all analyses. Results were also analyzed comparatively, using three-factor repeated-measures ANOVAs. The factors were vowel (which varied by experiment), harmonic (first, second or third), and reference (f0 or f1). Greenhouse-Geisser corrections were used where sphericity could not be assumed. Post hoc tests were conducted using paired t tests with appropriate Bonferroni corrections.
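For concreteness, the adjacent-bin F test can be written in a few lines. This is a hedged sketch with assumed variable names (sigEst, adjEst); it assumes the Fourier analyzer returns one complex estimate per frequency track.

% Sketch of the F test against 16 adjacent frequency tracks (df = 2 and 32).
% sigEst : complex response estimate at the fundamental track
% adjEst : vector of 16 complex estimates at the adjacent tracks
F    = abs(sigEst)^2 / mean(abs(adjEst).^2);       % signal power over mean noise power
df1  = 2;                                          % real and imaginary parts of the signal bin
df2  = 2 * numel(adjEst);                          % 2 x 16 = 32
p    = 1 - fcdf(F, df1, df2);                      % fcdf requires the Statistics Toolbox
significant = p < 0.05;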
2.4 Results
2.4.1 Experiment 1: Steady Fundamental Frequencies
2.4.1.1 Response Characteristics
Figure 2.5 shows an example of a single subject response to the vowel /ɑ/, using the f1 reference. In Figure 2.5 (top), the amplitude of the response is shown for one entire sweep. In Figure 2.5 (bottom), the phase of this response is shown. The amplitudes and phases calculated along the 16 frequency trajectories adjacent to the fundamental are presented as well. The amplitude of the response at the fundamental (dark black line) is much larger than the amplitudes at the other frequencies (thin gray lines). The phase of the response at the fundamental is also constant, while the phases at the adjacent frequencies fluctuate wildly. Mean response amplitudes at the fundamental and at the adjacent frequencies were determined using the Fourier analyzer for each participant. Table 2.1 presents the mean and standard deviation of the amplitude at the fundamental, as well as the grand mean and standard deviation of all adjacent frequency amplitudes, for each stimulus. Values are displayed for both the f1 reference and the f0 reference. Responses are presented for the /ɑ/, /i/, and /u/ vowels, as well as for the condition in which no sound was presented. In the latter condition, 'fundamental frequency' refers to the vowel stimulus delivered to the plugged insert earphone. Note that the mean amplitude in this condition was comparable to the amplitude of responses at the adjacent frequencies (noise estimates). This supports the conclusion that the large responses at the fundamental for the /ɑ/, /i/ and /u/ stimuli were true auditory responses, and not electrical artifact. The incidence of adjacent-frequency responses (noise estimates) recognized as significant on the circular T2 test was not significantly different (as determined using a χ2 test) from the expected number based on the chosen alpha criterion (5%).
Figure 2.5. Amplitude (top) and phase (bottom) of electrophysiologic response, for a single subject, to the steady f0 /ɑ/ stimulus. Thick line is response at fundamental (f1 reference), thin lines are responses at adjacent frequencies. Note that the response at the fundamental has a much higher amplitude than the responses at adjacent frequencies. The phase of the response at the fundamental is also relatively steady, whereas response phase at the adjacent frequencies fluctuates wildly.
Table 2.1. Results of Experiment 1. Mean amplitudes (and standard deviations) at the fundamental frequency and across all adjacent frequencies. Results are shown for the f1 reference and the f0 (envelope) reference. Also included are the percentage of significant responses (as determined using an F statistic and a circular T2 statistic), as well as the minimum, average and maximum recording time required to elicit a significant response (p < 0.05).
Figure 2.6 presents responses in polar form. Twelve estimates of amplitude and phase were extracted from each averaged 6-second sweep at 500 ms intervals. This spacing was chosen to ensure that the estimates were independent of each other, since the moving average used for integration was 500 ms in length. Also presented are the 95% confidence limits of the mean response, which are used for the circular T2 test. When the confidence limits do not include the origin, the response is considered to be significant. Figure 2.6 presents the responses for one subject (Fig. 2.6, top row) and the 95% confidence limits of the mean responses for all subjects (Fig. 2.6, middle row). All responses were clearly significant at p < 0.05 (see Table 2.1). All responses for all subjects were also significant at p < 0.05 using the F test. For the 'no sound' condition, the 95% confidence limits of the means for all subjects are also presented in Figure 2.6. No responses were significant for this condition for any of the subjects (also see Table 2.1), suggesting that there was no significant electrical artifact. This was true for both the circular T2 test and the F test. An identical pattern of results was found when the reference sinusoid was based on the envelope (f0) reference instead of f1. Response amplitudes showed a significant interaction between vowel and reference (F(3,27) = 67.49, p < 0.001). Post hoc tests showed that response amplitudes were higher for /ɑ/ when the f1 reference was used. There was also a significant interaction between vowel and harmonic (F(6,54) = 22.43, p < 0.001). Post hoc tests indicated that the responses to /ɑ/, /i/, and /u/ were higher than the responses to the no sound condition. Also, responses to /i/ were higher than responses to /ɑ/ at the first and second harmonic, and higher than responses to /u/ at the first harmonic. Response amplitude also decreased at the second and third harmonics.
Figure 2.6. Polar plots of one participant's responses for Experiment 1, with 95% confidence intervals superimposed (top). 95% confidence intervals for all subjects for Experiment 1 (middle), and for all subjects for Experiments 2 and 3 (bottom).
2.4.1.2 Recording Time
In the present study, results were based on a total of 100 individual 6-second sweeps, each comprising 4 identical 1.5 second epochs. In order to determine the time required to generate a significant response, the Fourier analyzer was applied iteratively to cumulatively averaged sweeps. Response significance was quantified after each sweep was added to the cumulative average, using the circular T2 test with an alpha of 0.05. The number of sweeps that were included in the average when a significant result was first found was then recorded. Table 2.1 presents the minimum, average and maximum time required (i.e. number of sweeps x 6 s) to elicit a significant response for each of the subjects and stimuli. A repeated-measures ANOVA was conducted on the (log-normalized) time required to recognize a significant result. A three-way interaction of vowel, harmonic and reference was found (F(4,12) = 4.30, p < 0.05). Post hoc testing indicated that it took longer to elicit a significant response to the /ɑ/ vowel at the third harmonic when the f0 reference was used instead of the f1 reference.
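The recording-time estimate can be expressed as a simple loop. The helper isSignificantT2circ below is hypothetical; it stands in for the circular T2 test described in the Analysis section above, and the variable names are illustrative.

% Sketch of the cumulative-average procedure used to estimate recording time.
% sweeps : samples x 100 matrix of individual 6-s sweeps
firstSig = NaN;
for n = 1:size(sweeps, 2)
    cumAvg = mean(sweeps(:, 1:n), 2);              % cumulative average of the first n sweeps
    if isSignificantT2circ(cumAvg)                 % hypothetical helper (alpha = 0.05)
        firstSig = n;
        break
    end
end
timeRequired = firstSig * 6;                       % seconds of recording needed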
2.4.2 Experiment 2: Changing Fundamental Frequencies
Since natural speech often varies in pitch, the second experiment looked at the ability of the Fourier Analyzer to measure responses to a stimulus with a fundamental frequency that varied more than in the first experiment. For comparison purposes, responses to relatively steady-state portions of the same vowel stimulus were also recorded.
2.4.2.1 Response Characteristics
Mean response amplitudes at f0 and at the adjacent frequencies were determined using the Fourier analyzer for the three /ʌ/ stimuli (high-low, high, and low). Table 2.2 presents the means and standard deviations at the fundamental and at all adjacent frequencies for each subject, using both references.
Table 2.2. Results of Experiments 2 and 3 at the f1 reference (i.e. the first harmonic) and the f0 reference (i.e. the envelope). The data are organized as in Table 2.1.
Table 2.3. Results of Experiments 2 and 3, at the second and third harmonics. The data are organized as in Table 2.1.
Individual subject responses to each of the stimuli were also evaluated for significance. Figure 2.6 (bottom row) presents the 95% confidence intervals of the mean for all subjects, in polar form. As in the first experiment, the incidence of adjacent-frequency responses recognized as significant was not significantly different (as determined using a χ2 test) from the expected number based on the chosen alpha criterion. According to the F test, all subject responses at the fundamental were significant for all of the stimuli, regardless of whether the f1 or f0 reference was used. In contrast, the circular T2 test detected significant responses for all ten subjects for the two steady versions of /ʌ/, and for 9 subjects for the variable fundamental /ʌ/. Responses at f2 and f3 are presented in Table 2.3. The F test detected significant responses for all subjects to all of the vowels at both of these harmonics, whereas the circular T2 test only detected significant responses in 90-100% of subjects. We wanted to determine whether there was a significant difference between responses to the changing frequency vowel and responses to steady portions of the same vowel. A repeated-measures ANOVA revealed a significant interaction of vowel and reference (F(2,18) = 14.89, p < 0.001), and a main effect of harmonic (F(2,18) = 27.58, p < 0.001). Post hoc tests at the f0 reference indicated that responses to the changing frequency /ʌ/ were lower than responses to the low and high frequency steady portions of the same vowel. At the f1 reference, responses to the changing frequency /ʌ/ were lower than the high frequency steady /ʌ/, but not significantly different from the low frequency steady /ʌ/. Response amplitudes also decreased significantly at the second and third harmonics (see Table 2.3).
2.4.2.2 Recording Time
A repeated-measures ANOVA was conducted on the (log-normalized) time required to elicit a significant result, but no main effects or interactions were found. Thus, although the response amplitude was lower for the variable fundamental /ʌ/, all of the responses were significant according to the F test, and the time required to recognize a significant result was not significantly different.
2.4.3 Experiment 3: Changing Vowels
The Fourier analyzer was clearly able to detect responses to a vowel with a changing fundamental frequency, but this is only one component of the variability that exists in natural speech. Another source of variability in speech is formant movement. Formants determine which harmonics are most audible, and thus determine the identity of the vowel. This experiment evaluated the response to a speech sound that varied among three different vowel identities.
2.4.3.1 Response Characteristics
Mean response amplitudes at the fundamental frequency and at the adjacent frequencies were determined using the Fourier analyzer for the three-vowel stimulus (/ʌui/). Table 2.2 (columns 4 and 8) presents the mean amplitudes at the fundamental and at all adjacent frequencies. Using the f1 reference, all but one of the responses were significant, according to the circular T2 test and the F test. The same result was found with the circular T2 test and the f0 reference. However, all of the responses were found to be significant with the F test and the f0 reference. Figure 2.6 (bottom right) presents the 95% confidence intervals of the mean for all subjects, in polar form. When using the f1 reference, the incidence of significant responses at adjacent frequencies (using the T2circ statistic) was significantly higher than the expected number of false-positive responses (χ2 = 5.6, p < 0.05). Given the spectral complexity of this stimulus (i.e. the rapid formant changes), it is possible that significant responses to adjacent frequencies were not false positive responses, but true responses to the stimulus, reflecting a wider distribution of response energy. Although sixteen adjacent frequency tracks were tested (eight above and eight below the stimulus), all significant adjacent frequency responses occurred in the eight adjacent frequency bands nearest to the fundamental, and were approximately normally distributed. These responses were thus likely related to the stimulus, and should not be considered to be false positives. When using the f0 reference, the number of significant responses at adjacent frequencies was reduced to 8.1% which is not significantly different than the expected number of false positive responses. The responses to /ʌui/ using the f1 reference and the f0 reference are shown in Figure 2.7, for all frequency tracks.
Figure 2.7. Average response amplitude plotted as a function of frequency and time. Amplitude is represented by color warmth, with scale indicated for each row (on left). Average responses (all subjects) to the multivowel stimulus /ʌui/ are displayed with respect to the f1 reference (left) and f0 reference (right). Note that the response amplitude is lowest for the portion of the stimulus where the f0 reference is frequency modulated.
Using the envelope-based reference reduced the number of adjacent frequency responses, and increased the number of responses detected at the fundamental. This suggests that the responses were truly envelope FFRs. However, there were still more false positives than found in other conditions, and they were distributed in the eight frequency bins surrounding the fundamental, suggesting that either the response energy was distributed more widely to this stimulus than to the single-vowel stimuli, or that the envelope reference did not match the activity in the brainstem closely enough. Because the same subjects participated in the second and third experiments, it was possible to analyze the results from both experiments using a single repeated-measures ANOVA. This revealed an interaction between vowel and reference (F(3,27) = 12.40, p < 0.001), with post hoc testing indicating that the response to /ʌui/ was higher than the response to the changing fundamental /ʌ/ (using the f0 reference). There was also a main effect of harmonic (F(2,18) = 50.05, p < 0.001), with the first harmonic being higher than the second, and the second being higher than the third.
2.4.4 Experiment 4: Effects of Bandwidth and Vowel Identity
Mean response amplitudes in Experiments 2 and 3 were generally lower than in Experiment 1. This was an unexpected result, given that the high and low /ʌ/ vowels in Experiment 2 were relatively steady vowels, like the steady /ɑ/ in Experiment 1. One possible explanation is wider stimulus bandwidth for the stimuli in Experiment 2 (10-6000 vs. 10-4000 Hz in Experiment 1). Since the stimuli were normalized to have equal RMS amplitude, the stimulus with a smaller bandwidth (/ɑ/) might have had more energy in the frequency region below 4000 Hz, where the most important formant energy is located. Alternatively, the information between 4000 and 6000 Hz in the /ʌ/ stimulus might have reduced response amplitude if it was not phase-locked to the fundamental. Auditory nerve fibers with characteristic frequencies above the second formant usually phase lock very well to the fundamental (Schilling et al., 1998; Delgutte and Kiang, 1984), and could have been prevented from doing so if something non phase-locked in the stimulus was dominating their responses. Another possible explanation is the different vowel identity (/ɑ/ in Experiment 1 vs /ʌ/ in
Experiment 2). Figure 2.1 shows the vowels' formant locations, which are responsible for their identities. Although the second formants are similar, the first formant of /ʌ/ is much lower than the first formant of /ɑ/. It is possible that some vowels elicit stronger envelope following responses than others. Evidence for this was provided in Experiment 1, where responses were significantly higher to the /i/ stimulus than to the /ɑ/ stimulus. To test these alternatives, responses were recorded to the steady fundamental /ɑ/ used in Experiment 1 (100-4000 Hz bandwidth), the changing fundamental /ʌ/ used in Experiment 2 (100-6000 Hz bandwidth), and a band-limited version of the changing fundamental /ʌ/ (100-4000 Hz bandwidth), with unchanged RMS amplitude. If the larger amplitude was related to the smaller bandwidth, the band-limited /ʌ/ should have elicited larger responses. If the difference was due to different vowel identities, the band-limited /ʌ/ should have elicited responses similar to the full bandwidth /ʌ/. Results are shown in Table 2.4, using the f1 reference. Mean response amplitude for the band-limited /ʌ/ was lower than for the full bandwidth /ʌ/, and both were much lower than for the /ɑ/ vowel. A repeated-measures ANOVA revealed a significant interaction of vowel and harmonic (F(4,28) = 15.42, p < 0.005) and a significant interaction of vowel and reference (F(2,14) = 80.20, p < 0.001). Post hoc testing indicated that at the first harmonic, the responses to /ɑ/ were significantly higher in amplitude than responses to the full-bandwidth /ʌ/ and the restricted-bandwidth /ʌ/. Also, at the first and second harmonics, responses to the full-bandwidth /ʌ/ were higher than responses to the restricted-bandwidth /ʌ/. Thus, the smaller bandwidth decreased rather than increased response amplitude. The f1 reference performed slightly better than the f0 reference for these stimuli, although the pattern of results was identical for both references.
Table 2.4. Results of Experiment 4 (at f1 reference). Results are organized as in Table 2.1.
Table 2.5. Results of Experiment 5. Results are organized as in Table 2.1.
2.4.5 Experiment 5: Envelope or Frequency Following?
The results of Experiment 3 provided evidence that the recorded responses at the fundamental were following the stimulus envelope, not the first harmonic. In the multi-vowel stimulus, the first harmonic was steady while the higher formants were changing. This resulted in an increase in adjacent frequency responses, and a decrease in responses at the fundamental. These changes should not have occurred if the response was following the first harmonic. Additionally, use of an envelope-based reference decreased the number of adjacent frequency responses and increased the number of responses at the fundamental. However, since the first harmonic was present in all of the stimuli, it is still possible that the responses were frequency following, and not envelope following. In order to test this, the changing fundamental /ʌ/ was presented either with no first harmonic (“no f1”), or with only the first harmonic (“f1-only”). If the responses were frequency following responses, removal of the higher harmonics should not have reduced the response amplitude. If the responses were envelope following responses, removal of the first harmonic should not have reduced the response amplitude. Results are presented in Table 2.5. It is clear that response amplitude was higher in the no-f1 condition than in the f1-only condition, regardless of which reference was used. Figure 2.8 shows response amplitude in the f1-only condition (Fig. 2.8 left) and in the no-f1 condition (Fig. 2.8 right). And while all of the responses to the no-f1 stimulus were significant, only 50-65% of the responses to the f1-only condition were significant. A repeated-measures ANOVA revealed a significant interaction between vowel and reference (F(1,15) = 54.56, p < 0.001), with post hoc tests finding that response amplitude was significantly higher to the no-f1 stimulus than to the f1-only stimulus, for both references. A main effect of harmonic was also found (F(2,30) = 39.25, p < 0.001) as expected. Although the results indicate that the responses primarily followed the envelope, it is of interest whether the removal of the first harmonic had any significant effect on the responses. It is possible that the first harmonic made a significant contribution to the responses in conjunction with the stimulus envelope. Since all of the participants in Experiment 5 also participated in Experiment 4, responses to the no-f1 /ʌ/ in Experiment 5 were compared with responses to the full-bandwidth /ʌ/ in Experiment 4, using a single repeated-measures ANOVA. Results showed no effects of removing the first harmonic on response amplitude or recording time.
Figure 2.8. Average response amplitude plotted as a function of frequency and time. Amplitude is represented by color warmth, with scale indicated for each row (on left). Average responses (all subjects) to the f1-only /ʌ/ (left) and no-f1 (i.e. missing fundamental) /ʌ/ (right) are displayed.
2.5 Discussion

Overall, our results support the use of a Fourier analyzer to detect envelope following responses to the fundamental frequencies of vowels. Significant responses were detected at the fundamental for all of the stimuli (and for all of the subjects), even when the fundamental frequency or the vowel identity was changing throughout the stimulus. When sound was presented to a plugged insert earphone instead of the ear, there were no significant responses, indicating that the results were not due to electrical artifact. Significant responses were detected at the fundamental in less than 1.5 minutes (on average), for all stimuli (excepting the f1-only stimulus used in Experiment 5).
2.5.1 Effects of Vowel Identity
Although all responses at the fundamental were significant, response amplitude varied significantly between different vowels. Responses to /i/ were significantly higher in amplitude than responses to /ɑ/ and /u/, and responses to /ɑ/, /u/, and /i/ were much higher than responses to /ʌ/. These differences were not related to the intensity of the sounds since they were all presented at the same RMS intensity; furthermore the differences did not follow the peak stimulus levels (the peak-to-peak amplitude of /u/ was lower than that of the variable fundamental /ʌ/ and the high fundamental /ʌ/). Interestingly, the three vowels that tended to elicit high amplitude responses (/ɑ/, /u/, and /i/) happen to be the vowels at the most extreme points of articulation. For instance, /i/ is produced with the tongue in the highest and most frontal position, /u/ is produced with the tongue in the highest and most rearward position (with lips rounded), and /ɑ/ is produced with the tongue in the lowest and most rearward position. It would be of interest to see whether central vowels (such as /ʌ/) always elicit lower amplitude responses as compared with vowels at articulatory extremes, but it is not obvious why this would occur. A closer look at formant structure (using LPC spectra; see Figure 2.1) suggests another possible account for the effects of vowel on response amplitude. Since formants carry most of the energy in vowels, the envelope FFRs likely followed the envelope modulation at the formants. However, due to cochlear delay, the neural response at each of the formants would not have occurred at the same time. Fundamental envelope modulation at these formants would then have been out of
phase, and the population response of the neurons following the envelope would have been smaller. The first formant (F1) for /ʌ/ (i.e. the vowel that produced the lowest response amplitude) was approximately 570 Hz, and the second formant (F2) was about 1250 Hz. Using the estimate of cochlear delay provided by Schoonhoven and colleagues (2001), the neural response to F1 would have occurred about 2 ms after the neural response to F2. At a frequency of 115 Hz (the approximate fundamental frequency of the /ʌ/), this time difference corresponds to a 90˚phase shift. Thus, neurons contributing to the envelope following response at F1 were likely out of phase with similar neurons at F2. If envelope modulations at F1 and F2 contributed equally well to the overall envelope following response, this phase shift could have diminished the overall response amplitude by roughly 30% (although the relative contributions are unknown). In contrast, F1 for /ɑ/ was about 795 Hz, and F2 was about 1195 Hz, so the response at F1 would have occurred less than 1 ms after the response at F2. This phase difference is less than 45˚ at the fundamental, so the overall response amplitude would have been greater (than the /ʌ/), which is what was found. For the /i/, F1 was about 235 Hz, and F2 was about 2580 Hz, so the response at F1 would have occurred just over 7 ms after the response at F2. This corresponds to a phase difference of approximately 300˚ (or -60˚), which should have had a small negative effect on amplitude. However, responses to /i/ were significantly higher than responses to any of the other vowels. It is possible that the very high F2 frequency for this vowel reduced or eliminated its contribution to the envelope FFR, precluding any phase mismatch effects. For the /u/, F1 was about 220 Hz, and F2 was about 945 Hz, so the response at F1 would have occurred about 6 ms after the response at F2. This roughly corresponds to a 270˚ lag. Since a 270˚ lag is equivalent to a 90˚phase shift, the phase interaction should have diminished the amplitude of the response by about 30% (as with /ʌ/). However, responses to /u/ were significantly higher than responses to /ʌ/. This may be because the /u/ F2 was about 25 dB lower than the /u/ F1, which may have reduced its contribution to the envelope following response. The position of the formants can therefore account for the effects of vowel identity on response amplitude, if it is assumed that only the lower parts of the spectrum (e.g. below 2000 Hz) contributed to the envelope FFR. Where a single formant dominated the lower part of the
spectrum (e.g. /i/ and /u/), or where there were two low-frequency formants whose envelope periodicities were similar in phase (e.g. /ɑ/), the response was higher than where there were two low-frequency formants whose envelope periodicities were less similar in phase (/ʌ/). Phase effects can also account for the results even if high frequency formants contributed equally well to the envelope following response, if we use an alternative estimate of cochlear delay. The estimate of cochlear delay in Eggermont (1979) is slightly larger at 250 Hz (9.92 ms) than the estimate of Schoonhoven and colleagues (8.09 ms; 2001). If Eggermont's estimate is taken, the envelope periodicities at F1 and F2 for /i/ (which produced the highest responses) would have been almost perfectly in phase. The F1-F2 envelope periodicity phase differences for /u/ and /ɑ/ (which had equivalent amplitudes that were lower than /i/) would both have been approximately 45˚, and the F1-F2 envelope periodicity phase differences for /ʌ/ (which had the lowest amplitude) would have been just over 90˚. Thus, using the cochlear latency estimates from Eggermont (1979), the vowel effects on response amplitude can be explained solely on the basis of phase differences in the fundamental envelope modulation carried by the first two formants. The present results cannot distinguish between these accounts. Also, since the same vowel tokens were used throughout the experiments, it is possible that response amplitude differences were related to accidental features of the stimuli (i.e. other than bandwidth, RMS amplitude or speaker). These results should therefore not be generalized to other tokens of the same vowels. The effects of vowel identity remain to be explored in future studies.
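For reference, the phase arithmetic used in this section can be made concrete with a small worked example. The numbers are those quoted above for /ʌ/ (Schoonhoven-based delay estimate); the assumption that the two formant regions contribute equal-amplitude components is only for illustration.

% Worked example of the F1-F2 phase interaction for /ʌ/.
f0    = 115;                     % approximate fundamental frequency (Hz)
dt    = 0.002;                   % cochlear delay difference between F1 and F2 (s)
dphi  = 360 * f0 * dt;           % phase difference, about 83 degrees (roughly 90)
scale = cosd(dphi/2);            % relative amplitude of two equal summed vectors
% scale is about 0.75; at exactly 90 degrees it is cosd(45) = 0.71,
% i.e. the roughly 30% reduction described above.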
2.5.2 Envelope or Frequency Following?
The results of Experiments 3 and 5 strongly suggest that the responses at the fundamental followed the speech envelope, and not the first harmonic. In Experiment 3, a multi-vowel stimulus was presented, with a nearly constant fundamental frequency. The first harmonic thus changed very little during the stimulus, while the envelope varied with formant movement. The fact that the number of responses at adjacent frequencies increased suggests that the responses were following the variable envelope. In Experiment 5, when only the first harmonic was presented, the response amplitude was drastically reduced, the number of significant responses was reduced, and the time required to
elicit a significant result was increased. In contrast, the removal of the first harmonic did not have any significant effect on response amplitude, the number of significant responses, or the time required to elicit a significant result. A number of studies (e.g. Galbraith, 1994; Greenberg et al., 1987) suggest that the response to a missing fundamental stimulus is not related to the envelope, but rather to the frequency of the perceived pitch (although see Hall, 1979 and Chambers et al., 1984, for the opposing view). Galbraith (1994) varied the phase of the middle frequency in a three-harmonic complex stimulus (with no first harmonic). When the middle component was shifted by 90 degrees, the stimulus envelope modulations were greatly reduced, without a corresponding reduction in the response at the fundamental. Greenberg and colleagues (1987) performed similar manipulations in an earlier study, and found similar results. Their study also presented inharmonic stimuli made up of only odd harmonics of a missing fundamental. Spectral analysis of the response found energy at the stimulus envelope, although measurements of the intervals between waveform peaks suggested that the primary frequencies in the response corresponded most closely to “pseudo-periods” created by interactions between stimulus components (which are similar to the perceived pitches). Thus, the robust responses in the no-f1 condition might not have been following the stimulus envelope, but rather the stimulus pitch – inferred from the higher harmonics. However, it is difficult to rule out the existence of envelope following responses to these complex stimuli, since there is ample evidence supporting the existence of envelope following responses to stimuli with similar carrier and modulation frequencies (review: Picton et al, 2003). The main finding that has cast doubt on the envelope following nature of these responses is that the responses are not eliminated when envelope modulation depth is greatly reduced (by shifting component phases). However, the relative phases of the stimulus components are greatly altered when the stimulus passes through the cochlea, since it introduces a frequency-dependent delay. The envelope modulation depth at the level of the auditory brainstem is therefore different than the envelope modulation depth in the stimulus (i.e. when the components contributing to the envelope are distributed in frequency). Therefore, the envelope following response need not be greatly reduced when stimulus-component phase relationships happen to reduce stimulus envelope modulation depth.
Thus these responses may have been “frequency following” in the sense of following a constructed pitch, or “envelope following” in the traditional sense. Perhaps a reference frequency based on a neurophysiologically inspired pitch tracking algorithm would have produced better results than the envelope-based reference (e.g. Dajani et al., 2005a, Krishnan, 2002). This question remains to be explored. What is clear is that the responses at the fundamental were not following the first harmonic, but rather the fundamental as represented by higher frequency speech components.
2.5.3 Choice of Reference Frequency
A nearly identical pattern of results was found using the envelope-based (f0) reference and the first harmonic-based (f1) reference. In general, the f1 reference tended to produce higher amplitude responses than the f0 reference, although this difference was slight. However, for the multi-vowel stimulus, the f0 reference produced better results (i.e. fewer adjacent frequency responses and more fundamental frequency responses). Formants contain most of the energy in voiced speech, and thus play a large role in determining the shape of the speech envelope. When formants are moving, the frequency components contributing most strongly to the envelope are also changing. Since the cochlea introduces a frequency-dependent delay that alters the phase of each component, the formant movement affects the sum phase of the response. The delay is estimated to be roughly 3 ms between 10 kHz and 1000 Hz and 5 ms between 1000 and 250 Hz (Schoonhoven et al., 2001). For an envelope frequency of 100 Hz (close to the fundamental frequency of the multi-vowel stimulus), these delays correspond to ⅓ and ½ of the stimulus period, respectively. Thus, the phase of the response might change appreciably with a shift in the set of frequencies contributing most strongly to the envelope. Phase and frequency are intimately intertwined. A changing phase produces a corresponding perturbation in the frequency. For instance, an increasing phase lag produces a downward shift in frequency, since the phase angle must rotate at a rate slightly less than 2πf radians per second (where f is the original frequency). This phenomenon is widely known as the Doppler Effect (Doppler, 1842). These phase effects are also the basis for constructing frequency-modulated signals (Hartman,
1997). The changing phase of the envelope caused by the changing formant frequencies (coupled with the frequency-dependent cochlear delay) likely caused a change in the frequency of the response. Since the f0 reference was constructed by calculating the envelope of the signal after applying the estimated cochlear (frequency-dependent) delay, the reference should have followed the envelope frequency changes introduced by formant movement, whereas the f1 reference only followed the first harmonic and could not have followed any envelope frequency changes. Thus, due to the combination of formant movement and cochlear delay, the frequency of the envelope (represented by the f0 reference) would have been different from the frequency of the first harmonic (represented by the f1 reference). If the brainstem response was following the first harmonic, the f1 reference would be expected to produce better results than the f0 reference, which did not occur. Instead, the f0 reference produced better results (100% of fundamental responses were significant, and the number of significant adjacent frequency responses was not significant), indicating that the response followed the envelope.
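To make the phase-frequency relationship explicit: if the response is written as sin(2πft + φ(t)), its instantaneous frequency is f + (1/2π)·dφ/dt. A phase lag that grows over time (dφ/dt < 0) therefore lowers the instantaneous frequency below f, and a shrinking lag raises it, which is the sense in which formant movement acting through the frequency-dependent cochlear delay perturbs the frequency of the envelope response away from that of the first harmonic.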
2.5.4 Limitations
Although the Fourier analyzer was effective in measuring responses to natural speech, there are several limitations to this approach as a means of verifying speech discrimination at the level of the brainstem. The first limitation is associated with the use of the fundamental frequency. We used the fundamental frequency as the reference frequency for the analyzer, because the fundamental frequency is a robust element in speech (i.e. all harmonic components are modulated at the fundamental). Whereas the amplitude of individual speech harmonics varies depending on formant location, the fundamental modulation should be present as long as at least one harmonic is audible. However, this robustness decreases the informative value of the test. A response detected at the fundamental frequency could be mediated by any number of harmonics, and might be detectable even in the presence of significant hearing loss. For example, a subject with a high frequency hearing loss might detect the fundamental pitch of the voice but not recognize the actual vowel because of an inability to hear the second harmonic. However, since the response follows the envelope, which is partly carried by high frequency harmonics,
someone with a hearing loss (even with normal low frequency hearing) would likely have a very small response (cf. the response to the f1 only stimulus in Experiment 5). Nevertheless, it would be useful to know which harmonics are mediating a response. This would be difficult to determine without modifying the speech (e.g. selectively adding and removing harmonics), but it would be preferable to avoid such modification, since a modified signal might produce an atypical response in a hearing aid or the peripheral auditory system. A better solution might be to analyze the responses to the harmonics themselves. The present study successfully measured responses to the second and third harmonics (more was not possible due to the filter settings of the recording), and other studies have measured responses at higher harmonics (e.g. Krishnan, 2002). This issue was addressed in a subsequent study, which is the subject of the next chapter. A second limitation is that this approach is only directly applicable to voiced speech. Much of the important information in speech is contained in relatively soft and noisy high-frequency consonants, of which many are non-voiced. It is just as important to verify the transmission of this high-frequency consonantal information as it is to verify the transmission of voiced speech. Further work is required to determine if there are effective ways of measuring brainstem responses to these unvoiced sounds in natural speech input. Perhaps a system can be set up to measure separate brainstem responses to noise bursts, stops and voiced vowels.
2.5.5 Conclusion
Natural speech is an ideal stimulus for testing the auditory system, especially when the goal is to ensure that speech information has been transmitted and processed by the auditory system (and possibly also a hearing aid). Envelope following responses to vowels were analyzed using a Fourier analyzer instead of an FFT, so that the natural pitch variations in the vowels would not reduce the signal-to-noise ratio of the response. We measured significant responses for all vowels with steady and variable pitches, as well as for a multi-vowel stimulus, in less than 1.5 minutes (on average). Results support the use of a Fourier analyzer to measure responses to natural speech.
3 Envelope and Spectral Frequency Following Responses to Vowel Fundamental Frequency, Harmonics and Formants
3.1 Abstract

Frequency-following responses (FFRs) were recorded to two naturally produced vowels (/a/ and /i/) in normal hearing subjects. A digitally implemented Fourier analyzer was used to measure response amplitude at the fundamental frequency and at 23 higher harmonics. Response components related to the stimulus envelope (“envelope FFR”) were distinguished from components related to the stimulus spectrum (“spectral FFR”) by adding or subtracting responses to opposite-polarity stimuli. Significant envelope FFRs were detected at the fundamental frequency of both vowels, for all of the subjects. Significant spectral FFRs were detected at harmonics close to formant peaks, and at harmonics corresponding to cochlear intermodulation distortion products, but these were not significant in all subjects, and were not detected above 1500 Hz. These findings indicate that speech-evoked FFRs follow both the glottal pitch envelope and spectral stimulus components.
3.2 Introduction

Infants with hearing impairment detected by neonatal hearing screening are referred for hearing aids within the first few months of age. Fitting is mainly based on thresholds obtained by electrophysiological measurements. However, these measurements are not exact, and it would be helpful to have some way of assessing how well the amplified sound is received in the infant's brain (Picton et al., 2001). Speech stimuli would be optimal because the main intent of amplification is to provide the child with sufficient speech information to allow communication and language learning. Speech sounds elicit both transient and sustained activity in the human brainstem. Transient responses are usually recorded with a consonant-vowel diphone stimulus. The speech-evoked auditory brainstem response evoked by /da/ has been used to investigate the brainstem encoding of speech in various populations (e.g. children with learning problems, Cunningham et al., 2001; King et al., 2002; children with auditory processing problems, Johnson et al., 2007), but not in children or adults wearing hearing aids. Longer latency transient responses have also been used in subjects with hearing impairment and with hearing aids (Billings et al., 2007; Golding et al., 2007; Korczak et al., 2005; Rance et al., 2002; Tremblay et al., 2006) though these are more variable in morphology – especially in infants (Wunderlich and Cone-Wesson, 2006). Most commercial hearing aids exhibit sharply non-linear behavior designed to preferentially amplify speech and attenuate other sounds. As a result, hearing aid gain and output characteristics are different for speech and non-speech stimuli, and different for transient and sustained stimuli. We have therefore been considering the use of sustained speech stimuli such as vowel sounds (Aiken and Picton, 2006) or even sentences (Aiken and Picton, 2008). Sustained speech stimuli can evoke a variety of potentials from the cochlea to the cortex. Since cortical potentials in infants are variable and change with maturation, a reasonable approach might be to measure frequency-specific brainstem responses to speech stimuli presented at conversational levels. Brainstem responses to sustained speech stimuli have been called envelope-following responses (Aiken and Picton, 2006), and frequency-following responses (Krishnan et al., 2004). Auditory 'steady-state responses' – envelope-following responses in the special case where the envelope does not change over time – have also been recorded in response to speech-like modulations (Dimitrijevic et al., 2004).
Although frequency-following responses are sometimes distinguished from envelope-following responses (e.g. Levi et al., 1995), the term 'frequency-following response' has been used to describe responses to speech formants (Plyler and Ananthanarayan, 2001), intermodulation distortion arising from two-tone vowels (Krishnan, 1999), speech harmonics (Aiken and Picton, 2006; Krishnan, 2002), and the speech fundamental frequency (Krishnan et al., 2004), which presumably relates to the speech envelope. Thus the term 'frequency following response' (FFR) can be used in a general sense – denoting a response that follows either the spectral frequency of the stimulus or the frequency of its envelope. For the purposes of this paper, we shall refer to “spectral FFR” and “envelope FFR.” For simplicity we shall restrict the term FFR to responses generated in the nervous system, and not include the cochlear microphonic or stimulus artifact, even though these do follow the spectral frequencies of the stimulus. An important difference between spectral and envelope FFR is that the latter is largely insensitive to stimulus polarity, much like the transient auditory brainstem response (Krishnan, 2002; Small and Stapells, 2005). Spectral FFR can thus be teased apart from the transient response by recording responses to stimuli presented in alternate polarities, and averaging the difference between the responses (Huis In't Veld, 1977; Yamada, 1977). Other researchers have averaged the sum of responses to stimuli presented in alternate polarities, in order to separate the FFR from the cochlear microphonic (e.g. Cunningham et al., 2001; King et al., 2002), but this manipulation likely distorts (and may eliminate) the spectral FFR, preserving only those aspects of the FFR locked to the stimulus envelope (Chimento and Schreiner, 1990). Speech FFRs may be ideal for evaluating the peripheral encoding of speech sounds, since they can be evoked by specific elements of speech (e.g. vowel harmonics; Aiken and Picton, 2006; Krishnan, 2002). FFRs may be evoked by several separate elements of speech. One is the speech fundamental – the rate of vocal fold vibration. The other is the harmonic structure of speech. Voiced speech has energy at the integer multiples of the fundamental frequency, which are selectively enhanced by formants (resonance peaks created by the shape of the vocal tract). Responses to harmonics may thus provide information about the audibility of the formant structure of speech.
3.2.1 Responses to the Fundamental
Frequency-following responses to the speech fundamental frequency should be relatively easy to record, since speech is naturally amplitude modulated at this rate by the opening and closing of the vocal folds. Although the amplitude envelope does not have any spectral energy of its own, energy at the envelope frequency is introduced into the auditory system as a result of rectification during cochlear transduction. In an earlier study (Aiken and Picton, 2006 – see chapter 2), we recorded responses to the fundamental frequencies of naturally produced vowels with steady or changing fundamental frequencies. We used a Fourier Analyzer to measure the energy in each response as the fundamental frequency changed over time (followed a 'trajectory').
When the frequency trajectory of a response can be predicted in advance, the Fourier Analyzer can provide an optimal estimate of the response energy along that trajectory. This is in contrast to traditional windowed signal processing techniques (e.g. the short-term Fast Fourier Transform), which assume that a response does not change its frequency within each window (is 'stationary'). With the Fourier Analyzer, significant responses were recorded in all of the subjects, and the average time required to elicit a significant response varied from 13 to 86 seconds. Other techniques have also been used to evaluate the fundamental response. Krishnan et al. (2004) recorded frequency-following responses to Mandarin Chinese tones with changing fundamental frequencies, using a short-term autocorrelation algorithm. Dajani and Picton (2005a) used a filterbank-based algorithm inspired by cochlear physiology to analyze responses to speech segments with changing fundamental frequencies. Both techniques can measure the frequency trajectory well, but neither accurately estimates the response energy.
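The point that rectification introduces energy at the envelope frequency can be illustrated with a short MATLAB demonstration (not from the thesis): a stimulus containing only harmonics at 300, 400 and 500 Hz has no spectral energy at 100 Hz, but its half-wave rectified version does.

% Illustrative demo: rectification creates energy at the 100-Hz envelope rate.
fs = 32000;  t = (0:1/fs:1-1/fs)';
x  = sin(2*pi*300*t) + sin(2*pi*400*t) + sin(2*pi*500*t);   % no 100-Hz component
r  = max(x, 0);                                             % crude model of hair-cell rectification
X0 = abs(fft(x))/numel(x);  Xr = abs(fft(r))/numel(r);
k  = 100 + 1;                                               % 1-Hz resolution: bin 101 = 100 Hz
fprintf('100-Hz component: original %.4f, rectified %.4f\n', X0(k), Xr(k));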
3.2.2 Responses to Harmonics
Although responses to the fundamental can be measured quickly and reliably, such responses provide limited information about the audibility of speech in different frequency ranges. Since all energy in voiced speech is amplitude modulated at the fundamental frequency, a response at the fundamental could be mediated by audible speech information at any frequency, and thus any place on the basilar membrane.
In order to measure place-specific responses to speech, it might be best to record responses directly to the harmonics of the fundamental frequency. Using the Fourier Analyzer (Aiken and Picton, 2006), we recorded significant responses to the second and third harmonics of vowels with steady and changing fundamental frequencies. We did not measure responses to higher harmonics, due to the limited bandwidth of the electroencephalographic recording (1-300 Hz). Krishnan (2002) recorded wide-band responses to synthetic vowels with formant frequencies below 1500 Hz (i.e. back vowels) using a Fast Fourier Transform. Since the frequencies in the synthesized stimuli were stationary, the Fast Fourier Transform would have provided an optimal estimate of the response energy in each frequency bin. Responses were detected at harmonics close to formant peaks, and at several low-frequency harmonics distant from formant peaks, but not at the vowel fundamental frequency. In this study, half of the responses were recorded to a polarity-inverted stimulus, and the final result was derived by subtracting the responses obtained in one polarity from the responses obtained in the opposite polarity. This subtractive approach (see also Greenberg et al., 1987; Huis in't Veld et al., 1977) is analogous to the compound histogram technique used in neurophysiologic studies (Anderson et al., 1971; Goblick et al., 1969). Its rationale stems from the effects of half-wave rectification involved in inner hair cell transduction (Brugge et al., 1969). Discharges only occur during the rarefaction phase of the stimulus. If the polarity of the stimulus is inverted, the discharges to the rarefaction phase of the inverted stimulus now occur during the condensation phase of the initial stimulus. Subtracting the period histogram of this inverted stimulus from the period histogram of the non-inverted stimulus cancels the rectification-related distortion, and the discharge pattern corresponds to the stimulating waveform. Scalp-recorded frequency-following responses reflect the activity of synchronized neuronal discharges, so it is reasonable to apply the compound histogram technique to these data. This approach shows different results for envelope and spectral FFRs. By subtracting responses to alternate stimulus polarities, the alternate rectified responses to the stimulus are combined to produce non-rectified analogues of stimulus components (inasmuch as the neural system is able to phase-lock to those components). Subtracting responses to alternate polarities thus removes distortions associated with half-wave rectification (e.g. the energy at the envelope) that exist in the neural response. Using the subtractive procedure, Krishnan (2002) found responses at prominent stimulus harmonics, but not at the envelope frequency. In contrast, when this
subtractive procedure has not been used, robust responses have been recorded at the envelope frequency (Aiken and Picton, 2006; Krishnan et al., 2004; Greenberg et al., 1987). An alternate technique that has been used to analyze FFR is to add responses recorded to alternate-polarity stimuli (e.g. Johnson et al., 2005; Small and Stapells, 2005). This technique is generally employed to eliminate the cochlear microphonic or residual artifact from the stimulus, and to preserve the envelope FFR. Summing alternate responses cancels the cochlear microphonic and artifact, leaving the envelope FFR. The downside of this approach is that it also cancels the spectral FFR.
3.2.3 Relationship between Harmonics and Formants
Formants and formant trajectories carry information that is essential for speech sound identification. The lowest two or three formants convey enough information to identify vowels, and to specify the consonant place of articulation (Liberman et al., 1954). Formants correspond to peaks in the spectral shape, and not to specific harmonics (for a review, see Rosner and Pickering, 1994). The vocal tract can be characterized as a filter that shapes the output from the glottal source (Fant, 1970), with the peaks of the spectrum (i.e. formants) corresponding to the poles of the filter. These peaks often do not coincide with harmonics, which are multiples of the glottal source frequency, but formants are likely conveyed by the relative amplitudes of the underlying harmonics. Voigt et al. (1982) recorded auditory nerve responses to noise-source vowels (i.e. vowels with no harmonics), and found that the Fourier transforms of interval histograms had large frequency components corresponding to the peaks of the formants. However, this temporal encoding would not likely produce measurable responses at the scalp, since the temporal intervals would have occurred at random phases in the absence of a synchronizing stimulus. The frequency-following response requires synchronized neural activity. FFRs recorded to formant-related harmonics could be used to assess the audibility of formants and formant trajectories. For example, responses recorded at harmonics related to the first and second formant would indicate that the formant peaks had been neurally encoded, and that the information was likely available for the development of the phonemic inventory. Krishnan (2002) recorded frequency-following responses to synthetic vowels with steady frequencies,
where the peaks of the first and second formants were not multiples of f0. In this study, responses were found at harmonics related to the first two formants. Plyler and Ananthanarayan (2001) found that the frequency-following response was able to represent second-formant transitions in synthetic consonant-vowel pairs. FFRs to harmonics may thus provide useful information about speech encoding. In the present study, we recorded wideband responses to several vowels in alternate polarities and analyzed both the additive and subtractive averages with the same dataset. We also recorded two responses in the same polarity, so that we could compare the added and subtracted alternate-polarity averages to a constant-polarity average calculated across the same number of responses. We hypothesized that the average response to the constant-polarity stimuli would have energy at prominent stimulus harmonics (i.e. near formant peaks) as well as energy corresponding to the stimulus envelope. We further hypothesized that the harmonic pattern in the stimulus would be displayed in the subtractive average, and that the stimulus envelope would be displayed in the additive average. We hypothesized that we would be able to obtain reliable individual subject responses to the fundamental and stimulus harmonics up to approximately 1500 Hz (the upper limit for recording frequency-following responses; Krishnan, 2002; Moushegian et al., 1973).
3.3 Methods
3.3.1 Subjects
Seven women (ages 20-30) and three men (ages 23-30) were recruited internally at the Rotman Research Institute. All subjects were right-handed, and had hearing thresholds that were 15 dB HL or better at octave frequencies from 250 to 8000 Hz. Nine subjects (6f/3m) participated in the two main experiments, and smaller numbers of subjects participated in subsidiary experiments to evaluate the recording montage (4f/1m) and to examine masking (2f/1m).
3.3.2 Stimuli
Two naturally produced vowels, /a/ (as in 'father') and /i/ (as in 'she'), were recorded from a male speaker in a double-walled sound-attenuating chamber. A Shure KSM-44 large-diaphragm cardioid microphone was placed approximately 3 inches from the mouth, with an analogue lowpass filter (-6 dB/octave above 200 Hz) employed to mitigate the proximity effect. The signal was digitized at 32 kHz with 24 bits of resolution using a Sound Devices USBpre™ digitizer, and saved to hard disk using Adobe Audition™. Two tokens of /a/ and two tokens of /i/ were selected from portions of the recordings where vocal amplitude was steady. Each token was manually trimmed to be close to 1.5 s long, with onsets and offsets placed at zero-crossings spaced by integer multiples of the pitch period. This made each token suitable for continuous play, with no audible discontinuities between successive stimulus iterations. Each token was then resampled at a rate slightly higher or lower than 32 kHz in order to give exactly 48000 samples over 1.5 s. This resampling introduced a slight pitch shift, but this was less than ±1 Hz. Stimuli were then bandpass filtered between 20 and 6000 Hz with a 1000-point finite impulse response filter having no phase delay. Stimuli of reversed polarity were obtained by multiplying the stimulus by -1. Thus, there were in total eight stimuli – two vowels, two tokens and two polarities. These were named a1+, a1–, a2+, a2–, i1+, i1–, i2+, i2–.
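A sketch of this preparation pipeline is given below. The file name is hypothetical, zero-phase filtering with filtfilt is only one way of realizing a filter "having no phase delay," and the interpolation step is one plausible way of producing exactly 48000 samples; the cutoff frequencies are those stated above.

% Illustrative stimulus preparation (assumed file name and helper choices).
[x, fsIn] = audioread('a1_raw.wav');                % token recorded at 32 kHz
N  = numel(x);
x  = interp1((1:N)', x, linspace(1, N, 48000)', 'spline');  % exactly 48000 samples per 1.5 s
b  = fir1(1000, [20 6000]/(32000/2));               % 1000th-order band-pass FIR, 20-6000 Hz
x  = filtfilt(b, 1, x);                             % zero-phase ("no delay") filtering
xInv = -x;                                          % opposite-polarity version of the token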
An LPC (linear predictive coding) analysis was conducted in order to determine the formant structure of each vowel. Since formant structure can be estimated more easily after removing the low-pass characteristic of speech, the spectrum of each token (x) was pre-emphasized (or 'whitened') using the following equation:

y[n] = x[n] − a·x[n − 1]

where a (.90 for the /a/ tokens and .94 for the /i/ tokens) was calculated by conducting a 1st-order linear predictive coding (LPC) analysis on each of the tokens (the 1st-order LPC providing an estimate of spectral tilt). The spectral shape of the pre-emphasized tokens was then estimated by calculating 34th-order LPC coefficients. Figure 3.1 shows the spectra of the /a/ and /i/ stimuli, as calculated using the Fourier Analyzer (solid line), as well as the spectral shape of the vowels, as calculated via the 34th-order LPC analysis (dotted line). The location of the formant peaks and the closest harmonic are given in Table 3.1.
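These pre-emphasis and LPC steps map onto standard MATLAB Signal Processing Toolbox calls; the sketch below is illustrative, with variable names chosen for clarity rather than taken from the study.

% Illustrative pre-emphasis and LPC spectral-shape estimate for one token x.
a1    = lpc(x, 1);                                  % 1st-order LPC: spectral tilt estimate
aPre  = -a1(2);                                     % pre-emphasis coefficient (about .90 for /a/)
y     = filter([1 -aPre], 1, x);                    % y[n] = x[n] - a*x[n-1]
A     = lpc(y, 34);                                 % 34th-order LPC coefficients
[H, f] = freqz(1, A, 1024, 32000);                  % smoothed spectral shape (formant envelope)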
Figure 3.1. Spectra of the /a/ and /i/ vowels, as calculated using the Fourier analyzer (solid line) as well as the spectral shape of the vowels, as calculated via a 34th-order LPC analysis (dotted line). The reference signals of the Fourier analyzer followed the f0 trajectory, and are plotted with respect to the average frequency in each reference trajectory.
Vowel   First Formant Peak   Closest Harmonic Frequency   Second Formant Peak   Closest Harmonic Frequency
/a/     937                  960 (f9)                     1408                  1387 (f13)
/i/     229                  244 (f2)                     2613                  2562 (f21)

Table 3.1. Frequencies (Hz) of formants and harmonics.
Stimulus presentation was controlled by a version of the MASTER software (John and Picton, 2000) modified to present external stimuli. The digital stimuli were converted to analogue form at a rate of 32 kHz, routed through a GSI 16 audiometer, and presented monaurally with an EAR-Tone 3A insert earphone in the right ear. The left ear was occluded with a foam EAR earplug. All stimuli were scaled to produce a level of 60 dBA in a 2-cm³ coupler.
3.3.3 Procedure
The first experiment examined the responses to natural vowels of the same or opposite polarity. Each 1.5-second stimulus was presented continuously for 75 seconds, corresponding to 50 iterations (with no time delay between successive presentations). This process was repeated 4 times per block, with results averaged offline to provide a single 5-minute (200-sweep) average. Each of the /i/ and /a/ tokens was presented twice in the same polarity, and once in the opposite polarity. The second experiment investigated three possible sources for the responses – electrical artifact, brainstem, and cochlear microphonic. To ensure that the responses were not contaminated by electrical artifact, responses were recorded to the first /a/ token routed to an insert earphone that was not coupled to the ear. Since the subject's ears were occluded during the experiment, this rendered the stimulus inaudible. The transducer of this insert earphone was in the same location as when it was connected to the ear canal. We then recorded responses to the first /a/ token between electrodes at the right and left mastoids, in order to increase the sensitivity of the recording to horizontally aligned dipoles (e.g. related to activity in the cochlea, auditory nerve or lower brainstem). In a third condition, we attempted to determine whether any part of the response could reflect the cochlear microphonic, by recording responses (using the vertical Cz-to-nape montage) to the first /a/ token in the presence of speech-shaped masking noise. Masking eliminates neural responses without eliminating the cochlear microphonic, so any response recorded in the presence of an effective masker would likely be cochlear microphonic. The minimum effective masking level was determined for each subject by testing whether the /a/ token (a1+) could be detected while speech-shaped noise was being played. The noise was first presented at 50 dB HL, with the level
raised by 5 dB after two correct behavioral responses. This process was repeated until the subject could no longer detect the vowel. During the electroencephalographic recording, the noise was presented 5 dB above each subject's minimum effective masking level.
3.3.4 Recordings
Electroencephalographic recordings were made while subjects relaxed in a reclining chair in a double-walled sound-attenuating chamber. Subjects were encouraged to sleep during the recording. Responses were recorded between gold disc electrodes at the vertex and the mid-posterior neck for all conditions except the horizontal condition in the third experiment. For this condition, responses were recorded between the right and left mastoids. A ground electrode was placed on the left clavicle. Inter-electrode impedances were maintained below 5 kΩ for all recordings. Responses were preamplified and filtered between 30 Hz and 3 kHz with a Grass LP511 AC amplifier and digitized at 8 kHz by a National Instruments E-Series data acquisition card. Prior to analysis, the recordings were averaged in the following way. For each subject and vowel, three different averages were calculated. A ++ average was obtained by averaging all 4 responses to the original stimulus (e.g. two presentations of the a1+ token and two of the a2+ token). A +– average was obtained by averaging the first 2 responses to the original stimulus (e.g. one presentation of the a1+ token and one of the a2+ token) together with the 2 responses to the inverted stimulus (e.g. the a1– and a2– tokens). A –– average was then obtained by subtracting the 2 responses to the inverted stimulus from the 2 responses to the original stimulus. In this nomenclature, the first sign indicates the operation (addition or subtraction) and the second sign indicates whether the second pair of responses was evoked by the original or the inverted stimulus; the first pair is always the response to the original stimulus. Table 3.2 summarizes these procedures. For each type of average, grand mean responses were obtained by averaging the responses of all subjects together.
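A schematic of these three averages (illustrative Python; the arrays are placeholders standing in for the individual averaged recordings):

```python
import numpy as np

n = 12000                       # samples per averaged sweep (placeholder length)
rng = np.random.default_rng(0)

# Hypothetical averaged responses: two recordings to each original-polarity
# token and one recording to each inverted-polarity token.
a1_plus = [rng.standard_normal(n), rng.standard_normal(n)]
a2_plus = [rng.standard_normal(n), rng.standard_normal(n)]
a1_minus = rng.standard_normal(n)
a2_minus = rng.standard_normal(n)

# "++": average all four responses to the original stimulus.
plus_plus = np.mean(a1_plus + a2_plus, axis=0)

# "+-": average two original-polarity responses with the two
# inverted-polarity responses (the envelope FFR survives).
plus_minus = np.mean([a1_plus[0], a2_plus[0], a1_minus, a2_minus], axis=0)

# "--": subtract the inverted-polarity responses from the original-polarity
# responses and divide by the total number of responses (spectral FFR,
# cochlear microphonic, and stimulus artifact survive).
minus_minus = (a1_plus[0] + a2_plus[0] - a1_minus - a2_minus) / 4
```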
Response   Derivation                                                                                                                                         Components
++         Average together all responses to the original stimulus                                                                                            Envelope FFR, Spectral FFR, Cochlear Microphonic, Stimulus Artifact
+–         Average together an equal number of responses to the original stimulus and responses to the inverted stimulus                                      Envelope FFR
––         Subtract responses to the inverted stimulus from an equal number of responses to the original stimulus and divide by the total number of responses Spectral FFR, Cochlear Microphonic, Stimulus Artifact

Table 3.2. Average response nomenclature.
3.3.5 Analysis
3.3.5.1 Natural Vowels
The energy in voiced speech is concentrated at the fundamental frequency (f0) – equal to the rate of vocal fold vibration – and at its harmonics, which are integer multiples of f0. The harmonics are labeled with a subscript that corresponds to the harmonic number. For example, when f0 is 100 Hz, f2 is 200 Hz. When it is present, f1 is equal to f0, although f0 can be perceived in the absence of any actual energy at f1. The fundamental frequency and harmonics of natural speech vary across time. The rate of f0 variation in a steady naturally produced vowel can be as high as 50 Hz/s (see Figure 2.4, bottom right). The response to the speech f0 precisely mirrors its frequency changes (Aiken and Picton, 2006; Krishnan et al., 2004), so responses to natural speech cannot be accurately analyzed with techniques that require a stationary signal. The stimuli and responses were therefore analyzed using a Fourier analyzer. Unlike the Fast Fourier Transform (FFT), which calculates energy in static frequency bins, a Fourier analyzer calculates energy in relation to a set of reference signals, which need not be static. Figure 3.2 shows the spectrum of the first /a/ stimulus as calculated using the FFT and as calculated using the Fourier analyzer. Both analyses were conducted with a resolution of 2 Hz, but the reference signals of the Fourier analyzer were constructed to follow the f0 trajectory of the speech. For the Fourier analyzer, data are plotted relative to the mean frequency in each reference trajectory. Note that harmonic amplitudes were much greater when the analysis was conducted with the Fourier analyzer, indicating that the FFT underestimated these amplitudes. The Fourier analyzer was used to quantify the amplitude of the response along the trajectory of f0 and 23 of its harmonics (f2-f24). The same analyzer was used to quantify response amplitude along 16 frequency trajectories adjacent to each of the harmonics (i.e. 8 above and 8 below). Adjacent trajectories were separated by 2 Hz, so the highest and lowest trajectories were 16 Hz above and below each harmonic, respectively. The 16 adjacent trajectories were used to quantify non-stimulus-locked electrophysiologic activity, which was considered to be electrophysiologic 'noise' for the purpose of statistical testing.
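The following sketch (illustrative Python with a placeholder response and f0 trajectory) shows the core of such a Fourier analyzer – projecting the response onto reference sinusoids whose phase follows the f0 trajectory – together with the comparison against adjacent 'noise' trajectories used for the F test described below.

```python
import numpy as np

fs = 8000                                    # EEG sampling rate (Hz)
response = np.random.randn(8 * fs)           # placeholder averaged response
# Placeholder f0 trajectory (Hz) at each sample of the response.
f0_track = 100 + 5 * np.sin(2 * np.pi * 0.5 * np.arange(len(response)) / fs)

def fourier_analyzer(x, freq_track, fs):
    """Amplitude of x along a (possibly time-varying) frequency trajectory.
    The reference phase is the running integral of the instantaneous
    frequency, so the references follow the trajectory exactly."""
    phase = 2 * np.pi * np.cumsum(freq_track) / fs
    a = 2 * np.mean(x * np.cos(phase))       # in-phase component
    b = 2 * np.mean(x * np.sin(phase))       # quadrature component
    return np.hypot(a, b)

harmonic = 9                                 # e.g. the harmonic nearest the /a/ first formant
signal_amp = fourier_analyzer(response, harmonic * f0_track, fs)

# Noise estimate: 16 adjacent trajectories, 2-16 Hz above and below.
offsets = [d for d in range(-16, 17, 2) if d != 0]
noise_amps = [fourier_analyzer(response, harmonic * f0_track + d, fs)
              for d in offsets]

# F ratio: power along the harmonic trajectory relative to the mean power
# along the adjacent trajectories (typically evaluated against an
# F distribution with 2 and 2*16 degrees of freedom).
F = signal_amp ** 2 / np.mean(np.square(noise_amps))
```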
Figure 3.2. Spectra of the first /a/ token calculated with a Fast Fourier transform (right) and with a Fourier analyzer (left). The resolution of each analysis was 2 Hz, but the reference signals of the Fourier analyzer were constructed to follow the f0 trajectory of the vowel. Fourier analyzer data are plotted with respect to the average frequency of each reference trajectory.
Reference frequency tracks at each trajectory were created as described in section 2.2.3, and the digital Fourier analyzer was implemented as described in section 2.2.4. Frequency tracks at each higher harmonic fi were created by multiplying the f0 frequency track by each integer between 2 and 24. Adjacent frequency tracks were created by transposing each track by the appropriate number of Hz. The significance of the response at each harmonic was evaluated by comparing the power of the response along the harmonic's trajectory with the power of the response along the adjacent trajectories, using an F statistic (Zurek, 1992; Dobie and Wilson, 1993; Lins et al., 1996). An alpha criterion of 0.05 was selected for all analyses. A Bonferroni correction was applied to account for the 24 significance tests (1 per harmonic) involved in each analysis, so the F statistic was accepted as significant for the grand mean recordings at p < .002 (i.e. .05/24).

The temporal envelope of each sentence was calculated by computing the magnitude of the Hilbert transform (which provides the instantaneous amplitude of the signal). It was then low-pass filtered at 100 Hz and re-sampled at 250 Hz (the sample rate of the EEG response). The result was filtered between 2 Hz (8 dB/octave) and 20 Hz (> 100 dB/octave), using a zero-phase finite impulse response filter. Since psychophysical and electrophysiological responses generally vary in proportion to the log of stimulus magnitude, each envelope was transformed by taking 20 times the base-10 logarithm of the envelope. Figure 4.1 illustrates this process.
Figure 4.1. Calculation of the log envelope. Top: Waveform of the first sentence root ('To find the body, they had to drain'). Second: Temporal envelope of the sentence, calculated as the magnitude of the complex analytic signal produced by the Hilbert transform. Third: The logarithm of the envelope. Bottom: The log envelope after filtering it between 2 and 20 Hz.
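A compact sketch of this envelope extraction (Python with a placeholder waveform; the asymmetric 2-20 Hz filter slopes reported in the text are approximated by a single symmetric FIR, and the log transform is applied before the 2-20 Hz filtering, following the order shown in Figure 4.1):

```python
import numpy as np
from scipy.signal import hilbert, firwin, filtfilt, resample_poly

fs = 32000                                   # audio sampling rate (Hz)
fs_eeg = 250                                 # EEG sampling rate (Hz)
sentence = np.random.randn(3 * fs)           # placeholder sentence waveform

# Instantaneous amplitude: magnitude of the analytic signal.
envelope = np.abs(hilbert(sentence))

# Low-pass at 100 Hz, then resample to the EEG rate (250 Hz).
b_lp = firwin(1001, 100, fs=fs)
envelope = filtfilt(b_lp, [1.0], envelope)
envelope = resample_poly(envelope, fs_eeg, fs)

# Log transform (dB-like): 20 * log10 of the envelope; a small floor
# avoids taking the log of zero in silent stretches.
log_env = 20 * np.log10(np.maximum(envelope, 1e-6))

# Band-pass 2-20 Hz with a zero-phase FIR (forward-backward filtering).
b_bp = firwin(201, [2, 20], pass_zero=False, fs=fs_eeg)
log_env = filtfilt(b_bp, [1.0], log_env)
```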
4.3.4 Procedure
A Tucker-Davis Technologies (TDT) RP-2.1 system controlled by MATLAB (Mathworks, Natick, MA) was used to present stimuli to the participant and send triggers to a Neuroscan SynAmps data acquisition system. The acoustic output of the TDT system was routed through a GSI 16 audiometer, and presented binaurally via ER-3A insert earphones. Sentences were presented so that their overall level was 60 dB SPL in a 2-cc coupler. Participants were seated comfortably in a double-walled sound-attenuating chamber and instructed to fix their gaze during stimulus presentation. One hundred and twenty sentences were presented in each of 5 blocks, in a fully randomized order. Each block included 10 iterations of 1 congruent and 1 incongruent completion of each of the 6 sentence roots, so that each sentence root was heard 100 times during the experiment. Participants were instructed to indicate whether each sentence 'made sense' or 'did not make sense' by pressing one of two response buttons. The interval between the offset of one sentence and the beginning of the next was fixed at 2 seconds. Each block took approximately 12 minutes to complete.
4.3.5 Recordings
Participants were fitted with an electrode cap with 56 Ag/AgCl electrodes arranged in accordance with the International 10-20 system. The cap electrodes were FP1, FPz, FP2, AF7, AF3, AFz (ground), AF4, AF8, F7, F5, F3, F1, Fz, F2, F4, F6, F8, FC5, FC1, FCz, FC2, FC6, T7, C5, C3, C1, Cz, C2, C4, C6, T8, TP7, CP5, CP1, CPz, CP2, CP6, TP8, P7, P5, P3, P1, Pz, P2, P4, P6, P8, CB1, PO3, POz, PO4, CB2, O1, Oz, O2, and Iz. Off-cap electrodes were placed on the left and right mastoids (TP9/TP10), and on the sides of the face (FT9, FT10, F9, F10). Eye movements were recorded with electrodes placed on the infra-orbital ridge (IO1/IO2) and at the left and right outer canthi (LO1/LO2). Inter-electrode impedances were kept below 5 kΩ. A Neuroscan SynAmps system was used to acquire the electrophysiologic response at a rate of 250 samples per second. The response was amplified 500 times (least significant bit = 0.168 µV) and filtered between 0.15 and 50 Hz. All recordings were referenced to Cz during recording, and transformed to an average reference offline.
Separate recordings of vertical and horizontal eye movements, as well as blinks, were obtained before and after the experiment, and used to derive ocular source components (Ille et al., 2002; Picton et al., 2000) that were removed from each recording in BESA 5.16 (Brain Electrical Source Analysis; Berg and Scherg, 1994). In addition, trials with electrical activity exceeding 200 µV in any electrode (except the peri-ocular electrodes) were excluded from the averaging. This rejection process excluded 4.6% of the trials. Recordings for each sentence root were averaged separately for each participant.
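The offline re-referencing and amplitude-based trial rejection can be sketched as follows (illustrative Python with placeholder epoch data and hypothetical channel indices; the ocular source component correction performed in BESA is not reproduced here):

```python
import numpy as np

# Hypothetical epoched data in microvolts: trials x channels x samples,
# recorded with a Cz reference.
epochs = np.random.randn(600, 56, 1500)

# Re-reference each trial to the average of all channels.
epochs_avg_ref = epochs - epochs.mean(axis=1, keepdims=True)

# Reject any trial whose absolute amplitude exceeds 200 microvolts in any
# channel other than the peri-ocular channels.
periocular = [0, 1, 2, 3]                     # hypothetical channel indices
keep_channels = np.setdiff1d(np.arange(epochs.shape[1]), periocular)
max_abs = np.abs(epochs_avg_ref[:, keep_channels, :]).max(axis=(1, 2))
accepted = epochs_avg_ref[max_abs <= 200.0]
```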
4.3.6 Source Analysis
Source analysis was based on the first 3 seconds of the grand average response to each sentence. This duration was short enough to exclude any cortical activity related to the button press (the shortest sentence root was 1.64 seconds long before the final word, which itself lasted approximately 1 second). Responses to each sentence were then concatenated to create a single grand average response to all of the sentence roots. Source analysis was conducted in BESA 5.16, using the residual variance and energy criteria. The residual variance criterion seeks to minimize the residual variance in a model by iteratively adjusting the location and orientation of equivalent dipoles. The energy criterion seeks to minimize the total amount of activity attributed to the sources. Electrodes immediately adjacent to the eyes (IO1, IO2, LO1, LO2) were excluded from the source analysis since the ocular source components technique did not completely remove all of the electrical activity caused by eye movements. Symmetrical regional sources were used to model the activity in each hemisphere, after filtering the responses between 2 and 20 Hz (12 dB/octave). Each regional source contains three orthogonal dipoles that can account for all of the activity attributed to a particular location. Since most of the regional-source activity was attributed to two of the three dipoles, each regional source was converted into two single equivalent dipoles with identical origins. This removed the orthogonality constraint inherent in the regional sources, allowing the dipole orientations to be adjusted to model the response more precisely while maintaining the symmetry between the hemispheres. The largest dipole source in each hemisphere was oriented vertically, and the second dipole source was oriented horizontally. Figure 4.2 shows the location and orientation of the dipoles. Sources were fit to the concatenated grand average response that
linked the responses (averaged over the ten subjects) to each of the 6 sentences. After finding the best fit for the concatenated grand average response, second and third fittings were calculated for the initial segment of the response (0-300 ms) and the remaining portion (300-3000 ms). These fits were not significantly different from the initial solution. The source dipoles calculated on the concatenated grand average response solution (Figure 4.2) were therefore adopted for the study. Source waveforms were computed for each source dipole, each sentence and each subject, and exported to MATLAB. Source waveforms show the changes over time in the current at each dipole source. For each subject and each sentence we therefore obtained four source waveforms, two (vertical and horizontal) in each hemisphere. Each source waveform was then filtered between 2 Hz (8 dB/octave) and 20 Hz (> 100 dB/octave), using a zero-phase finite impulse response filter. Figure 4.2 shows the source waveforms for a single sentence.
Figure 4.2. Top: Responses to the first sentence at the Cz and TP9 electrodes. The activity at the left mastoid (TP9) is largely the inverse of activity measured at Cz, reflecting the orientation and position of the dipole in the auditory cortex. Second: Diagram showing locations of horizontal and vertical dipoles in the sagittal, coronal, and transverse planes. Third: Source waveforms for the first sentence at the left and right vertical dipoles. Fourth: Source waveforms for the first sentence at the left and right horizontal dipoles.
4.3.7 Cross-Correlations
In order to determine whether responses were related to the stimulus envelope, source waveforms and sentence-root envelopes were compared using windowed cross-correlation. The windowed procedure allowed for the possibility that the correlations might change across different time periods within the sentence. A window of 500 ms was chosen as the shortest window that could include a full cycle of the lowest envelope frequency (2 Hz). The correlation between the stimulus envelope and the source waveform was computed in successive 500 ms windows beginning 250 ms before the onset of the sentence and continuing at 4 ms intervals through the sentence. Each 500 ms window of the stimulus waveform was correlated with 500 ms windows of the source waveform (response) occurring at delays ranging from 0 to 300 ms (again in 4 ms intervals). The delay at which the correlation was greatest (or least) could be assessed at each time in the stimulus envelope. The correlations were plotted using the “jet” color scale (see Figure 4.3), where warm colors (red and yellow) indicated a positive correlation and cool colors (light and dark blue) indicated a negative correlation. Correlations were plotted with respect to the centre of each window (in stimulus time) on the x axis, and with respect to the delay between the stimulus envelope and the response on the y axis. A consistent positive correlation between the stimulus envelope and the response at a particular delay would be represented by a horizontal band of warm color, with the vertical position of the band indicating the delay of the response.
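A bare-bones version of this windowed cross-correlation is sketched below (Python; 'envelope' and 'source' are placeholder arrays at the 250 Hz EEG rate, and the 250 ms pre-stimulus offset used in the study is omitted for simplicity):

```python
import numpy as np

fs = 250                          # sampling rate of envelope and source waveform
win = int(0.5 * fs)               # 500 ms window
step = 1                          # 4 ms step = one sample at 250 Hz
max_delay = int(0.3 * fs)         # response delays from 0 to 300 ms

def windowed_xcorr(envelope, source):
    """Correlation between each 500 ms window of the stimulus envelope and
    the source waveform at response delays of 0-300 ms."""
    n_windows = len(envelope) - win - max_delay
    corr = np.zeros((max_delay + 1, n_windows))
    for t in range(0, n_windows, step):
        env_win = envelope[t:t + win]
        for d in range(max_delay + 1):
            resp_win = source[t + d:t + d + win]
            corr[d, t] = np.corrcoef(env_win, resp_win)[0, 1]
    return corr                   # rows = delay (samples), columns = stimulus time

envelope = np.random.randn(5 * fs)   # placeholder 5 s log envelope
source = np.random.randn(5 * fs)     # placeholder source waveform
corr = windowed_xcorr(envelope, source)
```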
Figure 4.3. Top: The filtered log envelope of sentence 1 (red) and the average source response at the left vertical dipole (blue). The correlation between these two signals was calculated in 500-ms-wide windows at every sample point (i.e. 250 times per second). These correlations are plotted in the bottom row of the correlogram (bottom), using the 'jet' color scale. Two windows are marked. In the first window, the sentence envelope and the response appear to be negatively correlated. This is reflected in the correlogram with a blue dot. In the second window, the stimulus and response are neither negatively nor positively correlated. This is reflected with a green dot in the correlogram. Second: The same log envelope of the stimulus and the response, after shifting the response forward 180 ms (to account for the possibility that the response may have occurred 180 ms after the stimulus). A third window is marked. In this window, the envelope and the response are positively correlated. This is reflected in the correlogram with a red dot. Bottom: The correlogram showing the correlations between the envelope and response at each point in stimulus time (x axis), and at response delays from 0 to 300 ms (y axis). The colour scale is automatically adjusted to span from the minimum correlation (r = -.76) to the maximum correlation (r = .84).
Although the magnitude of the correlation might vary across time, any true correlation between the stimulus and response (or the transient model and the response) would need to be consistent over time. Spurious correlations between stimulus and response would not likely occur at a consistent delay. One way to evaluate the consistency of the correlations is to take the mean correlation over stimulus time at each response delay. The expected value of this mean for exclusively spurious correlations is zero. Figure 4.4 (upper right) shows the result of averaging the correlations at each response delay (i.e. in each row of the upper left plot). The 250 ms of the response immediately following the beginning of the sentence was excluded from this average, since this period was characterized by a large onset response. Note that after averaging, there is one large negative correlation at a response delay of 120 ms, and a larger positive correlation at 188 ms. This would suggest that the response followed the stimulus envelope at a delay of 188 ms, and followed the inverse of the envelope at a delay of 120 ms. The average correlation has been plotted in the second row of the figure with delay on the x-axis. The FFT of the average correlation gives the frequencies that are contributing to the correlations. This was performed after zero-padding the 300 ms waveform out to 1 second to improve the frequency resolution. The significance of the mean correlation can be assessed by determining the likelihood that the mean correlation could have occurred by chance. If the mean correlation exceeds more than 95% of the correlations that would be expected to occur by chance alone, it can be assumed to be significant at p < .05.
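The consistency measure and its spectrum can be sketched as follows (illustrative Python; 'corr' is a placeholder for the delay-by-time correlation matrix produced by the windowed cross-correlation above, and the chance distribution used for the significance test is not reproduced here):

```python
import numpy as np

fs = 250
# Placeholder delay-by-time correlation matrix (76 delays = 0-300 ms in
# 4 ms steps). In the study, the columns corresponding to the first 250 ms
# after sentence onset were excluded before averaging.
corr = np.random.randn(76, 600) * 0.1

# Consistency across the sentence: mean correlation at each response delay.
mean_corr = corr.mean(axis=1)
best_delay_ms = np.argmax(mean_corr) * 1000 / fs

# Frequencies contributing to the correlation: FFT of the 300 ms mean
# correlation function after zero-padding it out to 1 s (1 Hz resolution).
padded = np.zeros(fs)
padded[:len(mean_corr)] = mean_corr
spectrum = np.abs(np.fft.rfft(padded))
freqs = np.fft.rfftfreq(fs, d=1 / fs)
```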