Neural Bases of Talker Normalization

Patrick C. M. Wong, Howard C. Nusbaum, and Steven L. Small

University of Chicago

© 2004 Massachusetts Institute of Technology
Journal of Cognitive Neuroscience 16:7, pp. 1–13

Abstract: To recognize phonemes across variation in talkers, listeners can use information about vocal characteristics, a process referred to as "talker normalization." The present study investigates the cortical mechanisms underlying talker normalization using fMRI. Listeners recognized target words presented in either a spoken list produced by a single talker or a mix of different talkers. It was found that both conditions activate an extensive cortical network. However, recognizing words in the mixed-talker condition, relative to the blocked-talker condition, activated middle/superior temporal and superior parietal regions to a greater degree. This temporal–parietal network is possibly associated with selectively attending to and processing the spectral and spatial acoustic cues required to recognize speech in a mixed-talker condition.

INTRODUCTION

There is a many-to-many mapping between acoustic patterns and linguistic categories that forms a core theoretical problem for theories of speech perception (e.g., Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). Given any particular acoustic pattern, there may be several different linguistic interpretations, and theories must explain how listeners identify the intended interpretation. This lack of invariance between acoustic information and phonemes arises during speech production from the influence of surrounding phonemes, variations in speaking rate, and differences among talkers. For example, in the classic study by Peterson and Barney (1952), recordings of American English vowels produced by adult men, adult women, and children yielded considerable overlap between the categories /ɪ/ (as in "bit") and /ɛ/ (as in "bet") in a space whose axes corresponded to the first and second resonance frequencies of the vocal tract, or formant frequencies (F1 × F2). But in spite of the acoustic overlap in the production of different phonetic categories, listeners are quite accurate in maintaining phonetic constancy across talkers (Strange, Jenkins, & Johnson, 1983). This is no trivial problem, as can be seen in the fact that most computer speech recognition systems have to be "trained" on the acoustic properties of a particular talker's voice to achieve much accuracy in recognition. How do human listeners solve this problem? Although the mechanism is not understood, over 40 years of research has begun to shape a picture of the characteristics of the process that mediates perception of speech across talker differences. For example, research has
shown that listeners can use information about the vocal characteristics of specific talkers to help resolve potential acoustic–phonetic ambiguities, a process referred to as "talker normalization." In an early study, Ladefoged and Broadbent (1957) had listeners identify synthetic /bVt/ words presented at the end of the precursor sentence "Please say what this word is." Six versions of the precursor sentence were synthesized, each with a different range of F1 and F2, corresponding to different talkers. Identification of the target vowel was found to vary according to the formant values in the carrier sentence. For example, because the vowel /ɪ/ has a lower F1 than /ɛ/, a vowel that was perceived as /ɪ/ 87% of the time when the precursor had a relatively high F1 was perceived as /ɛ/ 90% of the time when the precursor sentence had a relatively low F1. This supports one class of mechanism (referred to by Nearey, 1989, as extrinsic normalization) in which cues derived from preceding context may help constrain interpretations of a target utterance (e.g., see Lieberman, 1973; Gerstman, 1968; Joos, 1948). The alternative (called intrinsic normalization by Nearey, 1989) suggests that each utterance is self-normalizing by virtue of vocal characteristics encoded into the utterance itself (e.g., Miller & Dexter, 1988; Syrdal & Gopal, 1986). While talker normalization has been observed for different units of speech, including vowels (Strange, Verbrugge, Shankweiler, & Edman, 1976), consonants (Johnson, 1991; Summerfield, 1981; Rand, 1971), whole words (Mullennix, Pisoni, & Martin, 1989; Creelman, 1957), and lexical tones (Wong & Diehl, 2003), almost nothing is known about its neural substrates. Most of the research that has addressed issues in talker normalization has traditionally focused on determining the kinds of cues that listeners use during talker normalization (e.g., Nearey, 1989; Rand, 1971). Little research has
specifically addressed the mechanisms that mediate this processing (although see Nusbaum & Morin, 1992; Nusbaum & Magnuson, 1997), much less the cortical mechanisms that subserve normalization. Nevertheless, the cortical mechanisms in talker normalization should depend on how listening to variation in talkers may be different from listening to a single talker. If speech recognition is a process of matching the perceived speech signal to a phonetic category by referencing the acoustic space of the talker (e.g., Syrdal & Gopal, 1986), listening to multiple talkers would require calibration every time a talker shift occurs (see Nearey, 1989). This process might involve cortical regions traditionally viewed as speech perception regions, such as left inferior frontal and middle and superior temporal regions (e.g., see Zatorre, Meyer, Gjedde, & Evans, 1996). More importantly, this process could entail "additional" computational demands in a mixed-talker condition (see Nusbaum & Schwab, 1986; Nusbaum & Magnuson, 1997). An increase in cognitive or computational demands might be accompanied by an increase in neural activity, reflected in increased metabolic activity, hemodynamic responses, or neuron firing rates (Just, Carpenter, Keller, Eddy, & Thulborn, 1996; Lecas, 1995). Syrdal and Gopal suggested that phonetic constancy is maintained by estimating the distance between spectral peaks (e.g., F0–F1, or F3–F2). Being sensitive to wide-band frequencies, regions adjacent to the transverse temporal cortex in the superior portion of the temporal lobe could be especially suitable for such processing (Rauschecker, Tian, & Hauser, 1995). To accomplish talker normalization, cognitive resources such as working memory and selective attention may be needed when listening to variation in talkers. For example, Nusbaum and Morin (1992) found that increasing cognitive load only interacts with the speed of speech perception when there is talker variability. In their study, listeners were presented with single syllables in mixed- or blocked-talker conditions and were asked to rapidly recognize a spoken target. In both mixed- and blocked-talker conditions, listeners were required to perform a secondary task, maintaining a list of visually presented numbers in memory (cf. Baddeley, 1992; Logan, 1979). For half of the trials, the secondary memory load involved recalling one two-digit number (low cognitive load); for the other half, the memory load consisted of three two-digit numbers (high cognitive load). This variation in cognitive load did not affect recognition times in the blocked-talker condition. However, listeners were significantly slower in target recognition in the mixed-talker condition when the secondary task involved a high cognitive load relative to a low cognitive load. This interaction between memory load and talker variability suggests that achieving phonetic constancy across talkers requires working memory (cf. Navon & Gopher, 1979).
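As a concrete illustration of the spectral-peak-distance idea mentioned above (Syrdal & Gopal, 1986), the sketch below converts formant frequencies to the bark scale and represents a vowel by its F1–F0 and F3–F2 differences rather than by absolute frequencies. The formant values, the particular bark approximation, and the function names are illustrative assumptions, not materials or computations from the present study.

```python
def hz_to_bark(f_hz: float) -> float:
    # One common approximation of the Hz-to-bark conversion (Traunmuller, 1990).
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def intrinsic_features(f0: float, f1: float, f2: float, f3: float) -> tuple:
    # Represent a vowel by bark-scaled distances between spectral peaks.
    b0, b1, b2, b3 = (hz_to_bark(f) for f in (f0, f1, f2, f3))
    return (b1 - b0, b3 - b2)

# Hypothetical productions of the "same" vowel by a lower- and a higher-pitched talker:
talker_a = intrinsic_features(f0=120, f1=400, f2=1900, f3=2500)
talker_b = intrinsic_features(f0=220, f1=480, f2=2300, f3=3000)
print(talker_a, talker_b)  # the distance representation is less talker-dependent than raw Hz
```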


One interpretation of this interaction is that working memory may be needed when there is increased uncertainty about the phonetic interpretation of an acoustic signal (Nusbaum & Schwab, 1986). When there is talker variability, there may be multiple phonetic interpretations of any particular acoustic pattern. Listeners may have to hold the alternatives in working memory while shifting attention to test among the alternatives (Nusbaum & Morin, 1992). This phonological working memory could involve "passive" storage, with neural contribution from the supramarginal gyrus, and the "active" subvocal rehearsal region, involving Broca's area (Paulesu, Frith, & Frackowiak, 1993; Baddeley, 1992). In order to resolve the phonological uncertainty produced by talker variability, there may be increased attentional processing (Nusbaum & Schwab, 1986; Nusbaum & Morin, 1992), particularly in terms of shifting attention to the new talker and attending to the vocal characteristics of the new talker. Nusbaum and Morin found that listeners shifted attention to different acoustic cues (F0 and formants higher than F3) when there were multiple talkers, consistent with predictions of some intrinsic theories of talker normalization (e.g., Syrdal & Gopal, 1986; Miller, 1962). Furthermore, whenever multiple talkers are present in the real world, their voices are typically associated with different spatial locations. It seems plausible that, at least in terms of the evolution of a mechanism for shifting attention among talkers in order to recognize speech (e.g., Nusbaum & Morin, 1992), such a mechanism could be physiologically linked to mechanisms of shifting attention in space (e.g., LaBerge, 1998; Posner, 1994). Thus, shifts of attention and increases in attentional processing might involve circuitry known to be involved in selective attention (e.g., the anterior cingulate cortex), spatial attention (e.g., the parietal cortex) (Fernandez-Duque & Posner, 2001), and/or sound localization (e.g., the superior parietal cortex) (Zatorre, Mondor, & Evans, 1999). The present study investigates the cortical mechanisms underlying talker normalization using functional magnetic resonance imaging (fMRI). Listeners recognized spoken target words presented in either a spoken list produced by a single talker or a mix of different talkers (Figure 1 shows the experimental paradigm; see Methods for details). Beyond the variability in talkers, the two conditions should not differ along any known parameters of task difficulty, memory requirements, attentional demands, or processing requirements. In fact, the target stimuli are identical and the distractors are phonetically matched across conditions. For example, although there may be more acoustic variability in the mixed-talker condition over and above the phonetic variation in the blocked-talker case, previous research has shown that simple acoustic variation in signal amplitude of spoken words does not produce the same behavioral results as talker variation and a constant
difference in the F0 of syllables in a mixed-F0 trial only produces normalization effects when listeners expect the pitch difference to signal a talker difference (e.g., Nusbaum & Magnuson, 1997). Thus, if listening to multiple talkers differs from listening to a single talker in aspects such as talker calibration and spatial attention, as discussed, cortical regions subserving these processes should be activated when talker variability is present, but not when such variability is lacking. As described, the multiple-talker case might involve not only traditional speech areas such as the superior temporal and inferior frontal gyri in the left hemisphere, but also areas associated with attention such as the cingulate and parietal regions.

Figure 1. Experimental paradigm.

RESULTS

Behavioral Results

The behavioral results replicated a number of previous studies showing that speech identification is worse when there is talker variability (e.g., Nusbaum & Morin, 1992; Mullennix & Pisoni, 1990; Creelman, 1957). A repeated-measures ANOVA showed that listeners identified the target words significantly more accurately in the blocked-talker condition (92.00%) than in the mixed-talker condition (87.79%) [F(1,10) = 17.60, p < .05]. Target recognition was also significantly faster in the blocked-talker condition (529.12 msec) than in the mixed-talker condition (553.39 msec) [F(1,10) = 30.34, p < .05]. These results replicate the basic finding reported by Nusbaum and Morin (1992) that talker variability slows speech perception, and they also show an additional effect on accuracy.
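With only two within-subject conditions and 11 listeners, the repeated-measures F values reported here are equivalent to squared paired t statistics, so the analysis can be sketched in a few lines. The per-subject numbers below are fabricated placeholders used only to show the computation; they are not the study's data.

```python
import numpy as np
from scipy import stats

# Hypothetical mean reaction times (msec), one value per listener per condition.
rt_blocked = np.array([512, 535, 528, 541, 519, 547, 522, 530, 538, 524, 533])
rt_mixed   = np.array([539, 561, 550, 566, 543, 572, 548, 557, 563, 549, 560])

# With two conditions, the repeated-measures ANOVA reduces to a paired t test,
# and the reported F(1,10) equals t(10) squared.
t, p = stats.ttest_rel(rt_mixed, rt_blocked)
print(f"t(10) = {t:.2f}, F(1,10) = {t**2:.2f}, p = {p:.4f}")
```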

In the previous study reported by Nusbaum and Morin, speech was presented in a quiet listening environment, while in the present study the stimuli were presented in a high-noise MRI environment (delivered to the ears at about 85–90 dB SPL). The additional accuracy effect may be due to the fact that listening in a noisy environment impairs speech recognition more in mixed-talker situations than in blocked-talker situations (Mullennix et al., 1989).

Imaging Results

As discussed in the Introduction, talker normalization may involve a number of cortical regions because of the possible computational and auditory demands in overcoming acoustic ambiguity. These regions of interest are listed in Table 1. For each region, voxels with percent signal change above threshold were averaged together (p < .00001; determined by a randomization procedure, Ward, 2000; see Methods). Table 1 shows the average percent signal change for each region of interest in each experimental condition (pooled across subjects). Numerous regions were activated in both the mixed- and blocked-talker conditions. Regions of the strongest activation include the middle/superior temporal region (Note 1) and the inferior frontal gyrus, regions traditionally understood as important for speech processing. However, a large number of other regions were also active. These additional regions of activation include those in the frontal lobe, parietal lobe, transverse temporal gyrus (TTG), postcentral and precentral gyri, and the cingulate cortex.


Table 1. Percent Signal Change in Each Experimental Condition for Each Region of Interest

Region of interest                 Hemisphere   Mixed, % signal change (SE)   Blocked, % signal change (SE)
Angular gyrus                      L            0.0206 (0.0074)               0.017 (0.0068)
                                   R            0.0215 (0.0078)               0.0135 (0.0053)
Anterior cingulate cortex          L            0.0195 (0.0046)               0.0147 (0.0049)
                                   R            0.0215 (0.0078)               0.0135 (0.0053)
Inferior frontal gyrus             L            0.0423 (0.0035)               0.0402 (0.006)
                                   R            0.0411 (0.0073)               0.0419 (0.0052)
Hippocampus                        L            0.0012 (0.0012)               0.0012 (0.0012)
                                   R            0 (0)                         0.0039 (0.0027)
Middle/superior temporal gyrus*    L            0.0584 (0.0054)               0.0553 (0.0047)
                                   R            0.0594 (0.0053)               0.0563 (0.0045)
Postcentral gyrus                  L            0.0442 (0.0058)               0.042 (0.0042)
                                   R            0.0323 (0.0036)               0.0336 (0.0041)
Precentral gyrus                   L            0.0386 (0.0031)               0.0376 (0.0034)
                                   R            0.035 (0.0018)                0.0346 (0.0021)
Supramarginal gyrus                L            0.0168 (0.0057)               0.0183 (0.0043)
                                   R            0.022 (0.0064)                0.0185 (0.005)
Superior parietal lobule*          L            0.0067 (0.0035)               0.0035 (0.0023)
                                   R            0.027 (0.0063)                0.0155 (0.0055)
Transverse temporal gyrus          L            0.0381 (0.0043)               0.0424 (0.0049)
                                   R            0.042 (0.0146)                0.0286 (0.008)

*A main effect of condition was found (p < .05).


To compare activation in the two experimental conditions, a repeated-measures ANOVA was conducted on each region, which showed a main effect of condition in the middle/superior temporal region [F(1,10) = 5.67, p < .05]. Percent signal change was significantly larger in the mixed-talker condition. Coordinates for the centers of mass of the largest clusters in the temporal cortex in reference to the Talairach and Tournoux (1988) map were 56, 28, 1, and 56, 28, 2 for the mixed- and single-talker conditions, respectively. In addition, there was a main effect of talker condition [F(1,10) = 5.543, p < .05] in the superior parietal lobule. Percent signal change was significantly larger in the mixed-talker condition (Note 2) (the centers of mass were 33, 67, 45, and 33, 63, 48 for the mixed- and single-talker conditions, respectively). Figures 2 and 3 show the pattern of activation in the temporal and parietal cortices, respectively, in the two experimental conditions.

Figure 2. Activation in the temporal cortex averaged across subjects. The top panel shows activation in the right hemisphere in the mixed-talker condition, and the bottom panel shows activation in the blocked-talker condition. x = 38 to 62 (every third slice is shown).

Figure 3. Activation in the parietal cortex averaged across subjects. The top panel shows activation in the mixed-talker condition, and the bottom panel shows activation in the blocked-talker condition. z = 45 to 65 (every fourth slice is shown).

To investigate further the more precise location of activation in the middle/superior temporal region, a voxel-by-voxel paired t statistic was performed to compare (unthresholded) percent signal change in the mixed-talker condition with that in the blocked-talker condition across subjects. The results show that significant activation (p < .005; determined by a randomization procedure for the middle/superior temporal region, Ward, 2000; see Methods) extended anteriorly and posteriorly in the region, but was confined entirely lateral to the TTG and mostly posterior to the anterior end of the TTG, with 3456 mm³ in the left hemisphere (center of mass: 55, 19, 1) and 832 mm³ in the right hemisphere (center of mass: 58, 23, 3). Figure 2 shows the pattern of activation in the superior temporal region. A similar analysis conducted on the parietal region did not show significant results.
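The voxel-by-voxel comparison just described can be sketched as a paired t test at each voxel followed by a cluster-extent filter. The array shapes, the one-sided thresholding, and the simple contiguity labeling below are illustrative assumptions; the study's actual thresholds came from Monte Carlo simulation in AFNI (Ward, 2000; see Methods).

```python
import numpy as np
from scipy import stats, ndimage

def paired_t_cluster_map(mixed, blocked, t_thresh=3.58, min_cluster_vox=11):
    """mixed, blocked: (n_subjects, x, y, z) arrays of percent signal change.
    Returns a boolean map of voxels exceeding t_thresh that lie in clusters of
    at least min_cluster_vox contiguous voxels."""
    t_map, _ = stats.ttest_rel(mixed, blocked, axis=0)   # paired t at every voxel
    above = t_map > t_thresh                              # one-sided: mixed > blocked
    labels, n_clusters = ndimage.label(above)             # face-connected 3-D clusters
    sizes = ndimage.sum(above, labels, index=range(1, n_clusters + 1))
    big = [i + 1 for i, size in enumerate(sizes) if size >= min_cluster_vox]
    return np.isin(labels, big)

# Random data standing in for 11 subjects' percent-signal-change maps:
rng = np.random.default_rng(0)
mixed = rng.normal(0.05, 0.02, size=(11, 16, 16, 16))
blocked = rng.normal(0.04, 0.02, size=(11, 16, 16, 16))
print(paired_t_cluster_map(mixed, blocked).sum(), "voxels survive")
```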

DISCUSSION

Results from the present study show that recognizing words in both blocked- and mixed-talker conditions activates extensive networks of cortical regions. First, activation was observed in regions that are traditionally viewed as associated with speech perception (e.g., superior temporal gyrus, middle temporal gyrus, and inferior frontal gyrus). Activation in the precentral and postcentral gyri is likely to be associated with the sensory–motor aspect of pressing the response button in both conditions for target recognition. In addition, regions that are associated with attentional processing were also activated, including the anterior cingulate cortex and the parietal cortex. Comparison of the activated networks shows that the patterns of activation in the two conditions were similar to the extent that the regions that were activated in the blocked-talker condition were also generally activated in the mixed-talker condition. Given that the two tasks were similar—both require listening and attending to a list of words and responding by pressing a button—this result is
not surprising. However, the two conditions differ in the intensity of activation in two regions, the middle/superior temporal and the superior parietal regions. These results were likely due to the differences associated with listening to one versus multiple talkers—that is, differences that are specific to talker normalization. It is worth noting that the two conditions do not differ along any known parameters of task difficulty, memory requirements, attentional demands, or processing requirements other than those that have been discussed. Furthermore, although stimuli in the two conditions may differ in the arrangement of the syllables, offering different sequences of acoustic features, the results cannot be explained by acoustic differences between the two experimental conditions per se because the same targets and distractors are presented in both conditions—just in different arrangements separated
with ISIs of 750 msec. In their studies, Magnuson and Nusbaum (1993, 1994) synthesized two sets of monosyllabic words differing by 10 Hz in F0. In one condition, participants listened to a short dialog between two synthetic talkers differing in F0 by 10 Hz. In a second condition, another group of participants listened to a passage in which one synthetic talker used a 10-Hz F0 increment to accent certain words. They found that in the first condition, but not in the second condition, recognition times increased when there was a mix of two different F0s. These results showed that attributing acoustic differences to talker differences, not the acoustic differences themselves, is associated with the different behavioral performances across blocked- and mixed-talker conditions. Nevertheless, it is worth acknowledging that the greater acoustic variation in the mixed-talker condition may have resulted in less
habituation relative to the blocked-talker condition, which may have been associated with the additional cortical activation observed in the mixed-talker condition. However, habituation often has the connotation of reflecting a reduction in attention to processing. We can rule out that kind of habituation because listeners were faster and more accurate during the blocked-talker condition than the mixed-talker condition, so they were not attending less to the speech of a single talker. A different interpretation of habituation might reflect something more akin to priming or cortical tuning to the acoustic properties of the talker's voice, which is consistent with our interpretation of talker normalization.

Activation in the Middle/Superior Temporal Region

Various functional neuroimaging studies show activation in the middle/superior temporal region during speech
perception (e.g., Binder et al., 2000; Jancke, Wustenberg, Scheich, & Heinze, 2002). However, none of these studies associate recognition difficulty and listener effort with activation in this region. In the present study, listeners were slower and less accurate in identifying speech in the mixed-talker condition relative to the blocked-talker condition. These behavioral differences were accompanied by increased activation in the mixed-talker condition. Previous research has suggested that there is a relationship between increased effort or demand on cognitive resources and increased cortical activation. In single-cell recordings of monkeys, Lecas (1995) found that detecting changes in more contrastive visual stimuli (an easier task) resulted in neurons in the dorsolateral prefrontal region discharging at a rate of 4.2 spikes per second, while detecting changes in less contrastive visual stimuli (a more difficult task) resulted in a discharge rate of 19.1 spikes per second. In an fMRI study of humans, Just et al. (1996) found that processing
complex syntactic structures, relative to processing simple syntactic structures, shows greater activation in the left laterosuperior temporal cortex and inferior frontal cortex. Similar results relating increased processing effort with increased cortical activity have been found in lexical processing (Keller, Carpenter, & Just, 2001) and visuospatial processing (Carpenter, Just, Keller, & Eddy, 1999). The present study shows that processing complex speech signals is similarly accompanied by an increase in brain activation when the context of talker variability increases the difficulty of interpretation. It has been demonstrated that in addition to relying on the first and second vocal tract resonance frequencies (F1 and F2), identification of vowels in a mixed-talker condition relies heavily on the fundamental frequency (F0) as well as on the third vocal tract resonance frequency (F3). Nusbaum and Morin (1992) found that listeners were impaired in identifying vowels lacking F0, F3, or both cues only in a mixed-talker condition. These results suggest that perceiving speech in a mixed-talker condition relies on the computation of additional information in the speech signal. Specifically, these results suggest that additional spectral processing is needed in a mixed-talker condition. Results from the present study suggest that this additional spectral computational demand is likely to be associated with activation in the superior temporal region lateral to the TTG in the belt/parabelt region (Note 3), which has been found to be sensitive to spectral integration (e.g., frequency modulation and wider-band frequencies) in nonhuman primates (e.g., Rauschecker et al., 1995). Activation in the middle/superior temporal region involved both hemispheres in the present study. This bilateral involvement may relate to the bilateral activation in the superior temporal region associated with voice perception (Belin, Zatorre, Lafaille, Ahad, & Pike, 2000). It has been argued that phoneme perception and voice perception are mutually dependent because voice is encoded into the same memory trace and representation as speech (Goldinger, 1997; Mullennix & Pisoni, 1990; Mullennix, 1997). Furthermore, in the mixed-talker condition, different speakers and different voice qualities were present. Because voice perception also requires the processing of spectral cues across the acoustic spectrum (Yiu et al., 2002), it is not surprising to see greater activation in the superior temporal region in the mixed-talker condition. It seems quite likely that one component of the difference between mixed- and blocked-talker conditions is some aspect of voice recognition. However, not all of the increased cortical activity observed in the mixed-talker condition can be attributed to voice perception (as opposed to speech perception). Previous research has shown that the effect on perception of variation in talker voice can be distinguished from the effect of phonetic variability. For example, when listeners specifically attend to phonetic information, as in the present study, they appear to be insensitive to changes
in voice (Vitevitch, 2003). In other words, close monitoring of phonetic information reduces voice recognition, suggesting that increases in activity during talker variability in the present study are probably not due to increased voice recognition processing. Moreover, cortical activity that is specific to voice characteristics appears to be independent of speech (i.e., lexical and phonetic) processing. Voice recognition activates specific cortical areas that are different from those areas activated when attending to phonetic information (Belin & Zatorre, 2003; von Kriegstein, Eger, Kleinschmidt, & Giraud, 2003). In particular, von Kriegstein et al. (2003) found that although the superior temporal region shows increased activity in both hemispheres in processing vocal sounds (as opposed to nonvocal sounds), it appears to be activity in the right anterior superior temporal region that is specific to recognition of a talker's voice. That is, talker variability in a voice recognition task results in greater right anterior superior temporal activity. The present results demonstrate increased bilateral superior temporal activity including posterior portions of the temporal region, which is more typically associated with increased effort in recognizing speech (e.g., Just et al., 1996) rather than voice (e.g., Belin et al., 2000; von Kriegstein et al., 2003). In addition, the increased activity in the parietal cortex is not reported with perception of voices but instead is more typically reported in studies of perceptual attention (Yantis et al., 2002). Thus, although changes in cortical activity found in the mixed-talker condition may reflect an increase in the duty cycle of voice processing, there is still evidence of changes in the recognition of the speech itself over and above this.

Activation in the Superior Parietal Lobule

Another finding in the present study is the increased superior parietal activation in the mixed-talker condition. Although it has been shown that the parietal cortex is involved in voice recognition, as demonstrated by the voice recognition difficulties of people with right parietal lesions (Van Lancker, Cummings, Kreiman, & Dobkin, 1988), the superior parietal lobule is most often associated with spatial processing, particularly in the visual modality. For example, Corbetta, Miezin, Shulman, and Petersen (1993) found superior parietal activation when participants shifted attention to different visual spatial locations. More recently, this region has also been found to be associated with auditory spatial processing. Griffiths et al. (1998) asked participants to identify the directions of auditory stimuli presented with directional cues (changes in phase and amplitude between the two ears). Relative to listening to stimuli with no directional cues, they found activation in the right parietal regions to be the most prominent. Furthermore, Martinkauppi, Rama, Aronen, Korvenoja, and Carlson (2000) found that
the superior parietal lobule is involved in working memory for auditory localization. Because communicative situations involving multiple talkers often involve attending to these talkers at various spatial locations, it is possible that the cortical mechanisms for talker normalization are interconnected with sound and spatial localization networks. Because the parietal cortex receives inputs from multiple primary sensory and associative areas as well as polymodal sensory areas (Mesulam, Van Hoesen, Pandya, & Geschwind, 1977), it has been argued that the parietal cortex is responsible for attentional processes across sensory modalities. For example, in single-cell recording, Mazzoni, Bracewell, Barash, and Andersen (1996) found that neurons in the lateral intraparietal region of monkeys that responded to visual spatial location also responded to the auditory version of the same task. In addition to the established work in visual and nonverbal auditory perception, the present study provides evidence for such a claim with results from speech perception. As mentioned earlier, listening to speech in a mixed-talker condition requires listeners to process additional spectral cues in the speech signal relative to listening to a single talker. Attending to spectral cues has also been found to be associated with the superior parietal lobule in both vision and audition. Le, Pardo, and Hu (1998) found that attending to color (spectral) changes activates this region. In audition, Zatorre et al. (1999) found that attending to space and attending to frequency both activate the right superior parietal cortex. Thus, it has been argued that attending to spatial and spectral information is integrated in a unified system. This claim is also supported by behavioral evidence suggesting that frequency and spatial information are mutually dependent (Mondor, Zatorre, & Terrio, 1998). In this auditory selective attention study (Mondor et al., 1998), participants were presented with auditory stimuli that varied in pitch (high or low), location (left or right), or both, and were asked to identify either the pitch or the location of these stimuli. It was found that participants were slower and less accurate when both the pitch and the location of the stimuli varied relative to when only the dimension they needed to identify varied. If indeed spectral and spatial cues are unified in attention, this unified system, associated with the superior parietal lobule, is likely to be used to correctly recognize speech in mixed-talker situations, which requires attending to both spectral and spatial cues. While the superior temporal region may be responsible for processing the finer acoustic cues of the speech signal, as suggested earlier and by others (e.g., Bushara et al., 1999), the parietal cortex may be specifically related to attentional resources allocated to such processing. This could be made possible by pathways between the temporal and parietal regions. It has been found that in nonhuman primates, connections exist between the superior temporal region and the intraparietal sulcus (Lewis & Van Essen, 2000) and the posterior inferior parietal lobule (Mesulam et al.,
1977). These areas in the parietal region in nonhuman primates arguably correspond to the superior parietal lobule in humans in terms of cytoarchitectonics (Brodmann, 1994) and receptor architectonics (Zilles & Palomero-Gallagher, 2001). This temporal–parietal connection for auditory selective attention and speech recognition is similar to the model of visual object identification and attention proposed by LaBerge (1995), which involves connections between the occipital visual regions and the parietal cortex.

Activation in Other Regions

It has been proposed that attentional functions can be divided into three subtypes: orienting to sensory stimuli, vigilance, and executive attention (Fernandez-Duque & Posner, 2001). Anterior cingulate activation is often associated with "executive attention," a neuropsychological notion that incorporates elements of task switching, inhibitory control, conflict resolution, error detection, planning, and allocation of attentional resources (Fernandez-Duque & Posner, 2001). Both conditions in the present experiment involve a combination of these functions, but the mixed-talker condition did not necessarily involve "additional" involvement of these functions. Thus, it is not surprising that the anterior cingulate cortex was active in both conditions (see Table 1), but no reliable differences across the two conditions were observed. Other than the superior and middle temporal gyri, the other areas associated with phonological working memory, the inferior frontal and supramarginal regions (Paulesu et al., 1993; Baddeley, 1992), were not found to be significantly more active in the mixed-talker condition. The inferior frontal gyrus is associated with the subvocal rehearsal component of phonological working memory (Paulesu et al., 1993). If subvocal rehearsal was required in the present experiment, the degree to which it was required in the mixed-talker condition may not have been greater than in the blocked-talker condition. Thus, no significant difference was found between the two conditions. In fact, experimental tasks that are often viewed as related to phonological working memory involve monitoring phonemes or words (e.g., Awh et al., 1996), but not their acoustic correlates. Although both experimental conditions in the present study involved attending to words, talker normalization seems to largely involve integrating and attending to different acoustic cues, not phonemes or words. This could be the reason why regions traditionally viewed as related to phonological working memory were activated in both conditions, but not more active in the mixed-talker condition. Finally, the present results argue against the claims made by Goldinger (1997) and Pisoni, Saldana, and Sheffert (1996) that talker normalization effects observed in behavioral studies reflect the episodic encoding of linguistic and talker information into unified long-term
memory traces. If the mixed-talker condition reflected increased episodic memory encoding due to variation in one of the cues (talker identity), this model predicts that there should be hippocampal activity and specific increases in frontal cortical activity associated with memory encoding (Eichenbaum, 2002). Although there is clear evidence that the speech perception and attentional networks increase in activity during talker variability, there is no evidence for the kind of episodic memory trace encoding that Goldinger describes.

Conclusion

Although recognizing a spoken target word involves an extensive cortical network, monitoring words spoken by multiple talkers, relative to monitoring the speech of just one talker, activates the middle/superior temporal and superior parietal regions to a significantly greater degree. This temporal–parietal network subserves the increased computational and attentional demands required in resolving acoustic–phonetic ambiguities introduced by multiple talkers, which involves selective attention to spectral and spatial acoustic cues, as well as the processing of those cues.

METHODS

Participants

Eleven native speakers of American English, six women and five men, ranging in age from 18 to 31 years (mean 22), participated in this experiment. Participants were attending the University of Chicago at the time of the experiment and were paid for their participation. All participants were right-handed as indicated by the Edinburgh Handedness Inventory (Oldfield, 1971). Participants had no history of neurologic or audiologic symptoms, and all had normal hearing at the time of the experiment.

Stimuli and Tasks

Twenty monosyllabic English words spoken by four native speakers of American English (two women, two men) were used as stimuli. In the blocked-talker condition, listeners were first presented with a printed target word for one second. Printed words were projected from a Boxlight™ 4801 projector onto a DA-LITE™ projection screen; participants viewed the projection screen via a mirror installed in the head coil of the MRI scanner. This designated the target to be detected for the trial (chosen from the set ball, tile, cave, or done). Following the visual presentation of the target, participants were presented with a spoken list of 16 words produced by a single talker (randomly chosen from the set ball, tile, cave, done, dime, cling, priest, lash, romp, knife, reek, depth, park, gnash, greet, jaw, jolt, bluff,
and cad) at the rate of one word per second. Four of the 16 words were targets randomly located throughout the list, with the constraints that no target was first or last in the list and no two targets were presented in adjacent positions. Each word was approximately 300 msec in duration, followed by a 700-msec interstimulus interval (total stimulus onset asynchrony of 1 sec in all cases). Note that the word targets differed from the distractors in several phonemes so that no minimal phonetic pairs were formed, meaning that word perception did not depend critically on specific phoneme recognition (cf. Nusbaum & Morin, 1992). Participants were instructed to use the index finger of their right hand to press the left button of the response mouse as soon as they recognized the designated target word. Following each test trial, there were 16 sec of rest during which no stimuli were presented. The mixed-talker condition was identical to the blocked-talker condition except that the list of 16 words was spoken by a mix of four different talkers. Thus, the difference between the blocked- and mixed-talker conditions was in the serial order of the items within each type of trial. In each blocked-talker trial, each of the 16 utterances was produced by a single talker, whereas in a mixed-talker trial, each target was produced by a different talker and distractors were contributed by different talkers. Figure 1 shows the experimental paradigm. The experiment was divided into three functional scanning series (runs). Within each run of about 10 min and 40 sec in duration, participants completed four alternations of blocked- and mixed-talker trials, with a blocked-talker trial presented first (i.e., blocked-, mixed-, blocked-, mixed-, blocked-, mixed-, blocked-, mixed-talker). The inter-run interval was approximately 3 min. Stimuli were randomized and presented using PsyScope software (Cohen, MacWhinney, Flatt, & Provost, 1993) via headphones connected to a stereo system (Resonance Technologies, Northridge, CA). Stimuli were presented binaurally at about 78 dB SPL at the ear. The noise level of the scanner was about 115 dB SPL; the headphones gave 25–30 dB of noise attenuation.
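The trial structure described above can be made concrete with a short sketch that assembles one blocked-talker and one mixed-talker list: 16 items at one word per second, four of which are the target, with no target first or last and no two targets adjacent. The word sets and constraints follow the description above; the talker labels, function names, and sampling logic are illustrative assumptions (the sketch does not reproduce the exact talker assignments used in the actual stimulus lists).

```python
import random

TARGETS = ["ball", "tile", "cave", "done"]
DISTRACTORS = ["dime", "cling", "priest", "lash", "romp", "knife", "reek", "depth",
               "park", "gnash", "greet", "jaw", "jolt", "bluff", "cad"]
TALKERS = ["F1", "F2", "M1", "M2"]          # two female and two male talkers

def target_positions(list_len=16, n_targets=4):
    """Pick target slots: none first or last, and no two targets adjacent."""
    while True:
        slots = sorted(random.sample(range(1, list_len - 1), n_targets))
        if all(later - earlier > 1 for earlier, later in zip(slots, slots[1:])):
            return slots

def make_trial(target, mixed):
    """Return a 16-item list of (word, talker) pairs for one trial."""
    slots = target_positions()
    fillers = random.sample(DISTRACTORS, 12)
    trial_talker = random.choice(TALKERS)        # used throughout a blocked-talker trial
    target_talkers = random.sample(TALKERS, 4)   # each target by a different talker when mixed
    items = []
    for i in range(16):
        if i in slots:
            word = target
            talker = target_talkers[slots.index(i)] if mixed else trial_talker
        else:
            word = fillers.pop()
            talker = random.choice(TALKERS) if mixed else trial_talker
        items.append((word, talker))
    return items

blocked_trial = make_trial("cave", mixed=False)   # one talker throughout the list
mixed_trial = make_trial("cave", mixed=True)      # talker varies from item to item
```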

Image Acquisition

Imaging sessions were carried out in the Brain Research Imaging Center at the University of Chicago on a 3-T Signa scanner (GE Medical Systems, Milwaukee, WI) with a standard GE head coil. Thirty 5-mm-thick anatomical T1-weighted images [500 msec repetition time (TR), 6 msec echo time (TE), 60° flip angle] were acquired in the axial plane. Thirty 5-mm functional (T2*-weighted) images (TR = 2000 msec, TE = 35 msec, 60° flip angle) were acquired in the axial plane using BOLD contrast with a one-shot spiral technique (Noll, Cohen, Meyer, & Schneider, 1995), providing 1.875 × 1.875 mm (in-plane) × 5 mm (slice thickness) resolution. Slice locations were matched exactly between the functional and anatomical scans. A 3-D T1-weighted spiral volume scan (124 slices, 1.5-mm thick, TR = 20 msec, TE = 6 msec, 40° flip angle, 256 by 192 matrix) was acquired, allowing the identification of neuroanatomy with high resolution in all three orthogonal planes.

Image Analysis

Each functional scanning series was normalized to a zero mean, and a quadratic detrending procedure was applied to remove changes in the baseline signal (Cox, 1996). Images were coregistered using a 3-D linear (six-parameter) automated image registration algorithm (Woods, Cherry, & Mazziotta, 1992). The resulting functional scanning series were concatenated together. Functional and anatomical images were transformed into the atlas of Talairach and Tournoux (1988) by a series of linear transformations. Anatomical regions of interest were marked automatically using the Talairach Daemon (Lancaster et al., 1997, 2000) implemented in AFNI (Cox, 1996). Boundaries for these anatomical regions were defined by Talairach and Tournoux, and modified and implemented by Lancaster et al. (1997, 2000). Square waves modeling the events of interest (blocked- and mixed-talker conditions) were created, and these were convolved with a general model of the hemodynamic response (Cox, 1996) to create extrinsic model waveforms of the task-related hemodynamic response. The resulting waveforms were used as regressors in a multiple linear regression of the voxel-based time series (Ward, 2001). Linear coefficients signifying the fit of the regressors to the functional scanning series, voxel by voxel, for the blocked- and mixed-talker conditions were obtained for each participant. These coefficients were then divided by the coefficients of the baseline, resulting in percent signal change relative to the baseline. Both a single-voxel statistical threshold, F = 19.67 (p < .00001), and a 3-D contiguity threshold of five voxels were used to determine significant activation. These values were obtained from a Monte Carlo simulation (Ward, 2000) that estimated the corresponding whole-brain alpha level as less than .05. For each region of interest, voxels with percent signal change above threshold were averaged. Group maps shown in Figures 2 and 3 were generated by averaging the thresholded percent signal change on a voxel-by-voxel basis across subjects. Unthresholded percent signal change in the middle/superior temporal region from each condition across subjects was compared using voxelwise 3-D paired t statistics. This analysis is able to show more precisely the location in this region that is more active in one condition. Both a single-voxel statistical threshold, F = 3.58 (p < .005), and a 3-D contiguity threshold of 11 voxels were used to determine significant activation. A Monte Carlo simulation (Ward, 2000) estimated the corresponding alpha level as less than .05 for this region (Figure 4).
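As a rough illustration of the regression analysis described above, the sketch below builds a boxcar regressor for one condition (16-sec trials alternating with 16-sec rest, sampled at the 2-sec TR), convolves it with a canonical hemodynamic response, fits it to a single voxel's time series by least squares, and expresses the condition coefficient as percent signal change relative to the baseline term. The run length, the gamma-shaped HRF, and the simulated data are illustrative assumptions; the actual analysis used AFNI's regression and Monte Carlo thresholding tools (Cox, 1996; Ward, 2000, 2001).

```python
import numpy as np

TR = 2.0                                   # sec, matching the acquisition above
n_scans = 320                              # hypothetical run length in volumes
frame_times = np.arange(n_scans) * TR

# Boxcar for one condition: 16-sec "on" blocks alternating with 16-sec rest.
boxcar = ((frame_times % 32.0) < 16.0).astype(float)

# Simple gamma-shaped HRF sampled at the TR (an illustrative canonical response).
t = np.arange(0, 30, TR)
hrf = t ** 5 * np.exp(-t)
hrf /= hrf.sum()
regressor = np.convolve(boxcar, hrf)[:n_scans]

# Simulated voxel time series: baseline plus a scaled response plus noise.
rng = np.random.default_rng(1)
voxel = 1000.0 + 5.0 * regressor + rng.normal(0.0, 2.0, n_scans)

# Least-squares fit with an intercept; the intercept estimates the baseline signal.
X = np.column_stack([np.ones(n_scans), regressor])
beta, *_ = np.linalg.lstsq(X, voxel, rcond=None)
percent_signal_change = 100.0 * beta[1] / beta[0]
print(f"percent signal change ~ {percent_signal_change:.3f}%")
```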


Figure 4. Relative to the blocked-talker condition, activation increased in the mixed-talker condition in the superior temporal region in both hemispheres. The top and bottom panels show activation in the left hemisphere (x = 53 to 61) and right hemisphere (x = 49 to 57), respectively. For anatomical reference, the center of the cross, shown here on the lateral surface, indicates the approximate anterior extent of the transverse temporal gyrus. Functional data are projected onto the brain of one subject for visual clarity.


Acknowledgments

The authors thank Peter Hlustik, Sarah Orrin, Jeremy Skipper, Ana Solodkin, and Stephen Uftring for their assistance in this study. The support of the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health, grant number DC-3378, is gratefully acknowledged.

Reprint requests should be sent to Patrick C. M. Wong, PhD, Speech Research Laboratory, Department of Communication Sciences and Disorders, Northwestern University, 2240 Campus Drive, Evanston, IL 60208, USA, or via e-mail: pwong@northwestern.edu.

The data reported in this experiment have been deposited in the fMRI Data Center (http://www.fmridc.org). The accession number is 2-2004-1166B.

Notes

1. The middle and superior temporal regions were combined in our analysis because activation in these regions consisted of contiguous clusters covering both regions (see Figure 4).

2. Our analysis also included the extent of activation (number of voxels). Only results from the superior temporal region were found to be significant.

3. As far as we know, no research studies have been conducted to directly compare the core, belt, and parabelt regions in nonhuman primates and in humans in terms of cytoarchitectonics, macro-neuroanatomy, and functions. It is certainly difficult to pinpoint where the belt and parabelt regions are in humans based on macro-neuroanatomy. In using the terms "belt" and "parabelt" to refer to the temporal cortex in humans, we mean areas grossly lateral to the TTG.

REFERENCES

Baddeley, A. (1992). Working memory. Science, 255, 556–559.
Belin, P., & Zatorre, R. J. (2003). Adaptation to speaker's voice in right anterior temporal lobe. NeuroReport, 14, 2105–2109.
Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., & Pike, B. (2000). Voice-selective areas in human auditory cortex. Nature, 403, 309–312.
Binder, J. R., Frost, J. A., Hammeke, T. A., Bellgowan, P. S., Springer, J. A., Kaufman, J. N., & Possing, E. T. (2000). Human temporal lobe activation by speech and nonspeech sounds. Cerebral Cortex, 10, 512–528.
Blumstein, S. E. (1998). Phonological aspects of aphasia. In M. T. Sarno (Ed.), Acquired aphasia (3rd ed., pp. 157–185). San Diego, CA: Academic Press.
Brodmann, K. (1994). Brodmann's localisation in the cerebral cortex (L. J. Garey, Trans.). London: Smith-Gordon. (Original work published in 1909).
Bushara, K. O., Weeks, R. A., Ishii, K., Catalan, M. J., Tian, B., Rauschecker, J. P., & Hallett, M. (1999). Modality-specific frontal and parietal areas for auditory and visual spatial localization in humans. Nature Neuroscience, 2, 759–766.

Carpenter, P. A., Just, M., Keller, T. A., & Eddy, W. (1999). Graded functional activation in visuospatial system with the amount of task demand. Journal of Cognitive Neuroscience, 11, 9–24.
Creelman, C. D. (1957). Case of the unknown talker. Journal of the Acoustical Society of America, 29, 655.
Cohen, J. D., MacWhinney, B., Flatt, M., & Provost, J. (1993). PsyScope: A new graphic interactive environment for designing psychology experiments. Behavioral Research Methods, Instruments & Computers, 25, 257–271.
Corbetta, M., Miezin, F. M., Shulman, G. L., & Petersen, S. E. (1993). A PET study of visuospatial attention. Journal of Neuroscience, 13, 1202–1226.
Cox, R. W. (1996). AFNI: Software for analysis and visualization of functional magnetic resonance neuroimages. Computers in Biomedical Research, 29, 162–173.
Eichenbaum, H. (2002). The cognitive neuroscience of memory. Boston, MA: Oxford.
Fernandez-Duque, D., & Posner, M. I. (2001). Brain imaging of attentional networks in normal and pathological states. Journal of Clinical and Experimental Neuropsychology, 23, 74–93.
Fodor, J. (1983). The modularity of mind. Cambridge, MA: MIT Press.
Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14, 3–28.
Gerstman, L. J. (1968). Classification of self-normalized vowels. IEEE Transactions on Audio Electroacoustics, AU-16, 78–80.
Geschwind, N. (1965). Disconnexion syndromes in animals and man: Part I. Brain, 88, 237–294.
Goldinger, S. D. (1997). Words and voices: Perception and production in an episodic lexicon. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 33–66). San Diego, CA: Academic Press.
Griffiths, T. D., Rees, G., Rees, A., Green, G. G., Witton, C., Rowe, D., Buchel, C., Turner, R., & Frackowiak, R. S. (1998). Right parietal cortex is involved in the perception of sound movement in humans. Nature Neuroscience, 1, 74–79.
Hickok, G., & Poeppel, D. (2000). Towards a functional neuroanatomy of speech perception. Trends in Cognitive Sciences, 4, 131–138.
Jancke, L., Wustenberg, T., Scheich, H., & Heinze, H. J. (2002). Phonetic perception and the temporal cortex. Neuroimage, 15, 733–746.
Johnson, K. (1991). Differential effects of speaker and vowel variability on fricative perception. Language and Speech, 34, 265–279.
Joos, M. A. (1948). Acoustic phonetics. Language, 24, 1–136.
Just, M. A., Carpenter, P. A., Keller, T. A., Eddy, W. F., & Thulborn, K. R. (1996). Brain activation modulated by sentence comprehension. Science, 274, 114–116.
Kaas, J. H., & Hackett, T. A. (1999). "What" and "where" processing in auditory cortex. Nature Neuroscience, 2, 1045–1047.
Keller, T. A., Carpenter, P. A., & Just, M. A. (2001). The neural bases of sentence comprehension: An fMRI examination of syntactic and lexical processing. Cerebral Cortex, 11, 223–237.
LaBerge, D. (1995). Computational and anatomical models of selective attention in object identification. In M. S. Gazzaniga (Ed.), The cognitive neurosciences. Cambridge, MA: MIT Press.
LaBerge, D. (1998). Attentional emphasis in visual orienting and resolving. In R. D. Wright (Ed.), Visual attention. Vancouver studies in cognitive science (vol. VIII, pp. 417–454). New York: Oxford University Press.
Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. Journal of the Acoustical Society of America, 29, 98–104.


Le, T. H., Pardo, J. V., & Hu, X. (1998). 4 T-fMRI study of nonspatial shifting of selective attention: Cerebellar and parietal contributions. Journal of Neurophysiology, 79, 1535–1548.
Lecas, J.-C. (1995). Prefrontal neurons sensitive to increased visual attention in the monkey. NeuroReport, 7, 305–309.
Lewis, J. W., & Van Essen, D. C. (2000). Corticocortical connections of visual, sensorimotor, and multimodal processing areas in the parietal lobe of the Macaque monkey. Journal of Comparative Neurology, 428, 112–137.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431–461.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36.
Lieberman, P. (1973). On the evolution of language. Cognition, 2, 59–94.
Logan, G. D. (1979). On the use of a concurrent memory load to measure attention and automaticity. Journal of Experimental Psychology: Human Perception and Performance, 5, 189–207.
Magnuson, J. S., & Nusbaum, H. C. (1993). Talker differences and perceptual normalization. Journal of the Acoustical Society of America, 93, 2371.
Magnuson, J. S., & Nusbaum, H. C. (1994). Some acoustic and nonacoustic conditions that produce talker normalization. Proceedings of the Acoustical Society of Japan, 637–638.
Martinkauppi, S., Rama, P., Aronen, H. J., Korvenoja, A., & Carlson, S. (2000). Working memory of auditory localization. Cerebral Cortex, 10, 889–898.
Mazzoni, P., Bracewell, R. M., Barash, S., & Andersen, R. A. (1996). Spatially tuned auditory responses in areas LIP of macaques performing delayed memory saccades to acoustic targets. Journal of Neurophysiology, 75, 1233–1241.
Mesulam, M.-M., Van Hoesen, G. W., Pandya, D. N., & Geschwind, N. (1977). Limbic and sensory connections of the inferior parietal lobule (area PG) in the Rhesus Monkey: A study with a new method for horseradish peroxidase histochemistry. Brain Research, 136, 393–414.
Miller, J. L., & Dexter, E. R. (1988). Effects of speaking rate and lexical status on phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 14, 369–378.
Mondor, T. A., Zatorre, R. J., & Terrio, N. A. (1998). Constraints on selection of auditory information. Journal of Experimental Psychology: Human Perception and Performance, 24, 66–79.
Mondor, T. A., & Zatorre, R. J. (1995). Shifting and focusing auditory spatial attention. Journal of Experimental Psychology: Human Perception and Performance, 21, 387–409.
Morel, A., Garraghty, P. E., & Kaas, J. H. (1993). Tonotopic organization, architectonic fields, and connections of auditory cortex in macaque monkeys. Journal of Comparative Neurology, 335, 437–459.
Mullennix, J. W., & Pisoni, D. B. (1990). Stimulus variability and processing dependencies in speech perception. Perception and Psychophysics, 47, 379–390.
Mullennix, J. W., Pisoni, D. B., & Martin, C. S. (1989). Some effects of talker variability on spoken word recognition. Journal of the Acoustical Society of America, 85, 365–377.
Navon, D., & Gopher, D. (1979). On the economy of the human-processing system. Psychological Review, 86, 214–255.
Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85, 2088–2113.


Noll, D. C., Cohen, J. D., Meyer, C. H., & Schneider, W. (1995). Spiral k-space MR imaging of cortical activation. Journal of Magnetic Resonance Imaging, 5, 49–56.
Nusbaum, H. C., & Magnuson, J. (1997). Talker normalization: Phonetic constancy as a cognitive process. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 109–132). San Diego, CA: Academic Press.
Nusbaum, H. C., & Morin, T. M. (1992). Paying attention to differences among talkers. In Y. Tohkura, Y. Sagisaka, & E. Vatikiotis-Bateson (Eds.), Speech perception, production, and linguistic structure (pp. 113–134). Tokyo: Ohmasha Publishing.
Nusbaum, H. C., & Schwab, E. C. (1986). The role of attention and active processing in speech perception. In E. C. Schwab & H. C. Nusbaum (Eds.), Speech perception, vol. 1, Pattern recognition by humans and machines (pp. 113–157). New York: Academic Press.
Oldfield, R. C. (1971). The assessment and analysis of handedness: The Edinburgh Inventory. Neuropsychologia, 9, 97–113.
Paulesu, E., Frith, C. D., & Frackowiak, R. S. J. (1993). The neural correlates of the verbal component of working memory. Nature, 362, 342–345.
Peterson, G., & Barney, H. (1952). Control methods used in a study of vowels. Journal of the Acoustical Society of America, 24, 175–184.
Pisoni, D. B. (1997). Some thoughts on normalization in speech perception. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 9–32). San Diego, CA: Academic Press.
Pisoni, D. B., Saldana, H. M., & Sheffert, S. M. (1996). Multi-modal encoding of speech in memory: A first report. Proceedings of the International Congress on Spoken Language Processing (pp. 1664–1667). Philadelphia, PA, October 1996.
Posner, M. I. (1994). Neglect and spatial attention. Neuropsychological Rehabilitation, 4, 183–187.
Rand, T. C. (1971). Vocal tract size normalization in the perception of stop consonants. Paper presented at the 81st meeting of the Acoustical Society of America, Washington.
Rauschecker, J. P., Tian, B., & Hauser, M. (1995). Processing of complex sounds in the Macaque nonprimary auditory cortex. Science, 268, 111–114.
Sanides, F. (1975). Comparative neurology of the temporal lobe in primates including man with reference to speech. Brain and Language, 2, 396–419.
Strange, W., Jenkins, J., & Johnson, T. (1983). Dynamic specification of coarticulated vowels. Journal of the Acoustical Society of America, 74, 695–705.
Strange, W., Verbrugge, R., Shankweiler, D., & Edman, T. (1976). Consonant environment specifies vowel identity. Journal of the Acoustical Society of America, 60, 213–224.
Summerfield, A. Q. (1981). On articulatory rate and perceptual constancy in phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 7, 1074–1095.
Syrdal, A. K., & Gopal, H. S. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America, 79, 1086–1100.
Talairach, J., & Tournoux, P. (1988). Co-planar stereotaxic atlas of the human brain: 3D proportional system: An approach to cerebral imaging. New York: Georg Thieme Verlag.
Talavage, T. M., Ledden, P. J., Benson, R. R., Rosen, B. R., & Melcher, J. R. (2000). Frequency-dependent responses exhibited by multiple regions in human auditory cortex. Hearing Research, 150, 225–244.


Tian, B., & Rauschecker, J. P. (1995). FM-selectivity of neurons in the lateral areas of rhesus monkey auditory cortex. Society for Neuroscience Abstracts, 21, 669.
Tian, B., Reser, D., Durham, A., Kustov, A., & Rauschecker, J. P. (2001). Functional specialization in Rhesus monkey auditory cortex. Science, 292, 290–293.
Van Lancker, D., Cummings, J. L., Kreiman, J., & Dobkin, B. H. (1988). Phonagnosia: A dissociation between familiar and unfamiliar voices. Cortex, 24, 195–209.
Vitevitch, M. S. (2003). Change deafness: The inability to detect changes between two voices. Journal of Experimental Psychology: Human Perception and Performance, 29, 333–342.
von Kriegstein, K., Eger, E., Kleinschmidt, A., & Giraud, A. L. (2003). Modulation of neural responses to speech by directing attention to voices or verbal content. Cognitive Brain Research, 17, 48–55.
Ward, B. D. (2000). Simultaneous inference for fMRI data. Medical College of Wisconsin.
Ward, B. D. (2001). Deconvolution analysis of fMRI time series data. Medical College of Wisconsin.

Wong, P. C. M., & Diehl, R. L. (2003). Perceptual normalization for inter- and intra-talker variation in Cantonese level tones. Journal of Speech, Language, and Hearing Research, 46, 413–421.
Woods, R. P., Cherry, S. R., & Mazziotta, J. C. (1992). Rapid automated algorithm for aligning and reslicing PET images. Journal of Computer Assisted Tomography, 16, 620–633.
Yantis, S., Schwarzbach, J., Serences, J. T., Carlson, R. L., Steinmetz, M. A., Pekar, J. J., & Courtney, S. M. (2002). Transient neural activity in human parietal cortex during spatial attention shifts. Nature Neuroscience, 5, 995–1001.
Yiu, E. M., et al. (2002). Perception of synthesized voice quality in connected speech by Cantonese speakers. Journal of the Acoustical Society of America, 112, 1091–1101.
Zatorre, R. J., Meyer, E., Gjedde, A., & Evans, A. C. (1996). PET studies of phonetic processing of speech: Review, replication, and reanalysis. Cerebral Cortex, 6, 21–30.
Zatorre, R. J., Mondor, T. A., & Evans, A. C. (1999). Auditory attention to space and frequency activates similar cerebral systems. Neuroimage, 10, 544–554.
Zilles, K., & Palomero-Gallagher, N. (2001). Cyto-, myelo-, and receptor architectonics of the human parietal cortex. Neuroimage, 14, S8–S20.


