Identification of Mandarin tones by English-speaking musicians and nonmusicians

Chao-Yang Lee a)
School of Hearing, Speech and Language Sciences, Ohio University, Athens, Ohio 45701

Tsun-Hui Hung b)
School of Music, Ohio University, Athens, Ohio 45701

(Received 21 May 2008; revised 29 August 2008; accepted 31 August 2008)

This study examined Mandarin tone identification by 36 English-speaking musicians and 36 nonmusicians and musical note identification by the musicians. In the Mandarin task, participants were given a brief tutorial on Mandarin tones and identified the tones of the syllable sa produced by 32 speakers. The stimuli included intact syllables and acoustically modified syllables with limited F0 information. Acoustic analyses showed considerable overlap in F0 range among the tones due to the presence of multiple speakers. Despite no prior experience with Mandarin, the musicians identified intact tones at 68% and silent-center tones at 54% correct, both exceeding chance (25%). The musicians also outperformed the nonmusicians, who identified intact tones at 44% and silent-center tones at 36% correct. These results indicate musical training facilitated lexical tone identification, although the facilitation varied as a function of tone and the type of acoustic input. In the music task, the musicians listened to synthesized musical notes of three timbres and identified the notes without a reference pitch. Average identification accuracy was at chance level even when multiple semitone errors were allowed. Since none of the musicians possessed absolute pitch, the role of absolute pitch in Mandarin tone identification remains inconclusive. © 2008 Acoustical Society of America. [DOI: 10.1121/1.2990713]

PACS number(s): 43.71.Hw, 43.75.Cd [DD]

I. INTRODUCTION

Pitch perception is involved in both spoken language comprehension and music perception. For language, pitch is used for linguistic contrasts such as lexical tone and intonation (Ladefoged, 2003). For music, pitch is one of the most common dimensions for constructing an organized system of musical elements (Patel, 2008). Given the functional role of pitch in linguistic and musical contrasts, a legitimate question is whether the same perceptual mechanism is implicated in both linguistic and musical pitch perceptions.

This study explored the relationship between linguistic and musical pitch perceptions by examining identification of Mandarin tones and musical notes by English-speaking musicians and nonmusicians. In particular, we investigated how non-native listeners with or without a musical background dealt with identifying acoustically modified Mandarin tones produced by multiple speakers. The musicians also participated in a music task where they were asked to identify musical notes without a reference pitch. Our goal was to evaluate the nature of non-native lexical tone perception by incorporating common challenges to speech perception and to assess the role of absolute pitch in the musicians' advantage in lexical tone processing commonly reported in the literature (Alexander et al., 2005; Gottfried, 2007; Gottfried and Riester, 2000; Gottfried et al., 2001; Wong et al., 2007).

a) Electronic mail: [email protected]
b) Present address: Program in Cognitive Ethnomusicology, School of Music, The Ohio State University, Columbus, Ohio 43210.


Mandarin is a lexical tone language where pitch variation over a syllable is lexically contrastive. It has been established that F0 is the primary acoustic correlate of lexical tone (e.g., Tseng, 1981). Identification of musical notes certainly involves detecting pitch as well. In other words, identification of lexical tones and musical notes both involve mapping pitch information onto discrete linguistic or musical categories. If common perceptual mechanisms are involved, performance in musical and linguistic pitch perceptions should be correlated.

There is some experimental evidence on the relationship between music and lexical tone perception. Studies on lexical tone perception by non-native listeners showed that listeners with a musical background tend to identify lexical tones better than those without a musical background. Gottfried and Riester (2000), also reported in Gottfried (2007), found that English-speaking listeners who were music majors identified Mandarin tones better than nonmajors. In addition, tone identification accuracy also correlated positively with performance in detecting the direction of nonlinguistic sine-wave glides. Gottfried et al. (2001), also reported in Gottfried (2007), showed that English-speaking musicians were more accurate than nonmusicians in judging whether two stimuli had the same or different Mandarin tones. The musicians also performed better in vocally imitating the tones. Alexander et al. (2005) similarly found that English-speaking musicians were faster and more accurate than nonmusicians in Mandarin tone discrimination and identification. Wong et al. (2007) in an electrophysiological study found more robust brainstem encoding of linguistic pitch in English-speaking musicians, who also showed better identification and discrimination of Mandarin tones in a behavioral task.


Taken together, these results support the idea that extensive musical experience enhances the processing of linguistic pitch, suggesting an overlap of pitch processing mechanism in the two domains. Gottfried (2007), Gottfried and Riester (2000), and Wong et al. (2007) also hinted that the ability to track pitch direction and movement is implicated in the musicians' superior performance in lexical tone processing.

Recently, Deutsch and co-workers suggested that lexical tone is associated with absolute pitch, the ability to name or produce the note of a particular musical pitch in the absence of a reference note (Deutsch et al., 2004a; Deutsch et al., 2006). In particular, Deutsch and co-workers noted that individuals with absolute pitch are able to associate a particular pitch with a verbal label, just as tone language speakers associate a particular pitch or a combination of pitches with a linguistic tonal category. Therefore, tone language speakers could be said to possess a form of absolute pitch. Since absolute pitch is treated by tone language speakers as a feature of speech, tone language speakers would be expected to show evidence of absolute pitch. To test this hypothesis, Deutsch et al. (2004a) asked speakers of Mandarin, Vietnamese (both tone languages), and English (a nontone language) to read a word list of their native language in several recording sessions across two days. Acoustic analyses of the recordings showed that the average F0 of the enunciations across recording sessions was more consistent for the tone language speakers compared to the English speakers. This result was interpreted as demonstration of a precise and stable form of absolute pitch in tone language speakers. Burnham et al. (2004) replicated this finding with a variant of the word production task. They noted, however, that the difference between the tone languages (3/4 semitone) and the nontone language (1 semitone) was not particularly substantial.

Additional evidence for the connection between absolute pitch and tone language came from an investigation on the prevalence of absolute pitch in musicians speaking a tone versus a nontone language. Deutsch et al. (2006) reported a significantly higher percentage of absolute pitch in musicians who speak Mandarin as their native language compared to those who speak English. In the absolute pitch task, participants listened to 36 piano notes and were asked to indicate the name of each note without a reference pitch. In general, absolute pitch was found to be more prevalent for people who started musical training at an earlier age. When the age of onset of musical training was taken into consideration, absolute pitch was substantially more prevalent in the Mandarin-speaking musicians, suggesting that the linguistic use of pitch may play a role in processing musical pitch categories.

If absolute pitch is indeed implicated in linguistic pitch processing, one would expect superior performance by individuals with absolute pitch in lexical tone processing. Although Deutsch et al. (2004a) and Burnham et al. (2004) demonstrated greater consistency of pitch production in tone language speakers, it is not clear whether the acoustic data could address how absolute pitch was manifested perceptually.


While the prevalence of absolute pitch in Mandarin-speaking musicians (Deutsch et al., 2006) suggests the predisposition of tone language speakers in acquiring absolute pitch, it is not clear whether absolute pitch actually contributes to lexical tone perception. Although previous research has consistently shown the advantage of musical background in lexical tone processing (Gottfried, 2007; Gottfried and Riester, 2000; Gottfried et al., 2001; Alexander et al., 2005; Wong et al., 2007), Gottfried and Riester (2000), also reported in Gottfried (2007), remains to our knowledge the only study that evaluated both linguistic and nonlinguistic pitch perceptions in the same group of individuals. To further explicate the nature of the association between linguistic and musical pitch perception and to test the absolute pitch hypothesis, it would be informative to assess performance in both lexical tone and musical pitch identification and to evaluate the correlation between the two tasks.

With these considerations, the present study included two perception experiments, one on lexical tone identification and the other on musical note identification. The Mandarin tone experiment evaluated how English-speaking musicians and nonmusicians dealt with learning to identify Mandarin tonal contrasts in the face of minimal experience with Mandarin, incomplete acoustic input, and speaker variability. In particular, speech signals are often incomplete and variable, yet human listeners are remarkably adept at uncovering linguistic representations intended by speakers. In the speech perception literature, the use of acoustically degraded stimuli has revealed important information about how human listeners handle various sources of acoustic variability to achieve perceptual constancy (e.g., Strange et al., 1983). Dealing with speaker variability is also an integral part of speech processing (Johnson, 2005). The nature of the musicians' advantage in lexical tone processing may be revealed when these common challenges to speech perception are incorporated into test materials. The musical note experiment was a variant of the absolute pitch task used in Deutsch et al. (2006), assessing musicians' ability to identify musical notes without a reference pitch. With measures of both lexical tone and musical note identification, the association between Mandarin tone identification and absolute pitch could be evaluated.

In sum, the present study aimed to explore the relationship between linguistic and musical pitch processing by addressing the following questions:

(1) What is the role of musical background in the perceptual processing of intact and incomplete Mandarin tones? If musical pitch processing abilities could facilitate linguistic pitch perception, musicians would be expected to perform better than nonmusicians in the Mandarin tone identification task.

(2) What is the nature of the musicians' advantage in identifying Mandarin tones, if there is indeed an advantage? If absolute pitch is implicated (Deutsch et al., 2004a; Deutsch et al., 2006), there should be a positive correlation between performance in the Mandarin tone identification task and in the musical note identification task.

II. EXPERIMENT 1: MANDARIN TONE IDENTIFICATION

The stimuli in the Mandarin tone identification task included intact and acoustically modified Mandarin syllables produced by multiple speakers. The modified syllables included "silent-center" and "onset-only" syllables, where the majority of the voiced portion of a syllable was attenuated to silence and devoid of F0 information. As noted, the purpose of introducing acoustically degraded stimuli and speaker variability was to create a task that would incorporate common challenges to speech perception. It has been shown that silent-center Mandarin tones could be identified quite accurately by native and non-native listeners despite missing substantial F0 information, indicating that listeners are capable of reconstructing lexical tones based on limited acoustic input (Gottfried and Suiter, 1997; Lee et al., 2006, 2008a). Furthermore, incomplete stimuli also allow examination of specific pitch dimensions (e.g., pitch direction and pitch height) that may contribute to tone identification.

Dealing with speaker variability is also an essential part of lexical tone processing. Given that F0 range varies across individuals, the absolute F0 of a given tone is likely to differ across speakers; different tones are also likely to show F0 overlap across speakers. A number of studies have shown that some speaker normalization process is implicated in lexical tone perception (Leather, 1983; Lin and Wang, 1984; Moore and Jongman, 1997; Wong and Diehl, 2003; Zhou et al., 2008) and that speaker variability imposes a processing cost on the identification of intact and incomplete Mandarin tones (Lee et al., 2008b). In the present study, the stimuli included syllables produced by 16 female and 16 male speakers. Acoustic analyses would be conducted to verify that the purported speaker variability actually exists.

A. Method

1. Materials

The Mandarin syllable sa, produced with all four tones by 16 female and 16 male native speakers, was selected to generate the stimuli for this experiment. The recordings were made in a sound-treated booth with an Audio-technica AT825 field recording microphone connected through a preamplifier and A/D converter (USBPre microphone interface) to a Windows personal computer. The speakers were instructed to read the syllables in citation form. The recordings were digitized with the Brown Laboratory Interactive Speech System (BLISS, Mertus, 2000) at 44.1 kHz with 16 bit quantization. Each syllable was identified from the BLISS waveform display, excised from the master file, and saved as an audio file. The peak amplitude was normalized across syllables with the "scale to maximum" function of BLISS.

Each sa syllable was digitally processed with BLISS to generate three types of syllables: intact, silent-center, and onset-only syllables. The silent-center syllables were generated by removing all but the first and final 15% of the voiced portion. The onset-only syllables were generated by removing all but the first 15% of the voiced portion. The fricative [s] was preserved in all intact and modified syllables. The removed parts were digitally "silenced" such that the overall duration remained the same as that of the intact syllables.

FIG. 1. Illustration of stimulus construction in the Mandarin tone experiment. From top down: intact, silent-center, and onset-only syllables. Silent-center syllables were generated by digitally silencing the middle 70% of the voiced portion. Onset-only syllables were constructed by silencing all but the first 15% of the voiced portion.

There were no perceptible clicks as a result of the signal processing; therefore, no further tapering procedure was applied. An example of the acoustic modifications is shown in Fig. 1. A total of 384 stimuli (4 tones × 32 speakers × 3 modifications) were used in the experiment.
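For readers who wish to approximate the silencing procedure, the following is a minimal sketch in Python rather than BLISS, which the authors used; the file name and the hand-marked voiced-portion boundaries (voiced_start, voiced_end, in samples) are illustrative assumptions only.

```python
import soundfile as sf  # assumed available for WAV input/output

def make_modified_syllables(wav_path, voiced_start, voiced_end):
    """Create silent-center and onset-only versions of one syllable.

    voiced_start / voiced_end are sample indices of the voiced portion,
    assumed to have been marked beforehand; the initial fricative [s]
    lies before voiced_start and is left intact in both versions.
    """
    signal, sr = sf.read(wav_path)
    voiced_len = voiced_end - voiced_start
    edge = int(round(0.15 * voiced_len))  # first/final 15% of voicing

    # Silent-center: keep the first and final 15% of voicing, zero the middle 70%.
    silent_center = signal.copy()
    silent_center[voiced_start + edge:voiced_end - edge] = 0.0

    # Onset-only: keep only the first 15% of voicing, zero the remainder.
    onset_only = signal.copy()
    onset_only[voiced_start + edge:voiced_end] = 0.0

    # Samples are zeroed rather than removed, so overall duration is unchanged.
    return silent_center, onset_only, sr
```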

2. Participants

Participants included 36 English-speaking musicians and 36 age- and gender-matched nonmusicians. They were recruited from the student population at Ohio University with cash compensation. All reported normal hearing, speech, and language. No one had prior experience with Mandarin or any tone languages.

The musicians included 20 females (mean age = 24 years, SD = 4.5) and 16 males (mean age = 21 years, SD = 2.6). All were junior, senior, or graduate music majors in the School of Music at Ohio University. Their areas of concentration included performance, voice performance, music education, music therapy, music theory, composition, and history/literature. All musicians were required to demonstrate competency in a major and a secondary instrument among other musical proficiency requirements (Ohio University School of Music Undergraduate Handbook, 2007-2008). The average age at which the musicians first received their music training was 9.4 years (SD = 3.2).

The nonmusicians included 20 females (mean age = 20 years, SD = 1.36) and 16 males (mean age = 21 years, SD = 1.91). No one reported any formal musical training or substantial musical learning experience.

3. Procedure

The stimuli, saved as individual audio files, were imported to AVRunner, the subject-testing program in BLISS, for stimulus presentation and response data acquisition. The experiment was divided into three sections: the 128 intact syllables were presented first, followed by the 128 silent-center syllables, and finally the 128 onset-only syllables. Considering the challenging nature of the task (i.e., identifying a novel linguistic contrast in an unfamiliar language with minimal instructions), the presentation of the stimuli was intentionally ordered from those with the most complete acoustic input to those with the least acoustic input.


Breaks were given between sections, when instructions for the next section were given. For each section, the 128 stimuli produced by the 32 speakers were assigned to four blocks such that each block included only one stimulus from a given speaker. In other words, each block had 32 stimuli and all stimuli were produced by different speakers. Within each block, the number of male and female speakers was balanced (16 male and 16 female), as was the number of the four tones (eight stimuli for each of the four tones). For each participant, AVRunner assigned a uniquely randomized presentation order such that no two participants received the same order of presentation. The order of presentation for the blocks was also randomized for each participant. A 10 s break was given between blocks.

Participants were tested individually in a sound-treated room. The experimenter first explained the lexically contrastive function of tones in Mandarin by giving minimal tone pairs as examples. The F0 contours of the four Mandarin tones were drawn on a blackboard to illustrate the differences among the four tones, which were also verbally described to the participants: the high-level tone (tone 1) begins with a high F0 and stays level throughout; the high-falling tone (tone 4) begins with a high F0 and moves downward to a low F0; the low-rising tone (tone 2) begins with a low F0 and moves upward to a high F0; and the low-dipping tone (tone 3) begins with a low F0 and either stays low or makes a dip in F0. In accordance with these descriptions, the participants were directed to four keys on the keyboard labeled with "→," "↘," "↗," and "√," representing the high-level (tone 1), high-falling (tone 4), low-rising (tone 2), and low-dipping (tone 3) tones, respectively. Despite the common use of number designations for Mandarin tones, the number system was never introduced to the participants to avoid any additional processing demand of memorizing the arbitrary association between F0 patterns and numbers. The four labels were placed on two rows, indicating the relative height of beginning F0 of the tones. In particular, keys representing the high-level (→) and high-falling (↘) tones were placed on the upper row and those representing the low-rising (↗) and low-dipping (√) tones were placed on the lower row. The participants were told that both F0 shape and F0 height were involved in the identity of a tone; therefore, both should be considered when making tone judgments. The participants were also told that since the syllables were produced by multiple male and female speakers, the F0 range was likely to differ across speakers. Therefore, tone judgments should be based on the tones intended by specific speakers.

The participants listened to the stimuli through a pair of headphones connected to a Windows personal computer. Ten practice stimuli, none produced by any speakers in the actual experiment, were given prior to each section to familiarize the participants with the procedure and the stimulus/response format. The participants were instructed to identify the tone of each syllable by pressing one of the four labeled buttons. They were also told that their responses would be timed and that they should attempt to respond as quickly as possible.


Before the silent-center syllables were presented, the experimenter explained that the syllables to be presented had been digitally processed such that the center of the syllables was silenced and not audible. This point was further illustrated by erasing the center of the F0 contours drawn on the blackboard. Similarly, before the onset-only syllables were presented, it was explained that the syllables had gone through further processing such that only the beginning of the syllables would be audible.

4. Data analysis

Response accuracy and reaction time were automatically recorded by BLISS. Reaction time was measured from stimulus offset to avoid the potential confound of intrinsic duration differences among the four tones. Only correct responses were included in the reaction time analysis.

Response data from all three types of syllables were first processed to evaluate the effects of musical background, syllable type, and tone. Specifically, analyses of variance (ANOVAs) were conducted on accuracy and reaction time with musical background (musicians and nonmusicians) as a between-subject factor, syllable type (intact, silent-center, and onset-only) and tone (1, 2, 3, and 4) as within-subject factors, and participants as a random factor. The data were then evaluated for each syllable type separately to examine the effects of tone and musical background. In particular, ANOVAs were conducted on accuracy and reaction time with tone as a within-subject factor, musical background as a between-subject factor, and participants as a random factor. When a main effect from the ANOVAs was significant, the Bonferroni post hoc test was used for pairwise means comparisons to keep the familywise type I error rate at 5%. Missing cells in the reaction time data due to incorrect responses constituted 0.7% of the onset-only data for the musicians, 0.7% of the silent-center data for the nonmusicians, and 0.7% of the onset-only data for the nonmusicians. Separate ANOVAs were conducted with and without replacing the missing data with listener group averages. Since the two sets of ANOVAs showed the same patterns, only the ANOVAs with the replacements will be reported.
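As a rough illustration of this analysis design (the authors do not name their statistics software), a mixed-design ANOVA with musical background as a between-subject factor and syllable type as a within-subject factor could be run as follows. The pingouin package, the CSV file, and the column names are assumptions, and the sketch omits the second within-subject factor (tone) of the full design.

```python
import pandas as pd
import pingouin as pg  # assumed available (pip install pingouin)

# Hypothetical long-format table: one row per participant x syllable type,
# with mean percent-correct identification as the dependent variable.
df = pd.read_csv("tone_accuracy_long.csv")

aov = pg.mixed_anova(
    data=df,
    dv="accuracy",           # percent correct
    within="syllable_type",  # intact, silent-center, onset-only
    subject="participant",
    between="group",         # musician vs. nonmusician
)
print(aov)
# Bonferroni-corrected pairwise comparisons would follow significant effects.
```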

B. Results

1. All types of syllables considered

For response accuracy, ANOVAs revealed significant main effects of musical background [F(1,70) = 46.52, p < 0.0001], syllable type [F(2,140) = 240.7, p < 0.0001], and tone [F(3,210) = 26.94, p < 0.0001]. There were also significant interactions including syllable type × musical background [F(2,140) = 37.28, p < 0.0001], tone × musical background [F(3,210) = 14.11, p < 0.0001], syllable type × tone [F(6,420) = 26.09, p < 0.0001], and syllable type × tone × musical background [F(6,420) = 4.22, p < 0.0005]. Overall, accuracy was higher for the musicians (51%, SD = 26) than for the nonmusicians (36%, SD = 19). As expected, accuracy was highest for the intact syllables (56%, SD = 24), followed by the silent-center syllables (45%, SD = 24), and lowest for the onset-only syllables (29%, SD = 14) [see the work of Gottfried (2007), who found no difference between intact and silent-center syllables]. All pairwise means comparisons were significant.


For the tone effect, accuracy was higher for tone 2 (49%, SD = 22), tone 1 (47%, SD = 26), and tone 4 (45%, SD = 27) than for tone 3 (33%, SD = 18). All pairwise means comparisons involving tone 3 were significant.

For reaction time, ANOVAs revealed significant main effects of syllable type [F(2,140) = 49.51, p < 0.0001] and tone [F(3,210) = 12.59, p < 0.0001]. There were also significant interactions including syllable type × musical background [F(2,140) = 4.22, p < 0.05], syllable type × tone [F(6,420) = 16.32, p < 0.0001], and syllable type × tone × musical background [F(6,420) = 2.24, p < 0.05]. In contrast to the accuracy results, no significant difference in reaction time was found between musicians (1240 ms, SD = 473) and nonmusicians (1380 ms, SD = 577). Consistent with the accuracy analysis, reaction time was shortest for intact syllables (1152 ms, SD = 531), followed by silent-center syllables (1228 ms, SD = 506), and longest for onset-only syllables (1549 ms, SD = 471). All pairwise means comparisons were significant except for the contrast between intact and silent-center syllables. For the tone effect, reaction time was shortest for tone 1 (1231 ms, SD = 468), followed by tone 3 (1249 ms, SD = 526), tone 2 (1337 ms, SD = 537), and tone 4 (1421 ms, SD = 571). Significant pairwise means comparisons included tone 1-tone 2, tone 1-tone 4, and tone 3-tone 4.

Given the significant interactions, the main effects needed to be interpreted further. Inspection of the accuracy data arranged by syllable type and musical background revealed that the difference between the musicians and nonmusicians became less as the amount of acoustic input was reduced. For intact tones, the musicians outperformed the nonmusicians substantially. When acoustic input became minimal, musical background did not facilitate tone identification anymore [see the work of Gottfried (2007), who found a difference between musicians and nonmusicians only for silent-center syllables]. For reaction time, the pattern is consistent with the accuracy analysis, i.e., the contrast between the musicians and nonmusicians became less as the amount of acoustic input was reduced.

When arranged by tone and musical background, the accuracy data showed that the musicians outperformed the nonmusicians in the identification of all tones but tone 3, indicating that the musicians' advantage was neutralized for tone 3. The reaction time data, in contrast, showed that the musicians took less time than the nonmusicians in identifying all tones but tone 2. Overall, however, this pattern is consistent with the accuracy result.

When arranged by syllable type and tone, the accuracy data showed that tone 2 and tone 4 identifications were minimally compromised when syllable center was removed [see the work of Gottfried (2007), who showed that tone 1 identification was compromised more by the missing center compared to the other tones]. Our interpretation is that tone 2 (low rising) and tone 4 (high falling) are contour tones with the beginning and ending F0's being quite distinct from each other. In contrast, the beginning and ending F0 for tone 1 (high level) and tone 3 (low dipping) are at approximately the same height and less distinct from each other. Consequently, when only the onset and offset of a syllable were available, tones with more distinct F0 onset-offset pairs could be detected more easily. For reaction time, there was little contrast among the four tones for the silent-center and onset-only syllables. For the intact syllables, however, reaction time for tone 3 was particularly short compared to other tones. Since tone 3 could be associated with a creaky voice quality (Gottfried and Suiter, 1997; Liu and Samuel, 2004), listeners might be able to detect the qualitative difference earlier in the syllable than the F0 contour. To decipher the three-way interactions, we analyzed the accuracy and reaction time data for each type of syllables separately. The results are reported below.

FIG. 2. Average accuracy and reaction time (+SE) of Mandarin tone identification as a function of musical background and tone for intact syllables (n = 36).

2. Intact syllables

Figure 2 shows the accuracy and reaction time of intact tone identification as a function of musical background and tone. For accuracy, the ANOVA revealed significant main effects of musical background [F(1,70) = 51.36, p < 0.0001] and tone [F(3,210) = 11.35, p < 0.0001], and a significant musical background × tone interaction [F(3,210) = 11.56, p < 0.0001]. As predicted, the musicians (68%, SD = 22) outperformed the nonmusicians (44%, SD = 20). When both groups of listeners were considered, tone 1 (64%, SD = 27) and tone 2 (59%, SD = 20) were identified more accurately than tone 3 (52%, SD = 10) and tone 4 (50%, SD = 32). Significant contrasts included tone 1-tone 3, tone 1-tone 4, and tone 2-tone 4. The interaction shows that the accuracy difference between musicians and nonmusicians was substantial for tones 1, 2, and 4, but was small for tone 3.



For reaction time, the ANOVA revealed significant main effects of musical background [F(1,70) = 6.3, p < 0.05] and tone [F(3,210) = 37.34, p < 0.05]. There was no interaction. Consistent with the accuracy results, the musicians (1029 ms, SD = 440) responded faster than the nonmusicians (1275 ms, SD = 585). When both groups of listeners were considered, tone 3 (906 ms, SD = 360) was identified most quickly, followed by tone 1 (1081 ms, SD = 467), tone 2 (1215 ms, SD = 516), and tone 4 (1405 ms, SD = 625). All pairwise comparisons were significant.

In summary, tone identification performance for the intact syllables was quite impressive considering the brief instructions on the novel linguistic contrast in the unfamiliar language, the speaker variability in the stimuli, and the pressure to respond quickly. Although both the musicians and nonmusicians identified the tones with accuracy exceeding chance (25%), the musicians were more accurate and faster than the nonmusicians. This result is consistent with the idea that musical training facilitates linguistic tone perception.

FIG. 3. Average accuracy and reaction time (+SE) of Mandarin tone identification as a function of musical background and tone for silent-center syllables (n = 36).

3. Silent-center syllables

Figure 3 shows the accuracy and reaction time of tone identification for the silent-center syllables. For accuracy, the ANOVA revealed significant main effects of musical background [F(1,70) = 47.87, p < 0.0001] and tone [F(3,210) = 46.66, p < 0.0001], and a significant musical background × tone interaction [F(3,210) = 11.75, p < 0.0001]. With the middle 70% of the voiced portion of a syllable silenced, the musicians (54%, SD = 25) still outperformed the nonmusicians (36%, SD = 20).


When both groups of listeners were considered, tone 2 (58%, SD = 19) and tone 4 (52%, SD = 29) were identified more accurately than tone 1 (41%, SD = 21) and tone 3 (27%, SD = 13). All pairwise means comparisons were significant except for the contrast between tone 2 and tone 4. The interaction shows that the accuracy difference between musicians and nonmusicians remained substantial for tones 1, 2, and 4, but was minimal for tone 3. This is the same pattern as what was found for the intact syllables.

For reaction time, the ANOVA revealed a significant main effect of tone [F(3,210) = 3.61, p < 0.05]. There was no effect of musical background or interaction. When both groups of listeners were considered, tone 2 (1143 ms, SD = 475) was identified most quickly, followed by tone 1 (1186 ms, SD = 422), tone 4 (1288 ms, SD = 572), and tone 3 (1295 ms, SD = 536). Only the tone 2-tone 3 contrast was significant. Although musicians (1144 ms, SD = 354) on average responded faster than nonmusicians (1312 ms, SD = 612), the difference was not significant.

In summary, identification accuracy dropped and reaction time increased as acoustic input was reduced. The musicians were still more accurate, but no longer faster than the nonmusicians. The average identification accuracy (54%) for the musicians remained above chance despite missing the syllable center. Identification accuracy for tone 2 and tone 4 was hardly compromised compared to intact tone 2 and tone 4. As noted, our interpretation was that tone 2 (low rising) and tone 4 (high falling) have more distinct beginning F0 and ending F0 compared to tone 1 (high level) and tone 3 (low dipping). When the onset and offset of a syllable were available, tones with more distinct onset-offset F0 pairs could be detected more easily, contributing to the higher identification accuracy. This finding also indicated potential benefits of highlighting the beginning and end of these two tones for listeners who have not had extensive experience with the language.

4. Onset-only syllables

Figure 4 shows the accuracy and reaction time of tone identification for the onset-only syllables. The ANOVA revealed a significant main effect of tone [F(3,210) = 17.04, p < 0.0001] and a significant musical background × tone interaction [F(3,210) = 3.79, p < 0.05]. Although the musicians (31%, SD = 15) on average still outperformed the nonmusicians (28%, SD = 14), the difference was not significant [F(1,70) = 3.57, p = 0.06]. When both groups of listeners were considered, tone 1 (35%, SD = 18), tone 4 (32%, SD = 13), and tone 2 (31%, SD = 12) were identified more accurately than tone 3 (20%, SD = 11). All pairwise means comparisons involving tone 3 were significant. The interaction shows that the musicians were more accurate than the nonmusicians only for tones 1 and 4.

For reaction time, the ANOVA revealed a significant main effect of tone [F(3,210) = 6.91, p < 0.0005] and a significant musical background × tone interaction [F(3,210) = 2.73, p < 0.05]. As with the accuracy analysis, there was no difference between the musicians and nonmusicians. When both groups of listeners were considered, tone 1 (1427 ms, SD = 451) was identified most quickly, followed by tone 3 (1544 ms, SD = 460), tone 4 (1571 ms, SD = 480), and tone 2 (1653 ms, SD = 477). Significant contrasts included tone 1-tone 4 and tone 1-tone 2.

Consistent with the accuracy analysis, the interaction shows minimal difference between the musicians and the nonmusicians for tones 1 and 4.

In summary, identification accuracy dropped further as acoustic input was reduced to onset only. In contrast to the previous two syllable types, both accuracy and reaction time for the onset-only syllables were comparable between the musicians and nonmusicians. For both groups of listeners, identification accuracy fell close to chance level. Unlike the silent-center syllables, where listeners might be able to reconstruct tone shapes based on the onset and offset F0, the interpolation strategy would not be useful when only the onset was present. F0 contour information is also unlikely to be useful for tone judgments given the brevity of the onset-only syllables (Greenberg and Zee, 1979). Although F0 height could serve to broadly classify the four tones into high-onset (tones 1 and 4) and low-onset (tones 2 and 3) tones, the high-low judgment is necessarily a relative one and would require estimation of a speaker's F0 range as a reference frame. However, the presence of multiple speakers in the stimulus set might have added to the challenge of reliable F0 range estimation, which could have further contributed to the overall low accuracy.

FIG. 4. Average accuracy and reaction time (+SE) of Mandarin tone identification as a function of musical background and tone for onset-only syllables (n = 36).

C. Acoustic analyses

The Mandarin tone experiment revealed several noteworthy response patterns regarding the effects of musical background, syllable type, and specific tones. To interpret these patterns, we have referred to the nature of the acoustic input, e.g., the type of F0 information in the acoustic signal and the differences among the tones. To validate these interpretations, acoustic analyses were conducted on the duration and F0 of the stimuli to verify that the acoustic contrasts were actually present. These two measures were chosen because duration is a direct measure of the amount of acoustic input, and F0 has been established to be the primary acoustic correlate of Mandarin tones.

1. Duration

TABLE I. Average duration (in ms) of the onset/offset and center of the voiced portion of the syllables. The onset constituted 15%, center 70%, and offset 15% of the voiced portion of the syllables. Standard deviations are shown in parentheses.

Speaker   Tone   Onset/Offset   Center
Female    1      58 (12)        279 (58)
          2      59 (15)        274 (68)
          3      65 (23)        304 (110)
          4      34 (8)         159 (37)
Male      1      42 (7)         197 (35)
          2      42 (9)         198 (43)
          3      45 (17)        210 (77)
          4      27 (5)         127 (25)

Table I shows the average duration of the three components (onset, center, and offset) of the voiced portion of the syllables. Recall that the onset and offset each constituted 15% of the voiced portion and the center included 70% of the voiced portion. An ANOVA was conducted on the duration of the onset with speaker gender (female and male) and tone (1, 2, 3, and 4) as fixed factors and speakers as a random factor. No further ANOVAs were conducted on center or offset duration separately since they were proportional to the onset duration and would have generated the same results.

For duration, the ANOVA revealed significant main effects of speaker gender [F(1,120) = 39.35, p < 0.0001] and tone [F(3,120) = 21.32, p < 0.0001]. There was no interaction. Onset duration was longer for the female speakers (54 ms, SD = 19) than for the male speakers (39 ms, SD = 12). When both female and male speakers were considered, the onset duration for tone 3 (55 ms, SD = 22), tone 2 (51 ms, SD = 15), and tone 1 (50 ms, SD = 13) was longer than tone 4 (31 ms, SD = 8). All pairwise contrasts involving tone 4 were significant. Center duration was likewise longer for the female speakers (252 ms, SD = 91) than for the male speakers (183 ms, SD = 58). When both female and male speakers were considered, the center duration for tone 3 (257 ms, SD = 105), tone 2 (236 ms, SD = 68), and tone 1 (233 ms, SD = 60) was longer than tone 4 (143 ms, SD = 35).

In summary, the tones produced by the female speakers were on average longer than those produced by the male speakers, indicating that the female speakers were speaking at a slower rate. In addition, tone 4 was consistently the shortest tone, which is consistent with the literature on the intrinsic duration difference among the four tones (e.g., Tseng, 1981). Importantly, duration did not predict the relative identification accuracy in the task. In particular, tone 4 was the shortest and thus had the least amount of acoustic input, but it was often identified most accurately.


On the other hand, even though tone 3 had the longest duration among the four tones, it was consistently identified with the lowest accuracy. Therefore, duration alone was not a good predictor of Mandarin tone identification performance. It should be noted, however, that intrinsic duration difference has been shown to be a useful cue to Mandarin tone identification for native listeners (Blicher et al., 1990; Liu and Samuel, 2004; Whalen and Xu, 1992; Xu et al., 2002).

FIG. 5. The F0 contours of the four Mandarin tones produced by the 16 female (top) and 16 male (bottom) speakers.

2. Fundamental frequency

Figure 5 shows the F0 contours of the four tones produced by all speakers, shown separately for the female and male speakers. Despite obvious variability among the speakers, the F0 contours are generally consistent with traditional descriptions, i.e., high level for tone 1, low rising for tone 2, low dipping for tone 3, and high falling for tone 4. Recall from the perception experiment that identification of silent-center tone 2 and tone 4 was hardly compromised compared to intact tone 2 and tone 4. The perceptual resistance to the silent-center modification was attributed to their distinct onset-offset F0 values. Our acoustic data here are consistent with the interpretation. Finally, the difference in F0 range between female and male speakers was expected. That is, the female speakers had a higher F0 range than the male speakers.
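The paper does not state which tool produced the F0 tracks in Fig. 5; one hedged way to obtain comparable contours is through Praat via the parselmouth package, as sketched below (file names and the pitch floor/ceiling are illustrative assumptions).

```python
import parselmouth  # Python interface to Praat; assumed available

def extract_f0(wav_path, floor_hz=75.0, ceiling_hz=500.0):
    """Return (times, f0) for one syllable; unvoiced frames come back as 0 Hz."""
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(pitch_floor=floor_hz, pitch_ceiling=ceiling_hz)
    return pitch.xs(), pitch.selected_array["frequency"]

# e.g., gather contours for the four tones of one speaker (hypothetical file names)
contours = {tone: extract_f0(f"sa_tone{tone}_f01.wav") for tone in (1, 2, 3, 4)}
```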

D. Summary and discussion

With brief instructions and under time pressure, musicians without prior Mandarin experience were able to identify Mandarin tones with 68% accuracy, exceeding chance (25%). When the majority of the voiced portion of the syllables was digitally silenced and devoid of F0 information, the musicians managed to identify the tones with 54% accuracy, still exceeding chance. The accuracy dropped to 31% when the acoustic input was further reduced to the beginning 15% of the voiced portion of the syllable, indicating syllable onset alone was not sufficient for reliable tone identification.


With identical instructions, age- and gender-matched nonmusicians did not perform as well as the musicians. In particular, the nonmusicians identified the intact, silent-center, and onset-only syllables with accuracies at 44%, 35%, and 27%. The overall higher identification accuracy by the musicians indicates that listeners with a musical background had an advantage in lexical tone identification. This observation is consistent with studies showing superior performance in Mandarin tone perception by English listeners with musical training (Gottfried, 2007; Gottfried and Riester, 2000; Gottfried et al., 2001; Alexander et al., 2005; Wong et al., 2007). Nonetheless, the contrast in identification accuracy alone did not address the nature of the processing difference between the musicians and nonmusicians.

In the current study, comparisons among the three types of syllables and among the four tones further revealed similarities and differences between the musicians and nonmusicians in perceiving Mandarin tones. Both groups showed a linear decrease in accuracy and increase in reaction time as acoustic input was reduced. Both groups also showed little drop in accuracy for silent-center tone 2 and tone 4 compared to intact tone 2 and tone 4, indicating that the ability to reconstruct tone identity based on onset-offset information was present in both groups of listeners.

On the other hand, the musicians were different from the nonmusicians in how the tone identification performance was influenced by the acoustic modification. Specifically, the advantage of musicians became less obvious when acoustic input was reduced. When tonal information was fully present in the intact syllables, the musicians were both more accurate and faster in tone identification [see the work of Gottfried (2007), who found no accuracy difference between musicians and nonmusicians]. When the majority of the F0 information was unavailable in the silent-center syllables, the musicians were still more accurate but were no longer faster [see the work of Gottfried (2007), who also showed an accuracy difference between musicians and nonmusicians].

Finally, when F0 information was limited to the beginning 15% of the voiced portion of the syllables, the musicians were neither more accurate nor faster than the nonmusicians. In other words, this experiment revealed that the musicians' processing advantage in tone identification depends on the type of input available in the acoustic signal. Musical background did help in lexical tone identification, but the advantage was compromised by the reduction of acoustic input.

While these results revealed further information about lexical tone perception by non-native musicians and nonmusicians, the question remained of what specific component of musical background contributed to the Mandarin tone identification performance. As noted, both F0 direction and F0 height are involved in Mandarin tone distinctions. It could be the enhanced sensitivity to F0 direction/movement (Gottfried, 2007), the ability to detect F0 height, or both, that contributed to the musicians' better identification. Although the musicians as a group outperformed the nonmusicians, there was considerable variability among the musicians in their lexical tone identification performance. On the other hand, it is possible that some variability could also exist in the musicians' ability to process musical pitch information. Therefore, it would be of interest to explore whether the performance in lexical tone identification is correlated with performance in musical pitch perception. As noted, Gottfried and Riester (2000), also reported in Gottfried (2007), showed higher accuracy by musicians in identifying both Mandarin tones and sine-wave tone glides with analogous F0 shifts, suggesting that the ability to perceive F0 direction underlies the superior performance by musicians in Mandarin tone identification.

In the next experiment, we evaluate the potential role of F0 height detection in Mandarin tone identification. To that end, we adopted a task that gauges the musicians' ability to name a musical note without a reference pitch, i.e., absolute pitch (Deutsch et al., 2006), as an index of musical pitch processing. As noted, absolute pitch has been implicated in the acquisition and processing of lexical tones (Deutsch et al., 2004a; Deutsch et al., 2006). If absolute pitch is indeed involved in lexical tone perception, we would expect that musicians with higher scores in the absolute pitch task would also perform better in the lexical tone identification task.

III. EXPERIMENT 2: MUSICAL NOTE IDENTIFICATION BY MUSICIANS

In this experiment, the 36 musicians who participated in the Mandarin tone experiment were asked to listen to synthesized musical tones of three timbres (pure tone, piano, and viola) and to identify the notes in the absence of a reference pitch. The setup of the experiment was identical to the absolute pitch task used in Deutsch et al. (2006) except for the use of two additional timbres in the stimuli. The use of multiple timbres was motivated by the finding that the manifestation of absolute pitch could depend on the specific instruments used to generate the musical notes (Lockhead and Byrd, 1981). It is therefore informative to evaluate the potential impact of timbre on musical note identification.

A. Method

1. Materials

Thirty-six notes that spanned a three-octave range from C3 (131 Hz) to B5 (988 Hz) with the equal-tempered scale were synthesized with three timbres (pure tone, piano, and viola) for a total of 108 notes. The pure tones were synthesized using the MATLAB software (The MathWorks). The piano and viola notes were synthesized using a Kurzweil K2000 synthesizer tuned to the standard A4 at 440 Hz. The duration for all notes was 500 ms. For each timbre, the 36 notes were ordered such that any two consecutive notes were separated by more than an octave. As noted by Deutsch et al. (2006), this was done to prevent listeners from developing relative pitch as a reference for the task. The 36 notes of a given timbre were divided into three blocks of 12 notes, with a 5 s interstimulus interval and a 10 s break between the blocks. Notes of a given timbre were always presented together. To counterbalance the order of timbre presentation, six lists of stimulus presentation were created: (1) pure tone-piano-viola; (2) pure tone-viola-piano; (3) piano-pure tone-viola; (4) piano-viola-pure tone; (5) viola-pure tone-piano; and (6) viola-piano-pure tone. The 36 musician participants were randomly assigned to receive one of the six lists such that a given list was used for six participants.
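For reference, the equal-tempered frequencies of the 36 notes, and pure-tone versions of them, can be generated as in the sketch below. This is an illustrative Python approximation, not the authors' MATLAB/Kurzweil procedure; the sample rate is an assumption (the 500 ms duration, A4 = 440 Hz tuning, and C3-B5 range are from the text).

```python
import numpy as np

SR = 44100   # sample rate in Hz; assumed, not reported for the pure tones
DUR = 0.5    # 500 ms per note, as in the experiment
A4 = 440.0   # standard tuning used for the synthesizer

def midi_to_hz(m):
    """Equal temperament: each semitone is a factor of 2 ** (1 / 12)."""
    return A4 * 2.0 ** ((m - 69) / 12.0)

# C3 is MIDI note 48 and B5 is MIDI note 83: 36 notes spanning three octaves.
freqs = [midi_to_hz(m) for m in range(48, 84)]  # about 130.8 Hz ... 987.8 Hz

def pure_tone(freq, dur=DUR, sr=SR):
    t = np.arange(int(dur * sr)) / sr
    return 0.5 * np.sin(2 * np.pi * freq * t)

tones = [pure_tone(f) for f in freqs]
```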

2. Participants

The 36 musicians in the Mandarin tone identification experiment participated in this experiment.

3. Procedure

The synthesized musical notes were saved as individual audio files and imported to AVRunner for stimulus presentation. Twelve practice trials, synthesized as oboe sounds with the Kurzweil K2000 synthesizer, were given prior to the actual experiment to familiarize the participants with the presentation format. The practice notes were selected from the same pitch range (C3-B5) and were also 500 ms long. As in the actual experiment, any two consecutive notes were separated by more than an octave. None of the practice notes appeared in the actual experiment. No feedback was given.

Participants listened to the stimuli through a pair of headphones connected to a Windows personal computer. They were instructed to write down the notes on a customized staff paper. In particular, the participants were told that they would be listening to nine blocks of 12 notes of three timbres ranging from C3 to B5. Their task was to notate the notes that they had heard on the staff paper immediately after each note was played and to apply accidental signatures if applicable. The participants were also told that there would be 5 s to respond to each note and that there would be a 10 s break between blocks.

4. Data analysis

The written responses were graded by the second author of this study (a musician). One-way repeated measures ANOVAs were conducted on response accuracy with timbre (pure tone, piano, and viola) as a within-subject factor and participants as a random factor.


There were four dependent variables: (1) percentage of accurate note identification, (2) percentage of accurate note identification allowing one-semitone errors, (3) percentage of accurate note identification allowing two-semitone errors, and (4) percentage of accurate note identification allowing three-semitone errors. When a main effect from the ANOVAs was significant, the Bonferroni post hoc test was used for pairwise means comparisons to keep the familywise type I error rate at 5%.

The motivation for allowing up to three-semitone errors in the dependent measures was to avoid potential floor-level performance across the board due to the challenging nature of the task. In particular, Deutsch et al. (2006) found that only 15% of the English-speaking musicians at the Eastman School of Music were able to attain identification accuracy of 85% or higher in a similar task. If correct identification rate was uniformly low and individual variability was completely absent, the correlation analyses to be performed between music note and lexical tone identification would not be fully meaningful. Our purpose was not to evaluate the percentage of musicians with absolute pitch per se, but rather to obtain a measure of the musicians' absolute pitch ability for the correlation analyses.

FIG. 6. A box plot showing the accuracy of musical note identification for piano, pure tone, and viola as a function of the number of semitone errors allowed.
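The planned analyses amount to 12 Pearson correlations between the three tone-accuracy measures and the four note-accuracy measures. A minimal sketch follows; the data file and column names are assumptions, not the authors' materials.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical per-musician summary table (36 rows, one per musician).
df = pd.read_csv("musician_scores.csv")
tone_measures = ["intact_acc", "silent_center_acc", "onset_only_acc"]
note_measures = ["note_acc_0st", "note_acc_1st", "note_acc_2st", "note_acc_3st"]

for t in tone_measures:
    for n in note_measures:
        r, p = pearsonr(df[t], df[n])
        print(f"{t} vs {n}: r = {r:.3f}, p = {p:.3f}")
```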

B. Results

Figure 6 shows a box plot for the accuracy of musical note identification arranged by timbre and the number of semitone errors allowed. When an exact match was required, accuracy was uniformly low for all three timbres: pure tone (mean = 12%, SD = 12), piano (mean = 11%, SD = 10), and viola (mean = 11%, SD = 10). Since there are 12 semitones within an octave, chance level performance is 8.3%. The obtained averages therefore indicate that the musicians' identification accuracy was within chance level. Inspection of individual data showed that the highest accuracy attained was 64%, i.e., none of the musicians met the 85% criterion for absolute pitch as defined in Deutsch et al. (2006). The ANOVA revealed no significant timbre effect.


When one-semitone errors were allowed, accuracy was expectedly higher: piano (mean = 26%, SD = 17), pure tone (mean = 26%, SD = 14), and viola (mean = 24%, SD = 11). Although these averages appear to be improved, they are still within chance level. In particular, permitting one-semitone errors effectively allows a range of three semitones, indicating a chance level of 25%. Inspection of individual data showed that the highest accuracy attained was 75%, i.e., none of the musicians met the 85% criterion for absolute pitch. The ANOVA revealed no timbre effect.

When two-semitone errors were allowed, accuracy became higher as expected: piano (mean = 42%, SD = 18), pure tone (mean = 40%, SD = 16), and viola (mean = 35%, SD = 14). Nonetheless, these averages are still within chance level. Specifically, permitting two-semitone errors effectively allows a range of five semitones, indicating a chance level of 41.7%. Inspection of individual data showed that the highest accuracy attained was 78%. The ANOVA revealed a significant timbre effect [F(2,70) = 3.41, p < 0.05]. Pairwise means comparisons showed the accuracy for piano was higher than that for viola.

When three-semitone errors were allowed, accuracy became expectedly higher: piano (mean = 53%, SD = 19), pure tone (mean = 51%, SD = 18), and viola (mean = 47%, SD = 17). As before, although these averages appear to be further improved, they are still within chance level. In particular, permitting three-semitone errors effectively allows a range of seven semitones, indicating a chance level of 58.3%. Inspection of individual data showed that the highest accuracy attained was 92%. Four of the 36 musicians identified the notes at 85% correct or higher. The ANOVA revealed no significant timbre effect.
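The chance levels cited above follow from the 12 pitch classes of the octave: allowing k semitones of error means a window of 2k + 1 of the 12 classes counts as correct. A quick check of the arithmetic:

```python
# Chance = (2k + 1) / 12 for responses distributed uniformly over 12 pitch classes.
for k in range(4):
    print(f"±{k} semitone(s) allowed: chance = {(2 * k + 1) / 12:.1%}")
# 0 -> 8.3%, 1 -> 25.0%, 2 -> 41.7%, 3 -> 58.3%
```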

C. Summary and discussion

The task of identifying musical notes without a reference pitch was apparently challenging. By the criterion used in Deutsch et al. (2006), none of the musicians participating in this experiment qualified as possessing absolute pitch. Even when the scoring criterion was relaxed to allow errors of as many as three semitones, average identification performance remained at chance level. Recall that the average age at which our musicians first received their music training was 9.4 years (SD = 3.2). Inspection of the data in Deutsch et al. (2006, Fig. 1) revealed that none of their English-speaking musicians in the corresponding age (of commencement of musical training) group possessed absolute pitch either. Our result is therefore consistent with the idea that a critical period is involved in the acquisition of absolute pitch.

Overall, timbre did not make a difference except when two-semitone errors were allowed. In this case, accuracy was higher for piano sounds than for viola sounds. This result is perhaps not surprising considering that all our musicians were required by their program of study to demonstrate proficiency in piano irrespective of their major instrument. Consequently, on average, they are likely to have more experience with piano than with viola.

Finally, individual variability can be clearly seen from the box plot, more prominently when the scoring criterion was relaxed to allow more errors.

was relaxed to allow more errors. However, on average, performance in the musical note identification task was still within chance level, indicating none of the musicians used in the current study could be claimed to possess absolute pitch. The absence of absolute pitch in the musicians posed a challenge to the interpretation of the planned correlation analyses between the accuracy measures in the Mandarin tone identification task 共response accuracy for intact, silent-center, and onset-only syllables兲 and those in the musical note identification task 共accuracy of note identification allowing zero-, one-, two-, and three-semitone errors兲. Even though our preliminary analyses showed only 1 significant correlation out of the 12 possible correlations 共r = 0.329, p ⬍ 0.05, found between intact syllables identification and musical note identification when three-semitone errors were allowed兲, a different pattern could emerge for musicians who do possess absolute pitch. IV. GENERAL DISCUSSION
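The planned analysis amounts to a 3 × 4 matrix of Pearson correlations between the tone-identification and note-identification accuracy measures. A minimal sketch of that computation is given below; the column names and accuracy values are hypothetical placeholders rather than the study's data:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 36  # number of musicians

# Placeholder per-musician accuracy scores (proportions correct).
tone = pd.DataFrame({"intact": rng.uniform(0.4, 0.9, n),
                     "silent_center": rng.uniform(0.3, 0.8, n),
                     "onset_only": rng.uniform(0.2, 0.6, n)})
note = pd.DataFrame({"exact": rng.uniform(0.0, 0.3, n),
                     "one_semitone": rng.uniform(0.1, 0.4, n),
                     "two_semitones": rng.uniform(0.2, 0.5, n),
                     "three_semitones": rng.uniform(0.3, 0.7, n)})

# 3 tone measures x 4 note measures = 12 correlations, as in the text.
for t in tone.columns:
    for m in note.columns:
        r, p = pearsonr(tone[t], note[m])
        print(f"{t:>13s} vs {m:>15s}: r = {r:+.3f}, p = {p:.3f}")
```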

IV. GENERAL DISCUSSION

Two perception experiments investigated the processing of linguistic and musical pitch by musicians and nonmusicians. The Mandarin tone experiment showed that the musicians were able to identify multispeaker intact and silent-center Mandarin tones with accuracy exceeding chance. The musicians also outperformed age- and gender-matched nonmusicians, even though the group difference became smaller as acoustic input was reduced. In the musical pitch task given to the musicians, accuracy of musical note identification remained at chance level even when errors of up to three semitones were allowed, indicating that none of the musicians in this study possessed absolute pitch as defined in Deutsch et al. (2006).

Although the finding that musicians outperformed nonmusicians in lexical tone perception is not new (Alexander et al., 2005; Gottfried, 2007; Gottfried and Riester, 2000; Gottfried et al., 2001; Wong et al., 2007), the present study revealed new information about the nature of lexical tone perception by non-native listeners with or without a musical background. In particular, by simulating common challenges to speech perception with degraded stimuli and speaker variability, the present study showed that the benefit of musical experience varied depending on the type of acoustic input. Specifically, even though the musicians were faster and more accurate in identifying intact Mandarin tones, the reaction time advantage disappeared for silent-center syllables. When acoustic input was further reduced to onset-only syllables, the accuracy advantage also disappeared. In other words, the advantage of musical training in learning to identify non-native lexical tones depends on the type of acoustic input. It would be of interest to investigate whether other types of degraded signals (e.g., signals presented in noise) would exert a similar effect.

In addition, the group-tone interaction found in all three types of syllables (Figs. 2-4) showed that the musicians' advantage arose primarily from tones 1, 2, and 4. In contrast, tone 3 identification accuracy was comparable between the two groups of listeners. The musicians were particularly error prone in tone 3 identification. For the nonmusicians, tone 3 also generated the lowest accuracy among the four tones in silent-center and onset-only syllables. Taken together, it appears that tone 3 was generally difficult to identify for both groups. The difficulty could be attributed to the variable nature of tone 3. Although phonologically specified as a low-dipping tone with a falling-rising contour, the rising part does not normally show up phonetically except in isolation or in phrase-final position. In addition, tone 3 is involved in a common tone sandhi whereby tone 3 becomes tone 2 (a rising tone) when followed by another tone 3. Finally, for many speakers, there is often glottalization in the middle of the tone, resulting in breaks in the F0 contour (Gottfried and Suiter, 1997; Liu and Samuel, 2004). Although the tone 3 tokens used in this study were all produced in isolation, the acoustic data (Fig. 5) showed that not all tone 3 tokens maintained the canonical F0 contour and that there were visible F0 breaks for some speakers. The variable nature of this tone could have contributed to the difficulty in identification and to the reduced advantage of the musicians.

Despite these contrasts, there were some similarities between the musicians and nonmusicians. It is perhaps not surprising that both groups showed a linear drop in accuracy and a corresponding increase in reaction time as a function of the type of acoustic input. More interestingly, analysis by tone revealed that the four tones were influenced differently by the acoustic reduction. In particular, for both groups of listeners, tone 2 and tone 4 identification was minimally affected by the removal of the syllable center. That is, both groups were able to recover the identity of these two tones despite the missing center. It was hypothesized that the distinct onset and offset F0 of tone 2 and tone 4 would be particularly useful for reconstructing the pitch contour when the syllable center was missing. This observation was supported by our acoustic data (Fig. 5).

Nonetheless, the overall accuracy for the musicians (68% for intact tones and 54% for silent-center tones) is quite remarkable considering the minimal instructions given, the novel linguistic contrast in an unfamiliar language, the speaker variability, and the time pressure to respond quickly. In a study with a similar speeded-response task by Lee et al. (2006), 40 English-speaking listeners with one to three years of Mandarin experience achieved accuracy of 76% for intact tones and 59% for silent-center tones (native listeners in Lee et al., 2008a achieved 97% for intact tones and 86% for silent-center tones). Gottfried and Suiter (1997), in a study with a similar task but without speeded response, reported average accuracy of 79% for intact tones and 65% for silent-center tones by nine English-speaking listeners with an average of five years of Mandarin experience (native listeners in the same study achieved 98% for intact tones and 90% for silent-center tones). Importantly, the stimuli in Lee et al. (2006) and Gottfried and Suiter (1997) were produced by a single speaker; i.e., the participants in those two studies did not have to deal with speaker variability. With substantially less Mandarin experience but these added challenges, the musicians' tone identification performance in the present study is quite impressive.


What could be the basis for the superior performance by the musicians in tone identification? We attempted to evaluate the hypothesis that absolute pitch is implicated in lexical tone processing (Deutsch et al., 2004a; Deutsch et al., 2006) by examining the correlation between Mandarin tone identification and absolute pitch performance. However, the absence of absolute pitch in our musician participants made it impossible to fully evaluate the correlation results. In particular, the lack of robust correlation in the current data cannot rule out a potential association between absolute pitch and lexical tone perception. Significant correlations between Mandarin tone identification and musical note identification might be obtained if the Mandarin tone task were given to musicians who actually possess absolute pitch.

In addition, the nature of the stimuli and the response format may have encouraged the participants to attend more to F0 contour information than to F0 height information. In particular, since a large number of speakers were used to generate the stimuli, there was considerable overlap in F0 range for different tones (Fig. 5). Given the variability of F0 height introduced by the use of multiple speakers, listeners may have been inclined to use contour rather than height information. In addition, even though the response keys were arranged to reflect the relative height of the tones and the listeners were instructed to pay attention to both contour and height information, the keys did show tonal contours prominently, which could have primed the listeners to focus more on contour information.

Furthermore, the benefit of absolute pitch may be revealed in a task that evaluates the acquisition of lexical tones after substantial training. In particular, learning a novel linguistic contrast usually requires adult learners to have extensive time and exposure to the language. The present study, however, did not allow ample opportunities for the participants to become familiar with the Mandarin tonal contrasts. What was evaluated in the present study was rapid perceptual learning rather than the acquisition of lexical tones. It is possible that individuals with absolute pitch could demonstrate an advantage in the acquisition of lexical tones with more training.

Even though the musicians in this study did not possess absolute pitch, they did achieve accuracy values that substantially exceeded chance for intact and silent-center Mandarin tones, indicating a certain degree of success in learning to map the acoustic input onto Mandarin tonal categories. Their performance also approached that of non-native learners who had more Mandarin experience and who did not have to deal with speaker variability (Gottfried and Suiter, 1997; Lee et al., 2006). Furthermore, the musicians did outperform the nonmusicians in all but the onset-only syllable task. All these findings indicate that musical experience indeed facilitated lexical tone perception. Since none of the musicians in this study possessed absolute pitch, the advantage must have come from other sources. Gottfried and Riester (2000), also reported in Gottfried (2007), to our knowledge the only study that used both linguistic and nonlinguistic pitch tasks, suggested that musicians' superior performance in lexical tone identification may be attributed to an ability to track pitch movement effectively. This observation is consistent with recent neurophysiological evidence showing more faithful tracking of pitch by musicians than by nonmusicians (Wong et al., 2007).


In other words, the musicians' advantage in the present study may have come from a superior ability to track and remember the tonal patterns and to identify new examples of those patterns. More broadly, the literature on music and language has also identified an association between musical pitch discrimination and reading abilities in 5-year-olds (Anvari et al., 2002) and an association between pitch pattern perception and second-language skills (Slevc and Miyake, 2006). It would be of interest for future studies to explore the association between lexical tone processing and musical processing abilities other than absolute pitch, such as relative pitch detection and sensitivity to duration differences.

Although the present study did not provide evidence for an association between lexical tone perception and absolute pitch, the link between absolute pitch and lexical tone appears well motivated by other evidence. As Deutsch et al. (2004b) noted, both lexical tone and absolute pitch involve associating one or several pitches with a verbal label. Deutsch and co-workers also showed that a speaker's language background, linguistic experience in childhood, and vocal range are associated with the tritone paradox (Deutsch, 1991; Deutsch et al., 2004b; Deutsch et al., 1990). Dolson (1994) reported that the pitch range of an individual's speaking voice is constrained primarily by his or her linguistic community, suggesting that listeners could have formed pitch templates through exposure to the prevailing pitch range of their linguistic community. Consequently, they are able to use those templates in the processing of vocal pitch. This idea seems consistent with Honorof and Whalen (2005), who reported that English listeners can locate F0 height reliably from isolated and naturally produced speech without context or prior exposure to a speaker's F0 range. Finally, there is evidence for the use of absolute pitch by infants in perceptual learning (Saffran and Griepentrog, 2001). The acquisition of absolute pitch is also associated with early musical training, implying a critical period for absolute pitch acquisition (Deutsch et al., 2006). These parallels led to the speculation that, through exposure to the lexically contrastive use of pitch, an implicit form of absolute pitch is acquired by tone language speakers (Deutsch, 2006).

However, although absolute pitch and lexical tone both involve associating pitch with a verbal label, there are important differences between these two constructs. Absolute pitch requires associating a particular pitch with a verbal label without a reference pitch. Lexical tone, on the other hand, normally requires associating a verbal label with a pitch pattern, not just a single pitch. This is true at least for languages with contour tones, including Mandarin, where pitch contour plays a primary role in tone identity. It would be of interest for future studies to investigate the role of absolute pitch in processing tones in languages with level tones, which presumably would rely critically on the detection of pitch level.

Furthermore, although F0 is the primary acoustic correlate of lexical tones, it is not the only acoustic cue that listeners could use to uncover tone identity. As with all phonetic contrasts, there are multiple acoustic cues to lexical tone contrasts, including intrinsic duration and amplitude (e.g., Whalen and Xu, 1992), which explains why whispered tones can be identified even though F0 information is missing entirely (Liu and Samuel, 2004). In other words, to identify lexical tones is to process all relevant information from the output of a vocal tract and map that information onto linguistically significant categories. Although pitch is clearly implicated in lexical tone processing, processing lexical tone as a linguistic contrast is not restricted to processing pitch information.

Finally, just as they do for consonants and vowels, tone language speakers use lexical tones to encode a linguistic message, and listeners decode acoustic patterns to uncover the tones in order to decipher that message. That is, as with any phonetic distinction, lexical tones are linguistic contrasts. The association between an acoustic pattern and a verbal label can be acquired without explicit instruction by normally developing native speakers of a tone language. In contrast, the verbal labels associated with absolute pitch (i.e., note names) have to be taught. While pitch processing is no doubt implicated in both tasks, the functional contrast in whether pitch is interpreted linguistically could be a critical difference (e.g., Remez et al., 1981). The extent to which common processing mechanisms are involved in the two domains awaits future research.

V. CONCLUSION

The present study contributes to the literature on the language-music relationship in several ways. The Mandarin tone identification experiment showed that the advantage of musical training in lexical tone perception depends on the nature of the acoustic input. When the amount of tonal information was reduced, the identification performance difference between the musicians and nonmusicians also became smaller. Furthermore, the acoustic modifications revealed tone-specific effects of reduction; namely, tone 2 and tone 4 were minimally compromised even when the majority of the syllable center was absent. Finally, by administering both the Mandarin tone task and the absolute pitch task to the same musicians, we attempted to evaluate the association between the two tasks and thus the hypothesis that absolute pitch is implicated in lexical tone perception. Although our data could not fully evaluate the correlations between the two tasks because of the absence of absolute pitch in the musicians, the musicians clearly outperformed the nonmusicians in Mandarin tone identification, suggesting that abilities other than absolute pitch contributed to the musicians' superior performance. Future work could further explore the basis of this advantage.

ACKNOWLEDGMENT

We are grateful to Diana Deutsch and an anonymous reviewer for their valuable comments. We also thank the School of Music at Ohio University for making the Kurzweil K2000 synthesizer available for stimulus construction and Fuh-Cherng Jeng for synthesizing the pure tone stimuli. We thank Sarah Letsky for assistance in administering the experiments; Ning Zhou, En Ye, Anne Marie Christy, and Gayatri Ram for assistance in data processing; and Li Xu and Z. S. Bond for discussions. This research was partially supported by professional development funds from the School of Hearing, Speech and Language Sciences and the Honors Tutorial College at Ohio University.

Alexander, J., Wong, P. C. M., and Bradlow, A. (2005). "Lexical tone perception in musicians and nonmusicians," Proceedings of Interspeech 2005–Eurospeech–Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal.
Anvari, S. H., Trainor, L. J., Woodside, J., and Levy, B. A. (2002). "Relations among musical skills, phonological processing, and early reading ability in preschool children," J. Exp. Child Psychol. 83, 111–130.
Blicher, D. L., Diehl, R. L., and Cohen, L. B. (1990). "Effects of syllable duration on the perception of the Mandarin tone 2/tone 3 distinction: Evidence of auditory enhancement," J. Phonetics 18, 37–49.
Burnham, D., Peretz, I., Stevens, K., Jones, C., Schwanhäusser, B., Tsukada, K., and Bollwerk, S. (2004). "Do tone language speakers have perfect pitch?," Proceedings of the Eighth International Conference on Music Perception and Cognition, Evanston, IL, edited by S. D. Lipscomb, R. Ashley, R. O. Gjerdingen, and P. Webster (Causal Productions, Adelaide, Australia), p. 350.
Deutsch, D. (1991). "The tritone paradox: An influence of language on music perception," Music Percept. 8, 335–347.
Deutsch, D. (2006). "The enigma of absolute pitch," Acoustics Today 2, 11–19.
Deutsch, D., Henthorn, T., and Dolson, M. (2004a). "Absolute pitch, speech, and tone language: Some experiments and a proposed framework," Music Percept. 21, 339–356.
Deutsch, D., Henthorn, T., and Dolson, M. (2004b). "Speech patterns heard early in life influence later perception of the tritone paradox," Music Percept. 21, 357–372.
Deutsch, D., Henthorn, T., Marvin, E., and Xu, H. (2006). "Absolute pitch among American and Chinese conservatory students: Prevalence differences, and evidence for a speech-related critical period," J. Acoust. Soc. Am. 119, 719–722.
Deutsch, D., North, T., and Ray, L. (1990). "The tritone paradox: Correlate with the listener's vocal range for speech," Music Percept. 7, 371–384.
Dolson, M. (1994). "The pitch of speech as a function of linguistic community," Music Percept. 11, 321–331.
Gottfried, T. L. (2007). in Language Experience in Second Language Speech Learning, edited by O.-S. Bohn and M. J. Munro (John Benjamins, Amsterdam), pp. 221–237.
Gottfried, T. L., and Riester, D. (2000). "Relation of pitch glide perception and Mandarin tone identification," J. Acoust. Soc. Am. 108, 2604.
Gottfried, T. L., Staby, A. M., and Ziemer, C. J. (2001). "Musical experience and Mandarin tone discrimination and imitation," J. Acoust. Soc. Am. 115, 2545.
Gottfried, T. L., and Suiter, T. L. (1997). "Effects of linguistic experience on the identification of Mandarin Chinese vowels and tones," J. Phonetics 25, 207–231.
Greenberg, S., and Zee, E. (1979). "On the perception of contour tones," UCLA Working Papers in Phonetics 45, 150–164 (http://repositories.cdlib.org/uclaling/wpp/No45/).
Honorof, D. N., and Whalen, D. H. (2005). "Perception of pitch location within a speaker's F0 range," J. Acoust. Soc. Am. 117, 2193–2200.
Johnson, K. A. (2005). in The Handbook of Speech Perception, edited by D. B. Pisoni and R. E. Remez (Blackwell, Malden, MA), pp. 363–389.
Ladefoged, P. (2003). Phonetic Data Analysis: An Introduction to Fieldwork and Instrumental Techniques (Blackwell, Malden, MA).
Leather, J. (1983). "Speaker normalization in perception of lexical tone," J. Phonetics 11, 373–382.
Lee, C.-Y., Tao, L., and Bond, Z. S. (2006). "Native and non-native identification of acoustically modified Mandarin tones," J. Acoust. Soc. Am. 120, 3175.
Lee, C.-Y., Tao, L., and Bond, Z. S. (2008a). "Identification of acoustically modified Mandarin tones by native listeners," J. Phonetics (in press).
Lee, C.-Y., Tao, L., and Bond, Z. S. (2008b). "Speaker variability and context in the identification of fragmented Mandarin tones by native and non-native listeners," J. Phonetics (in press).
Lin, T., and Wang, W. S.-Y. (1984). "Shengdiao ganzhi wenti (The issue of tone perception)," Zhongguo Yuyan Xuebao (Bull. Chin. Linguist.) 2, 59–69.
Liu, S., and Samuel, A. G. (2004). "Perception of Mandarin lexical tones when F0 information is neutralized," Lang. Speech 47, 109–138.
Lockhead, G. R., and Byrd, R. (1981). "Practically perfect pitch," J. Acoust. Soc. Am. 70, 387–389.
Mertus, J. A. (2000). The Brown Lab Interactive Speech System (Brown University), http://www.mertus.org/Bliss/index.html. Last viewed 9/29/08.
Moore, C. B., and Jongman, A. (1997). "Speaker normalization in the perception of Mandarin Chinese tones," J. Acoust. Soc. Am. 102, 1864–1877.
Ohio University School of Music Undergraduate Handbook (2007–2008). Available at http://www.finearts.ohio.edu/music/gfx/media/pdf/undergradhandbook2007.pdf. Last viewed 8/29/08.
Patel, A. D. (2008). Music, Language, and the Brain (Oxford University Press, New York).
Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carrell, T. D. (1981). "Speech perception without traditional speech cues," Science 212, 947–950.
Saffran, J. R., and Griepentrog, G. J. (2001). "Absolute pitch in infant auditory learning: Evidence for developmental reorganization," Dev. Psychol. 37, 74–85.
Slevc, R., and Miyake, A. (2006). "Individual differences in second-language proficiency: Does musical ability matter?" Psychol. Sci. 17, 675–681.
Strange, W., Jenkins, J. J., and Johnson, T. L. (1983). "Dynamic specification of coarticulated vowels," J. Acoust. Soc. Am. 74, 695–705.
Tseng, C.-Y. (1981). "An Acoustic Phonetic Study on Tones in Mandarin Chinese," Ph.D. thesis, Brown University.
Whalen, D. H., and Xu, Y. (1992). "Information for Mandarin tones in the amplitude contour and in brief segments," Phonetica 49, 25–47.
Wong, P. C. M., and Diehl, R. L. (2003). "Perceptual normalization for inter- and intra-talker variation in Cantonese level tones," J. Speech Lang. Hear. Res. 46, 413–421.
Wong, P. C. M., Skoe, E., Russo, N. M., Dees, T., and Kraus, N. (2007). "Musical experience shapes human brainstem encoding of linguistic pitch patterns," Nat. Neurosci. 10, 420–422.
Xu, L., Tsai, Y., and Pfingst, B. E. (2002). "Features of stimulation affecting tonal-speech perception: Implications for cochlear prostheses," J. Acoust. Soc. Am. 112, 247–258.
Zhou, N., Zhang, W., Lee, C.-Y., and Xu, L. (2008). "Lexical tone recognition with an artificial neural network," Ear Hear. 29, 326–335.