Improving Intelligibility of Non-Native Speech with Computer-Assisted Phonological Training

Deborah F. Burleson, Department of Linguistics, Indiana University & School of Medicine, IUPUI
Abstract

Five native speakers of Mandarin Chinese who speak English as a second language, plus a control subject, participated in a 15-hour training regime practicing pronunciation of six phonemic contrasts. Subjects were recorded reading a randomly ordered list of isolated English words drawn from minimal pairs representing the six contrasts; the list included both words that would appear in the training and words that would be absent from it. The training was computer-based and did not involve corrective responses from an instructor. The trainees spoke into a microphone, providing input to a computerized speech recognizer that evaluated the pronunciation and provided feedback to the trainee. Following the training period, all six speakers were again recorded reading the wordlist. Native speakers of English were asked to identify each token as one or the other member of a minimal pair in a forced-choice task. Correct identifications of pre-training tokens were at chance level, regardless of speaker, contrast, or whether the word had been included in or excluded from the training program. Correct identifications of post-training tokens averaged 89% over all speakers and contrasts, again regardless of a word's inclusion in or exclusion from the training program. The results indicate that production of segments in isolated words can be trained effectively via computer-based administration and that such training generalizes to the same phonological contrast in non-trained words.
1 Introduction

1.1 L1 Influence on L2 Production Errors

Though the degree of influence of a speaker's first language (L1) on his production and perception in a second language (L2) is debated, it is generally accepted that foreign-accented speech productions reflect the differences between an individual's native and target languages in the areas of phonemic inventories, allophonic variations and phonotactic constraints. The obvious difficulties of producing a phoneme not found in the native inventory – an English speaker struggling with Zulu click consonants, for instance – determine some errors. Other L1 influences are more complex. Flege (1987) reports that English speakers learning French fail to acoustically produce the French /u/ because their L1 inventory contains a phonetically similar (though different) vowel, the English /u/.
A still subtler source of errors in second language pronunciation is found at the level of phonetic detail, as illustrated in Rochet's (1995) investigation of English and Portuguese speakers' productions of the target French vowel /y/. French has three distinct high vowels (/i/, /y/, and /u/), while English and Portuguese each have only two (/i/ and /u/). Yet English speakers are reported to produce the French /y/ as /i/, while Portuguese speakers produce it as /u/. Results of Rochet's imitation and perception experiments suggest that it is the L2 learners' perceptions of the boundary locations separating adjacent categories that motivate their accented pronunciations of L2 sounds. In summary, pronunciation errors in English speech productions will differ characteristically according to speakers' native languages. Such L1-conditioned errors can be anticipated, both by performing phonemic contrastive analyses of target and source languages and by conducting error analysis investigations of L1 speakers producing L2 speech.

1.2 L2 Errors' Influence on Intelligibility

Research into the effect of pronunciation errors on intelligibility has not reached unanimous conclusions. Speech and hearing science studies have reported negative correlations between the frequency of segmental errors and speech intelligibility. Smith (1975) reported a correlation of -.80 in her examination of the effect of deaf children's segmental errors on intelligibility. Smith notes that there is a fair amount of dispersion within the data and suggests that it may be due in part to suprasegmental errors (p. 805). Examining the relationship between the sentence intelligibility of deaf speakers and acoustic properties at the segmental and suprasegmental level, Monsen (1978) found that the majority of variance in sentence intelligibility was related to variation at the segmental level.
He, too, however, qualifies the finding with the observation that the significance of some prosodic measurements is unclear and that those variables are not as easily measured as the segmental variables under examination. In the field of linguistics, research into the effect of segmental errors made by L2 speakers of English has provided results that point to a strong relationship between segmental errors and degraded intelligibility. Rogers and Dalby (1996) performed multiple regression analyses of the results of a listening task in which native English speakers were asked to (1) identify, in a forced-choice task, word members of minimal pairs recorded by non-native speakers (NNSs) and (2) transcribe the content of semantically unrelated sentences recorded by NNSs. To control for frequency of phoneme occurrence, phoneme frequency within the sentences was matched to overall phoneme frequency in English. A moderately strong correlation was shown between word identification within minimal pairs and sentence intelligibility, and an even stronger relationship was evident when the phoneme-contrast variables were grouped by distinctive feature category and correlated to sentence intelligibility. Other work identifies the need to specify how intelligibility is defined and evaluated and to distinguish it from the closely related measures of comprehensibility
and perceived accent. Munro and Derwing (2000) looked for correlations between intelligibility (measured by written transcription of NNS utterances), comprehensibility (measured by listener ratings on a Likert-type scale) and perceived foreign accent (also measured by listener ratings). On a listener-by-listener basis, comprehensibility and intelligibility were correlated with each other far more frequently than either measure was with perceived accent. Utterances could be accurately transcribed and judged highly comprehensible while still being rated as strongly foreign-accented, an important implication for the field of pronunciation assessment. The distinctions among the measures of intelligibility, comprehensibility and perceived accent are also important because they bear on the interpretation of study results. The current study begins with the assumption that predictable segmental errors undermine the success of communication in NNS utterances. A spoken member of a minimal pair may be heard clearly (comprehended) and accurately transcribed as it was spoken (intelligible) and still render an unintended message, e.g., as in "I like the smell of the beach/peach." It is not the purpose of this study to address to what varying degrees and with what overlap these predictable segmental errors contribute to the related measures of intelligibility, comprehensibility and accent. Similarly, as demonstrated by Gass and Varonis (1984), native speakers in comprehension tasks use pronunciation fluency to compensate for their lack of familiarity with discourse topic, and find comprehensibility impeded when listening to non-native speech. This further underlines the importance of accurate segmental production in conversations between native and non-native speakers, where novel topic content may be expected. The effects examined within this paper focus on intelligibility, defined here as the degree to which a speaker's words are correctly identified as spoken.
The method of measuring intelligibility in this study does not evaluate above the word level and thus cannot capture degrees of error gravity that might emerge in sentence- or paragraph-level analyses, where the importance of individual words varies by type.

1.3 L2 Errors' Effect on Listener Processing Time

Productions by NNSs may take a toll on listeners in another way. Pisoni, Nusbaum and Greene (1985), measuring response latencies in classification tasks, demonstrated that extra cognitive effort was required to process intelligible, but synthetic, speech. Speech researchers have carried this issue into the study of foreign-accented speech. Munro and Derwing (1995) hypothesized that "the time required for recognition of accented consonant and vowel segments may be greater if those segments differ considerably from category prototypes….Increased processing time may also result from a lack of comprehension or miscomprehension of lexical items, which might necessitate special top-down processing" (p. 290).
In their investigation they recorded the response latencies of listeners assessing the truth value of utterances spoken by both NSs and NNSs. Using only the data from utterances that had been correctly transcribed and judged truthful or not, and allowing for the effect of variable speaking rate, their findings support the premise that listening to non-native productions carries processing costs. Because they also asked their listeners to rate each utterance on comprehensibility and on degree of accent, they were able to note a correlation between lowered comprehensibility and increased processing time, while observing no correlation between strength of perceived accent and increased response time. The fact that listeners made more outright errors in assigning truth values to and transcribing the speech of NNSs adds a second cost factor to the communication process in their investigation. In a related study, Schmid and Yeni-Komshian (1999) asked listeners to detect and react to words containing mispronunciations as spoken by both native and non-native speakers within the context of connected speech. Listeners spent significantly more time reacting to and correctly identifying mispronunciations by NNSs than by NSs, suggesting that "listeners may have had to expend more effort to decode these messages (NNS productions) than the ones produced by native speakers" (p. 62).

1.4 Remediation of L2 Segmental Errors and Resulting Improved Intelligibility

Professionals who teach English as a second language will admit to the difficulty of obtaining positive results when training students in the production of new phonemic contrasts. Research has sometimes supported this reservation. Flege, Munro and Skelton (1992) reported on the production of the word-final English /t/-/d/ contrast by Spanish and Mandarin speakers.
The goal of that study was to determine whether speakers whose first language does not include a word-final /t/-/d/ contrast could, with sufficient native-speaker input, produce this contrast in English. Their results reported that native English-speaking listeners identified the stop productions of experienced L2 learners at rates significantly below those reported when identifying stops produced by native English speakers. Furthermore, the productions of experienced L2 learners were not significantly better identified than those of inexperienced L2 learners. Possible explanations include both the amount and the acoustic quality of native-speaker phonetic input for these learners. Yet research exists which indicates that L2 phonemic errors can be remediated with training. Using visual feedback in the form of spectrogram presentations of model productions of the American English /r/ (in contexts of word-initial /r/, vowel + /r/ and /r/ within clusters), Muraka and Lambacher (2000) trained native Japanese speakers to match their own spectrographic patterns to native-speaker models. Production, measured by comparing pre- and post-training achievement of target ranges for formant frequencies, improved significantly in each of the contexts. In another use of visualization training methodology, Dowd, Smith and Wolfe (1997) used a real-time display of the first two frequencies of vocal tract resonances to successfully teach the production of non-nasalized French vowels to native English speakers inexperienced with
speaking any foreign language. Successful production here was determined by identification of spoken segments by a listening jury of native French speakers. Evidence exists that improved segmental productions contribute positively to improved intelligibility. In 1994, Rogers, Dalby and DeVane, using an early version of the same computer-assisted pronunciation training system evaluated in the current study, provided both traditional auditory feedback from a speech-language pathologist and automatic word-recognition evaluative feedback to native Mandarin Chinese speakers practicing producing English minimal pairs. Their subjects' pre- to post-training intelligibility scores, as judged by a jury of native listeners, improved in both vowel and consonant productions and on both trained and untrained words containing the same targeted contrasts. That segmental correction can improve intelligibility was also a conclusion reached in research on deaf speech. Maassen and Povel (1985), using digital signal processing techniques, replaced segmental qualities in the speech of deaf Dutch children with those of hearing speakers, raising intelligibility rates from 24% to 72%. An encouraging implication for intelligibility improvement beyond the level of isolated words has also been documented. Rogers (1997) found a strong relationship between segmental errors and connected-speech intelligibility for Chinese-accented English and concluded that some error types were more strongly linked to intelligibility than others; her results showed that vowel contrasts are more strongly associated with intelligibility than consonant contrasts. Though less stringently limited to segmental training, other researchers have reported increased intelligibility following pronunciation training. Ferrier (1991) describes an instructional program administered to foreign teaching assistants at Northeastern University in Boston.
Following a weekly program of instruction which heavily emphasized training on individual error sounds (determined by diagnostic assessment), a student survey on comprehensibility indicated that 83% of the students served by the trained foreign teaching assistants noted improvement in their instructor's intelligibility. Similarly, Perlmutter (1989) reports improvements on both a scalar ranking of intelligibility and on accurate identification of discourse topic following instruction given to international teaching assistants in both segmental and suprasegmental production.

1.5 Need for an Effective Teacher-Independent Trainer

US population and national education statistics provide evidence of the importance of effective and affordable tools in English language instruction, including pronunciation training, for non-native speakers. In the past 20 years the number of individuals aged 5 to 24 who speak a language other than English at home increased from 6.3 million in 1979 to 13.7 million in 1999. The number of immigrants granted legal permanent resident status in the US in 2001 and 2002 averaged 1,064,025, an increase of 42% over the same measure for the years 1999 and 2000. Visiting and foreign students are not included in these numbers.
In her discussion of the use of multimedia learning aids in teaching pronunciation, Celce-Murcia (1996) notes their advantages, including "1) access to a wide variety of NS speech samplings, 2) sheltered practice sessions in which the learner can take risks without stress and fear of error, 3) opportunity for self-pacing and self-monitoring of progress, 4) one on one contact without a teacher's constant supervision…" (p. 313). A system like the one investigated in the current study, if shown to be a successful method of improving the intelligibility of segmental productions, would have the potential to benefit both learners and instructors by providing these very advantages.

1.6 The Current Study

A commercially available computer-based pronunciation training product developed by speech science specialists and linguists was selected to provide production training to five native speakers of Mandarin Chinese who speak English as a second language. The product, commercially known as HearSay for Mandarin Chinese Speakers and produced by Communication Disorders Technology, Inc., Bloomington, Indiana (Dalby, Kewley-Port and Sillings 1998), bases its content material on an error analysis of the speech of Mandarin Chinese speakers using English as a second language (Dalby and Kewley-Port 1999). This error analysis yielded a set of 80 segmental contrasts. Each contrast offers a list of minimal pairs made up of target words paired with words representing the expected error. For example, the pair "pitch/peach" represents a vowel error common among Mandarin-speaking learners of English; an English consonant error common to native Mandarin speakers is represented by the pair "dog/dock." This experiment identified six American English contrasts erroneously produced by the subject pool (native Mandarin Chinese speakers) and administered a production training program.
Recordings of target words (half to be included in the training program, half to be retained for generalization testing) containing these contrasts were made by each subject before and after training and subjected to forced-choice discrimination testing by native English listeners. This study tested two hypotheses: first, that second language learners can improve and generalize segmental intelligibility through minimal-pairs production training, and second, that this training can be conducted effectively using a computer-based system in which feedback is provided by automatic speech recognition technology without instructor interaction.

2 Methods

2.1 Subjects

Two female and four male native speakers of Mandarin Chinese between the ages of 28 and 43, all reporting normal speech and hearing, were selected as subjects from a pool of 21 applicants who had responded to the investigator's e-mail call for subjects. Time in residence in the US ranged from four months for two subjects to eight
years for two subjects (Table 1). Subject selection followed oral interviews designed to screen for common error patterns to ensure that the same six contrasts could be examined across subjects. One subject served as a control and was recorded before training commenced and after training concluded but received no training. Four of the subjects were graduate students in either science or business at Indiana University in Bloomington, Indiana. The remaining two subjects were professionals working within the Bloomington community.

Table 1. Subject information. Subject Backgrounds: Experimental Subjects S1-S5 and Control

Subject   Age/Gender   Time in US   Native Lang   TOEFL Score
S1        33 yrs - F   3 mos        Mandarin      623
S2        37 yrs - F   132 mos      Mandarin      550
S3        33 yrs - M   22 mos       Mandarin      NA
S4        28 yrs - M   14 mos       Mandarin      613
S5        43 yrs - M   132 mos      Mandarin      575
Control   30 yrs - M   3 mos        Mandarin      620
2.2 Pre-Training Recordings

During the week before training began, each subject (including the control subject) was recorded using a Sony TCD-D7 Portable Digital Audio Tape Recorder and a Shure 515SDX table-mounted microphone, reading the same list of randomly ordered target words taken from six American English contrasts. Subjects reviewed all words prior to recording to ensure that pronunciation did not reflect unfamiliarity with any word; where necessary, word pronunciation was modeled and word meaning was defined by the experimenter to the subject's satisfaction. The 864 tokens (6 training words plus 6 generalization words, x 6 contrasts x 6 subjects x 2 repetitions) were digitized to 16-bit, 22,050 Hz monophonic samples using Sonic Foundry's SoundForge XP 4.0 for Windows and hand-edited so that each word was preceded and followed by 200 ms of silence. These digitized waveforms comprised the stimulus set presented to native English listeners as described below. The words forming the recording list are shown in Table 2, which lists the six contrasts trained within target words and the expected error for each target word, as well as the words excluded from training to be used in generalization tests. The contrasts trained in this study included four word-final voicing contrasts (/b#/ - /p#/, /d#/ - /t#/, /g#/ - /k#/, and /z#/ - /s#/), one place contrast (/n#/ - /ŋ#/), and one lax/tense vowel contrast (/I/ - /i/).
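The total of 864 tokens per recording session follows directly from the design factors; a quick arithmetic check (variable names here are illustrative, not from the study):

```python
# Sanity check of the recording-token arithmetic described above:
# (6 training + 6 generalization words) per contrast, 6 contrasts,
# 6 subjects (5 trainees plus the control), 2 repetitions each.
words_per_contrast = 6 + 6
contrasts = 6
subjects = 6
repetitions = 2

tokens = words_per_contrast * contrasts * subjects * repetitions
print(tokens)  # 864
```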
2.3 Training

After the recordings were made, all subjects except the control began a training schedule consisting of three one-hour sessions on HearSay for Mandarin Chinese Speakers each week (Monday, Wednesday, Friday) for five consecutive weeks. Subject training times did not overlap, and subjects did not encounter or interact with each other. In each session subjects trained for nine minutes on each of the six contrasts, following the same order from subject to subject and from session to session. Though the HearSay training product was not modified, the experiment instructions limited subjects to training on the six contrasts and required them to perform their practice drills in a uniform fashion, as described below. Wearing a combination earphone/microphone headset, a trainee logged into HearSay and opened the first training contrast identified in this experiment (target /b#/, expected error /p#/). This action accessed a list of target words containing /b/ in word-final position. When subjects clicked to select the first word in the list, 'robe,' for example, HearSay played a native-speaker audio model of 'robe,' randomly selected from a set of four digitized auditory models. After listening to the model pronunciation, subjects pressed the keyboard's spacebar to ready the speech recognizer to evaluate an upcoming utterance by loading templates of both the target word 'robe' and the expected error word 'rope.' If the utterance was recognized more closely as the target word, HearSay returned a chime sound accompanied by a text message reading, "Yes. The computer heard robe." If the utterance was recognized more closely as the expected error word, HearSay returned a buzzer sound accompanied by a text message reading "No.
The computer heard rope." To ensure consistent training time across subjects, the experimenter sat with each subject and gave start and stop times for each contrast, allowing a total of nine minutes per contrast. Subjects were informed that the experimenter would not provide instruction or correction. When subjects' pronunciations of the target word were evaluated as incorrect by the system, they were allowed to listen to the computer's audio models and retry their pronunciation with speech-recognition feedback as many times as they wished before moving to the next word in the contrast list. The number of words in each contrast's wordlist varied because HearSay's training curriculum weights contrasts to reflect their importance to overall English intelligibility. The shortest lists contained 11 word pairs (the b/p and g/k contrasts) while the longest contained 36 word pairs (the d/t contrast). Regardless of the number of words per contrast, nine minutes allowed sufficient time for subjects to practice all words in each contrast; when subjects finished the list before nine minutes elapsed, they began again from the top of the list for that contrast.
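The forced-choice feedback cycle described above can be sketched as follows. This is an illustrative reconstruction only, not HearSay's actual implementation: `acoustic_score` is a toy stand-in for the recognizer's template match, and the template strings are made up; only the message wording mirrors the text.

```python
# Sketch of a two-alternative forced-choice recognizer: score the
# utterance against the target and expected-error templates and
# report whichever matches more closely.

def acoustic_score(utterance, template):
    """Hypothetical similarity measure over feature strings
    (higher = closer match); a real system would compare
    acoustic feature sequences, not characters."""
    return sum(1 for a, b in zip(utterance, template) if a == b)

def give_feedback(utterance, target_word, error_word, templates):
    target_score = acoustic_score(utterance, templates[target_word])
    error_score = acoustic_score(utterance, templates[error_word])
    if target_score >= error_score:
        return f"Yes. The computer heard {target_word}."
    return f"No. The computer heard {error_word}."

# Toy usage with invented 'feature' strings for robe/rope:
templates = {"robe": "roUb", "rope": "roUp"}
print(give_feedback("roUb", "robe", "rope", templates))
# -> Yes. The computer heard robe.
```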
Contrast   Training Words (Target / Expected Error)                              Generalization Words (Target / Expected Error)
b# / p#    cob/cop, tab/tap, robe/rope, swab/swap, lab/lap, lob/lop              cab/cap, nib/nip, mob/mop, lobe/lope, rib/rip, gab/gap
d# / t#    wade/wait, toad/tote, bead/beat, cad/cat, card/cart, pad/pat          bed/bet, rode/wrote, bid/bit, glowed/gloat, code/coat, mad/mat
z# / s#    peas/peace, perches/purchase, phase/face, prize/price, spies/spice,   bays/base, lazy/lacy, knees/niece, pays/pace, rays/race, lies/lice
           graze/grace
n# / ŋ#    ran/rang, run/rung, gone/gong, kin/king, coffin/coughing, lawn/long   fan/fang, sun/sung, prawn/prong, pin/ping, sinner/singer, tan/tang
I / i      candid/candied, fills/feels, hid/heed, pit/peat, rip/reap, tin/teen   pick/peek, dip/deep, fist/feast, lick/leek, sick/seek, sip/seep
g / k      anger/anchor, angle/ankle, bigger/bicker, bug/buck, dug/duck,         wig/wick, sag/sack, pig/pick, league/leak, hog/hawk, degrees/decrees
           flag/flak

Table 2. Recording Wordlist - By Contrast (Target Word / Expected Error Word)
As subjects reached the criterion accuracy level of 85% on any single contrast, as judged by the HearSay system, for three consecutive training sessions, that contrast was removed from training, except that subjects were required to review the contrast by attempting each member word one time. If accuracy during this review slipped below 85%, the contrast was added back to the training regime until criterion performance was once again reached.

2.4 Post-Training Recordings

Within a week following the five-week training period, each subject (including the control) was again recorded using the procedures and wordlist of the pre-training recordings. The list included training words as well as generalization words (Table 2). The resulting 864 tokens were digitized following the procedure previously described.

2.5 Identification by Native Listeners

Five native speakers of American English (three females and two males) reporting normal speech and hearing were selected to serve as a native-speaker listening jury. All were undergraduates at Indiana University, ranging in age from 20 to 29, with no training in phonetics or any special knowledge of linguistic theory. None reported bias regarding non-native pronunciation or members of populations outside the US. Jurors heard stimuli as wavefiles played via a computer and presented binaurally through headphones during the listening task. Since the full set of 1,728 wavefiles (864 pre-training tokens + 864 post-training tokens) was too large for a single sitting, a subset of 432 word tokens with equal representation across conditions was selected for the listening task. Three words were selected from each talker for each contrast for each set of training and generalization targets: for talker 1, training and generalization words 1-3 were selected; for talker 2, words 2-4; and so forth, as shown in Table 3 below.

Talker   Words
1        1, 2, 3
2        2, 3, 4
3        3, 4, 5
4        4, 5, 6
5        5, 6, 1
6        6, 1, 2

Table 3. Words selected for listener judgment.

To test for intrajuror reliability, 36 pre-training tokens and 36 post-training tokens were selected equally across conditions to be repeated within the listening task, resulting in a total of 504 tokens to be judged by each juror. Listeners were not told that they were judging word tokens from non-native speakers.
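The rotation in Table 3 is a simple cyclic assignment and can be reproduced with modular arithmetic; the function name below is illustrative, not from the study:

```python
# Generate the Table 3 rotation: talker t is assigned words
# t, t+1, t+2, wrapping around after word 6, so every word is
# judged equally often across talkers.

def words_for_talker(talker, n_words=6, per_talker=3):
    """Return the 1-based word indices selected for a 1-based talker."""
    return [((talker - 1 + k) % n_words) + 1 for k in range(per_talker)]

for t in range(1, 7):
    print(t, words_for_talker(t))
# Talker 5 -> [5, 6, 1] and talker 6 -> [6, 1, 2], matching Table 3.
```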
After being given instructions and a short training module, listeners initiated a program which presented them with a screen showing both members of a minimal pair (randomly presented in target/error or error/target order). Presentation of pairs was randomized across contrasts, speakers, and training vs. generalization words. When a word pair appeared on the screen, listeners mouse-clicked a "GO" symbol in order to hear a token. After hearing the token, they selected which of the two words on the screen they believed the token to have been. The contrast being judged was highlighted in both members of the word pair in order to focus listeners' judgments when a recording might not sound like either of the choices (e.g., robe…rope). Following identification of each presented pair, listeners were prompted to mouse-click when ready to make the next judgment, and the process was repeated for all tokens. The listening task was presented in three sections with a 7-minute break between sections and lasted approximately 45 minutes.

3 Results

3.1 Overall Effects of Training

Intrajuror reliability ranged from 81% to 88% with a mean of 83.1%, and listeners were roughly equally reliable from juror to juror. At least four out of five listeners made identical decisions in 81% of the judgments. The effect of the intelligibility training is shown in Figure 1. The training effect was quite large, with pre/post identification scores across speakers and contrasts at 46% and 87%, respectively, for words included in the training regime and, similarly, 55% and 93% for words excluded from the training program but containing the same contrasts. A repeated measures ANOVA was run with pre- vs. post-training time as the within-subjects factor and word set (training words vs. words excluded from training, i.e., generalization words) as the between-subjects factor. The dependent variable was the percentage of words correctly identified pre- and post-training by NS jurors.
The effect of training was significant (F(1,58) = 82.619, p < .001), while the interaction of training with word set (training words vs. generalization words) was not (F(1,58) = .051, p = .822), indicating that the trainees made similar improvement on words which had been excluded from training but contained the trained phonological contrasts. Scores for the control subject did not differ over the course of the five weeks, with both pre- and post-test responses at chance levels, demonstrating that the changes in the trainees' intelligibility scores were not simply due to the extra five weeks of U.S. residency that elapsed between the pre- and post-test recordings.
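The interjuror agreement statistic (at least four of five jurors giving the same answer) is straightforward to compute from a token-by-juror judgment matrix. The sketch below uses made-up judgments, not the study's data:

```python
# Compute the share of tokens on which at least `threshold` of the
# jurors made the same identification (4 of 5 in the study).

def majority_agreement(judgments, threshold=4):
    """judgments: list of per-token lists of juror word choices."""
    agreed = 0
    for token_choices in judgments:
        top = max(token_choices.count(c) for c in set(token_choices))
        if top >= threshold:
            agreed += 1
    return agreed / len(judgments)

# Hypothetical judgments for 4 tokens by 5 jurors:
data = [
    ["robe"] * 5,                 # unanimous
    ["robe"] * 4 + ["rope"],      # 4 of 5 agree
    ["robe"] * 3 + ["rope"] * 2,  # only 3 agree: excluded
    ["rope"] * 5,                 # unanimous
]
print(majority_agreement(data))  # -> 0.75
```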
[Figure 1: bar chart of the proportion of tokens correctly identified, pre- vs. post-training. Experimental subjects, training words: .46 pre, .87 post; experimental subjects, generalization words: .55 pre, .93 post; control subject, all words: .47 pre, .46 post.]

Figure 1. Pooled Results - Training Effects.
3.2 Results by Subject

Figure 2 displays pre- and post-training identification scores by subject, pooled across all contrasts and across both generalization and trained words. Large training effects are readily apparent for every subject. Pre- to post-training score improvement ranged from 30 percentage points (S2) to 47 points (S4), with a mean improvement of 39 points.
[Figure 2: bar chart of the proportion of tokens correctly identified for subjects S1-S5, pooled across trained vs. generalization words and across contrasts. Pre-training scores range from approximately .44 to .55; post-training scores from approximately .82 to .97.]

Figure 2. Post-training improvement by subject.
Figure 3 shows post-training scores by both trained and generalization categories, demonstrating the similarity between post-training performance in trained and generalization words for all subjects.
[Figure 3: bar chart of the percentage of tokens correctly identified for subjects S1-S5, with separate bars for pre-training, post-training (trained words), and post-training (generalization words).]

Figure 3. Post-training improvement by subject with results reported by trained vs. generalized words.
3.3 Effects by Phonological Contrast

As described in the Methods section, the contrasts trained included a vowel contrast, a nasal place contrast, and four word-final voicing contrasts. Before training, subjects scored considerably better on the vowel and place contrasts than on the voicing contrasts; at the same time, their final accuracy rates for these two contrasts were lower than for the voicing contrasts. Both facts contribute to the much smaller improvements observed for the vowel and place contrasts. To examine these effects in more detail, the following sections examine the performance of individual subjects.
[Figure 4: bar chart of pre- and post-training percentage of tokens correctly identified for each contrast (b#/p#, d#/t#, g/k, I/i, n#/ŋ#, z#/s#), pooled across experimental subjects and across training and generalization words.]

Figure 4. Results by Phonological Contrast (Training and Generalization Words).
3.3.1 Place Contrast (/n/ - /ŋ/) by Subject

Figure 5 shows that four of the subjects entered training with identification performance on /n/ - /ŋ/ of at least 70% correct. The mean improvement on /n/ - /ŋ/ for those four subjects was only 12 points, suggesting that they may have begun training already near their optimal performance; S1 is a partial exception, showing marked improvement over an initial accuracy rate of 76%. The remaining speaker, S2, entered with the lowest pre-test score and improved from 27% to 63% correct identification, making the largest gain from the lowest starting performance.
[Figure: % correctly identified (0–100), pre vs. post, for experimental subjects S1–S5.]
Figure 5. Place Contrast (/n/ - /ŋ/) Results by Speaker – Pooled across Words.
3.3.2 Vowel Contrast (/I/ - /i/) by Subject
Figure 6 offers insight into the observation that the /I/ - /i/ contrast shows less post-training improvement than the other contrast groups. Two subjects (S2 and S3) actually deteriorated in native-listener identification of their productions of words containing the /I/ - /i/ contrast: S2 fell slightly (from 93% pre-training to 84% post-training identification), and S3 dropped from a pre-training identification rate of 77% to a post-training score of only 47%.
[Figure: % correctly identified (0–100), pre vs. post, for experimental subjects S1–S5.]
Figure 6. Tense/Lax Vowel Contrast (/I/ - /i/) Results by Speaker – Pooled across Words.
Though more data are required to confirm it, one possible explanation is that these subjects were influenced by the preponderance of voicing contrasts in the training program. Having learned to attend to segment duration, specifically vowel lengthening before a voiced consonant, they may have carried this strategy into words containing the tense/lax vowel contrast /I/ - /i/ and inappropriately lengthened the lax /I/.
3.3.3 Word Final Voicing Contrasts by Subject
Figure 7 shows that all five subjects were able learners of the four final-segment voicing contrasts. Though actual acoustic measurements would be of interest for any of the segmental contrast areas studied, they may be of special interest here, given the suggestion that word-final voicing improvement affected tense/lax vowel production.
[Figure: % correctly identified (0–100), pre vs. post, for experimental subjects S1–S5.]
Figure 7. Final Voicing Contrasts by Speaker – Pooled across Words.
4 Summary
This study demonstrates that evaluative feedback from a computerized, speech recognition-based pronunciation trainer, administered in a brief, structured training program without teacher interaction, can produce measurable improvement in the intelligibility of American English segments produced in isolated words by Mandarin Chinese speakers of English as a second language. It further demonstrates that this improvement generalizes from trained words to untrained words containing the trained contrasts. That these results were achieved with training feedback from a commercial computer-based system, without teacher interaction, and over the relatively short period of 15 training hours is encouraging for both teachers and students of American English pronunciation, whose opportunities for lengthy one-on-one, face-to-face tutoring are necessarily limited. Valuable next steps would include extending this study to a larger number of subjects, considering a more extensive pool of contrasts, and testing both retention of performance and extension to the intelligibility of running speech.
References
Celce-Murcia, M., Brinton, D., & Goodwin, J. (1996). Teaching Pronunciation: A Reference for Teachers of English to Speakers of Other Languages. Cambridge, UK: Cambridge University Press.
Dalby, J., Kewley-Port, D., & Sillings, R. (1998). Language-specific pronunciation training using the HearSay system. Proceedings of the European Speech Communication Association Conference on Speech Technology in Language Learning, 25-28.
Dalby, J. & Kewley-Port, D. (1999). Explicit pronunciation training using automatic speech recognition technology. In Holland, M. (Ed.), Tutors that Listen: Speech Recognition for Language Learning, special issue of the Journal of the Computer Assisted Language Learning Consortium, 16 (5), 425-445.
Dowd, A., Smith, J., & Wolfe, J. (1997). Learning to pronounce vowel sounds in a foreign language using acoustic measurements of the vocal tract as feedback in real time. Language and Speech, 41 (1), 1-20.
Ferrier, L. J. (1991). Pronunciation training for foreign teaching assistants. ASHA, April 1991, 65-70.
Flege, J. E. (1987). The production of 'new' and 'similar' phones in a foreign language: Evidence for the effect of equivalence classification. Journal of Phonetics, 15, 47-65.
Flege, J. E., Munro, M. J., & Skelton, L. (1992). Production of the word-final /t/-/d/ contrast by native speakers of English, Mandarin, and Spanish. Journal of the Acoustical Society of America, 92 (1), 128-143.
Gass, S. & Varonis, M. (1984). The effect of familiarity on the comprehensibility of nonnative speech. Language Learning, 34 (1), 65-89.
Maassen, H. S. & Povel, D. J. (1985). The effect of segmental and suprasegmental corrections on the intelligibility of deaf speech. Journal of the Acoustical Society of America, 78, 877-886.
Monsen, R. B. (1978). Toward measuring how well hearing impaired children speak. Journal of Speech and Hearing Research, 21, 197-219.
Munro, M. & Derwing, T. (1995). Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 45 (1), 73-97.
Munro, M. & Derwing, T. (1995). Processing time, accent and comprehensibility. Language and Speech, 38, 289-306.
Munro, M. J. & Derwing, T. M. (1999). Foreign accent, comprehensibility and intelligibility in the speech of second language learners. Language Learning, Supplement 1, 49 (4), 285-310.
Muraka, H. & Lambacher, S. (2000). Improving Japanese pronunciation of AE [r] using electronic visual feedback. Unpublished manuscript, University of Aizu, Japan.
Perlmutter, M. (1989). Intelligibility rating of L2 speech pre- and post-intervention. Perceptual and Motor Skills, 68, 515-521.
Pisoni, D. B., Nusbaum, H., & Greene, B. (1985). Perception of synthetic speech generated by rule. Proceedings of the IEEE, 73, 1665-1676.
Rochet, B. L. (1995). Perception and production of second-language speech sounds by adults. In W. Strange (Ed.), Speech Perception and Linguistic Experience. Timonium, MD: York Press.
Rogers, C. (1997). Intelligibility of Chinese-Accented English. Unpublished doctoral dissertation, Indiana University, Bloomington, Indiana.
Rogers, C. & Dalby, J. (1996). Prediction of foreign-accented speech intelligibility from segmental contrast measures. Journal of the Acoustical Society of America, 100 (4) Pt. 2, 2725 (A).
Rogers, C., Dalby, J., & DeVane, G. (1994). Intelligibility training for foreign-accented speech: A preliminary study. Journal of the Acoustical Society of America, 96 (5) Pt. 2, 3348 (A).
Schmid, P. M. & Yeni-Komschian, G. H. (1999). The effects of speaker accent and target predictability on perception of mispronunciations. Journal of Speech, Language & Hearing Research, 42 (1), 56-64.
Smith, C. R. (1975). Residual hearing and speech production in deaf children. Journal of Speech and Hearing Research, 18, 795-811.
US Dept. of Education, National Center for Education Statistics (2003). The Condition of Education 2003 (NCES 2003-067). Washington, DC: US Government Printing Office.
US Dept. of Homeland Security, Bureau of Citizenship and Immigration Services, US Citizenship and Immigration Services/Public Affairs. News releases: 1/18/02, 8/30/02, and 7/14/03.