Computer Assisted Language Learning 2004, Vol. 17, Nos. 3–4, pp. 357–376
Computer Assisted Pronunciation Training for Native English Speakers Learning Japanese Pitch and Durational Contrasts
Yukari Hirata
Colgate University, Hamilton, NY, USA
ABSTRACT This study assessed the efficacy of a pronunciation training program that provided fundamental frequency contours as visual feedback to native English speakers acquiring Japanese pitch and durational contrasts. Native English speakers who had previously studied Japanese for 1–5 years in the United States participated in training using Kay Elemetrics’ CSL-Pitch Program. The training materials were words, phrases, and sentences that contained Japanese pitch and durational contrasts. During training the subjects practiced matching the fundamental frequency contours of Japanese-native models shown on a computer screen. The subjects’ ability to produce and to perceive novel Japanese words was tested in two contexts, that is, words in isolation and words in sentences, before and after training. Their ability was compared to that of a native English control group that had previously studied Japanese for 1–5 years but did not participate in the training. The trained subjects improved significantly for words in sentences as well as for words in isolation. Also, the trained subjects improved significantly in perception as well as in production. These findings suggest that the pronunciation training program developed and used here was effective in improving the ability of native English speakers to produce and perceive Japanese pitch and durational contrasts.
Address correspondence to: Yukari Hirata, Department of East Asian Languages and Literatures, Colgate University, 13 Oak Drive, Hamilton, NY 13346, USA. E-mail: yhirata@mail.colgate.edu
10.1080/0958822042000319629 $16.00 © Taylor & Francis Group Ltd.

1. INTRODUCTION
This study assessed the efficacy of a pronunciation training program that provided visual feedback for second language (L2) learners. Its aim was to enable native English speakers to accurately pronounce Japanese words contrasting in duration and in pitch. Native English speakers’ difficulty in using duration and pitch to produce and perceive Japanese phonemic distinctions has been noted in various studies (Han, 1992; Landahl & Ziolkowski, 1995; Landahl, Ziolkowski, Usami, & Tunnock, 1992; Minagawa & Kiritani, 1996; Oguma, 2000; Yamada, Yamada, & Strange, 1995). For example, native English speakers have difficulty producing and perceiving the duration contrast between /kóto/, with two short vowels, meaning ‘Japanese musical string instrument,’ and /kó:to/, with one long and one short vowel, meaning ‘coat.’ Native English speakers also have difficulty distinguishing the pitch contrast between /káta/, with high and low tones, meaning ‘shoulder,’ and /katá/, with low and high tones, meaning ‘form.’ In order to produce a Japanese word accurately, learners of Japanese must simultaneously pay attention not only to its pitch pattern but also to the duration of each segment. For example, the four segments, kata, can mean six different things: /káta/ ‘shoulder,’ /katá/ ‘form,’ /kát:a/ ‘won’ (with a geminate (long) consonant and high-low tones), /kat:a/ ‘bought’ (with a geminate (long) consonant and low-high tones), /ká:ta:/ ‘Carter (English name)’ (with two long vowels), and /kát:a:/ ‘box-cutter’ (with a geminate (long) consonant and a long vowel).

Instruments that display fundamental frequency (f0) contours of spoken utterances in real time on a computer screen have been used to enable participants to visually compare their speech with a model (see Anderson-Hsieh, 1994; Chun, 1998; Ehsani & Knodt, 1998; Neri, Cucchiarini, Strik, & Boves, 2002, for the use of f0 for learning intonation). Training with Kay Elemetrics’ CSL-Pitch Program, for example, can simultaneously direct participants’ attention to two prosodic aspects of speech: pitch, as indicated by the height of f0 contours (i.e., Hertz on the vertical axis in Fig. 1), and duration, as indicated by the horizontal length of f0 contours (i.e., seconds on the horizontal axis in Fig. 1).
In the present study, the efficacy of training with this instrument was assessed with regard to three questions, which are discussed in detail below.

Fig. 1. Examples of fundamental frequency contours of model sentences spoken by two native Japanese speakers and accompanying prosody graphs used in training session 9. In the actual handouts of prosody graphs, words and sentences were also written in the Japanese orthography, hiragana and katakana.

1.1. Trained Versus Control Groups
Training techniques for the production of L2 contrasts have been developed and examined in several experimental studies (Dowd, Smith, & Wolfe, 1997; Kawai & Hirose, 2000; Landahl et al., 1992; Landahl & Ziolkowski, 1995; Leather, 1990; Schmidt & Meyers, 1995). These studies conducted experiments to show that certain aspects of L2 speech production can be trained, and that production training might also lead to improved perception. However, the development of training techniques for L2 pronunciation is in its infancy, and there is much to be explored in assessing whether various methods of pronunciation training are effective in enabling subjects to accurately produce L2 contrasts (see Neri et al., 2002; Pennington, 1999, for discussion). In particular, none of the experimental studies above compared trained subjects with control subjects who did not participate in training. Simply testing trained subjects before and after training does not reveal how much of the subjects’ improvement could be attributed to the training and how much was from simply repeating the tests. In order to differentiate the training effect from the repetition effect, the present study compared trained and control (untrained) subjects in a pre-test/post-test design. An effect of training should be indicated by the difference between the two groups’ improvements from the pre-test to the post-test. The first question explored in the present study, therefore, was whether native English subjects who have participated in production training improve their performance significantly more than the control subjects who only take a pre-test and a post-test.

1.2. Word-in-Isolation Versus Word-in-Sentence Contexts
In seven out of the ten training sessions of the present training program, target words were embedded in phrases or sentences. This contrasts with the earlier studies (e.g., Kawai & Hirose, 2000; Landahl & Ziolkowski, 1995; Leather, 1990) that presented only isolated words of L2 contrasts. For example, in Figure 1, the minimal pair words contrasting in pitch, /káu/ (high-low) (shown in the upper panel of Fig. 1) versus /kau/ (low-high) (shown in the lower panel), were not presented sequentially in isolation, but embedded in sentences. Similarly, the durational difference between /kawaí:/ and /kawai/ in Figure 1 was not made explicit, since the words were embedded in separate sentences. Thus, subjects’ attention was directed not to how these words minimally differed from each other, but to imitating the pitch contours of the overall utterances.
The intent of this method was to provide subjects with minimal pair words in a variety of connected speech contexts. A number of studies have shown that perception and production of words or segments in isolation is different from that of words in sentence contexts (Greenspan, Nusbaum, & Pisoni, 1988; Hirata, 1990a, 1990b, 1999a, 1999b; Strange et al., 1998; Weiss, 1992). The second question in this study, therefore, was whether production training that focuses on words embedded in phrases or sentences has an effect on trained subjects’ performance in isolated word and sentence contexts.
1.3. Production Versus Perception
Several studies agree that gains from perceptual training on L2 contrasts can transfer to production (Bradlow, Pisoni, Yamada, & Tohkura, 1997; Lambacher, Martens, & Kakehi, 2002; Rochet & Chen, 1992; Wang, 2001). However, there has been little agreement on the effects of production training relative to perceptual training (Catford & Pisoni, 1970; Leather, 1990; Leather & James, 1991; Schneiderman, Bourdages, & Champagne, 1988; Weiss, 1992). Leather and James (1991) concluded, “training in one modality tended to be sufficient to enable a learner to perform in the other” (p. 320). On the other hand, Schneiderman et al. (1988) pointed out that the training of production or perception may dissociate the link between production and perception that the trained subjects initially had.

In the present study, the task for the native English participants during the production training was to match f0 and duration patterns exhibited by Japanese target words (either isolated or embedded in sentences). Participants were given as real-time visual feedback the f0 contours of their own speech and of models. They could also listen to model tokens whenever they wished during training. Thus, this method did involve perception to some extent.¹ However, since subjects might attempt to imitate the models without attending to the actual sounds of the tokens, but only to the visual input, it is not clear whether this training method affects subjects’ perception as well as production. The third question addressed in this study was whether the present training method had effects not only on production but also on perception. For this question, subjects were tested in these two domains before and after training.

¹ This training procedure closely followed de Bot’s (1983) production training procedure, with which subjects were trained on English sentential intonation. In his experiment, unlike the present one, the number of times subjects listened to models and produced tokens was recorded on a computer. The number of times subjects produced tokens exceeded the number of times they listened to models in de Bot (1983). From this result, it was assumed that in the present study as well, subjects’ activities would involve more production than perception (see also the training procedure section of the present paper).

2. QUESTIONS
In summary, the questions addressed by this study were:
(1) Is production training that provides f0 contours as visual feedback effective in assisting the overall acquisition of Japanese pitch and duration contrasts? Does the overall improvement of the trained group from the pre- to the post-tests differ from the improvement of the control group?
(2) Having participated in the training that provided isolated words, phrases, and sentences, does the trained group improve its test scores in both word-in-isolation and word-in-sentence contexts?
(3) Does this production training have effects on both production and perception? If production learning occurs, is it accompanied by perceptual learning?
3. METHODOLOGY
The structure of the experiment was as follows: one group of native English speakers was trained to produce Japanese pitch and durational contrasts in isolated words and in words embedded in phrases or sentences. This trained group took a pre-test (before training) and a post-test (after training) in both production and perception. A control group also took the same pre- and post-tests, but did not participate in training. Production and perception were assessed with isolated words (word-in-isolation context) and with those same words embedded in carrier sentences (word-in-sentence context).

3.1. Subjects
Eight native speakers of English (5 males and 3 females) who were taking a second-year Japanese course at the University of Chicago (50 min a day, five times a week) participated. The subjects were randomly assigned to a training and a control group (n = 4 each). The subjects’ ages ranged from 18 to 21 years old (mean = 19.3 for each group). None of the subjects reported hearing problems. Besides having taken the introductory Japanese class at the University of Chicago in the previous year, five subjects had studied Japanese before college. Three subjects in the training group studied Japanese for 2 months, 2 years, and 3 years in high school, respectively, and two subjects in the control group studied it for 4 years in high school. The subjects had had exposure to spoken Japanese mainly by listening to their instructors and to audiotapes that were required for their current and previous courses. Two of the subjects (one in the training group and one in the control group) had studied French in high school for less than 4 years, but the others had no experience of studying any other foreign languages.

The training group participated in a pre-test, training, and a post-test, and the control group took the pre- and the post-tests at the same time as the training group. The control group, in addition, was given the opportunity to participate in the training after the post-test, although their training data were not part of the present experiment.²

3.2. Training Procedure
The training materials included 33 pairs and 2 triplets of Japanese words contrasting in pitch and duration. These pairs and triplets of words were presented in isolation or embedded in phrases or sentences. The training consisted of 10 training sessions. Sessions 1–3 presented a total of 47 words in isolation. Session 4 presented words in 15 short phrases, and Sessions 5–10 presented words in a total of 57 sentences of different length. The speech materials were recorded by four native speakers of Japanese (2 males and 2 females), as models for subjects’ production training. The materials were all recorded on DAT cassettes, and then digitized at 20 kHz on a PC.

An acoustic display of speech on a computer is not easily interpretable by non-phoneticians. The problem of dealing with raw acoustics of speech in computer assisted language learning has been pointed out by Chun (1998), Pennington (1999), and Neri et al. (2002). For example, word boundaries do not correspond to breaks in f0 lines, and an f0 peak might not exactly correspond to the location of pitch accent perceived by native Japanese speakers (Hasegawa & Hata, 1992; Sugitoo, 2000). It may not be clear to L2 learners whether a pitch rise or fall is phonetically relevant. To remedy such problems, a handout was made for each session, in which explanation was given on how to pronounce the utterances using prosody graphs (see Matsuzaki, Kushida, Tsukiji, & Kawano, 1997). A prosody graph (Fig. 1; with circles and arrows) is a schematic description of how native Japanese speakers perceive f0 contours. The length of the vertical lines indicated perceived relative pitch height, and the horizontal lines at the bottom corresponded to words or phrases.
With the prosody graphs, subjects were able to figure out on their own which part of the words or sentences corresponded to which part of the f0 contours, and roughly where the word boundaries would be in the f0 contours.

Subjects were trained individually in the language laboratory of the University of Chicago. In each of the 10 training sessions, the subjects opened each model audio file, listened to the model (isolated word, word in phrase, or word in sentence), and watched the pitch pattern of the model in real time. Then they produced the utterance in a separate, empty window, and the program overlaid the model pitch contours onto the subjects’ pitch contours. The subjects repeated this procedure, producing the same material until their pitch contours matched the model contours. The subjects were able to listen to model tokens whenever they wanted (see de Bot, 1983, and footnote 1). Although the number of times subjects listened to models was not recorded, the experimenter (Y. H.) observed in all the training sessions that it was approximately 1–4 times per token. It was also observed that the number of times they listened to models did not exceed the number of times they produced them. Subjects were instructed to move on to the next training set when they thought that the steepness and the duration of their f0 contours matched those of the model.³ The experimenter (Y. H.) was available throughout the subjects’ training sessions to answer their questions regarding the operation of CSL-Pitch or the materials. Each session took about 30 minutes. The 10 training sessions were completed in three and a half weeks.

² This procedure was adopted for the following reason: The usual experimental design for finding effects of training is that the training group participates in a pre-test, training, and a post-test, and a control group does not receive any training but only takes a pre- and a post-test. This design, however, is problematic in a real educational setting, since language instructors do not normally like the experimenter to give their students unequal opportunities. In addition, if the recruiting method was to ask for volunteers for each group, we might end up creating two groups that differed in their motivation. Therefore, a training session was added for the control group after the two experimental tests had taken place. In this way, we gave both groups an equal opportunity to experience production training with visual feedback.

³ The subjects were also told that the f0 range for female models was greater than that for male models in general (cf. a male and a female speaker in Fig. 1). They were told that it was acceptable for a male subject’s f0 contours not to show rises and falls as steep as those of a female model, and vice versa.

3.3. Testing Procedure
3.3.1. Production Tests
Production and perception tests were conducted before and after training. Twenty-one words including pitch and durational contrasts were chosen for production tests, for example, /áme/ vs. /ame/, and /íSo/ vs. /iS:o/ vs. /iS:o:/. Nine of the 21 words were used in training, and 12 words were not. The same sets of words were used in the pre- and the post-tests. Subjects’ production of the 21 words was recorded three times in isolation (word-in-isolation context), and three times in the carrier sentence /wataSi wa ____ ga ienakat:a/ (‘I could not say ___.’) (word-in-sentence context). The order of the 21 words was randomized in each of the three repetitions.

The subjects’ production was evaluated by two native speakers of Japanese from Tokyo, who were naive as to the purpose of the experiment and who had never heard the subjects’ voices. Each judge identified a total of 2016 utterances produced by all subjects [21 words × 3 repetitions × 2 contexts (word-in-isolation and word-in-sentence) × 2 tests (pre- and post-tests) × 8 subjects]. Analog tape copies of the original DAT cassettes were prepared for the pre-test and the post-test and randomly labeled Test A and Test B. One judge listened to Test A first, and then proceeded to Test B, whereas the other judge listened to the two sets of tapes in the opposite order. The order of the subjects that each judge heard was randomized in each of Tests A and B, but production across different subjects was not mixed, that is, each judge identified all of one subject’s production before moving to the next subject. The 21 words were presented in the same randomized order in which the subjects had originally recorded them. Each judge identified all the utterances produced by all subjects.

On the evaluation sheets, the judges were asked to choose one out of 3–5 alternatives, for example, “áme/ame/‘cannot-tell’” (three alternatives), “íSo/iS:o/iS:o:/‘cannot-tell’” (four alternatives), and “tóku/toku/tok:u/to:kú/‘cannot-tell’” (five alternatives), which were all written in Japanese orthography. For example, for the word /tóku/, which was randomly ordered among the 21 test words and recorded by subjects, the judges were asked which of the four words (/tóku/, /toku/, /tok:u/, /to:kú/) they thought the subjects had intended to say. The number of the judges’ identifications that matched the subjects’ intent (production intelligibility) was then calculated. Since preliminary analysis of each judge’s identifications indicated no major discrepancy between the two judges, the present paper reports only the combined data.

3.3.2. Perception Tests
Perception tests also consisted of a pre-test and a post-test. Each test contained 30 novel words that did not appear in training. The perceptual stimuli were recorded by two female native speakers of Japanese. Thirty Japanese words were produced both in isolation (word-in-isolation context) and in carrier sentences (word-in-sentence context). Thus, there were a total of 60 utterances (30 words × 2 speakers) in each of the four tests (the word-in-isolation and the word-in-sentence tests in both the pre- and the post-tests), randomly ordered and divided into six blocks.
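The utterance counts given in this section follow from fully crossing the design factors. As a quick arithmetic check, the sketch below enumerates those crossings. The variable names and factor labels are illustrative only and are not part of the original study materials.

```python
from itertools import product

# Production test design as described in Section 3.3.1 (labels are illustrative).
words = range(21)        # 21 test words
repetitions = range(3)   # each word recorded three times per context
contexts = ["word-in-isolation", "word-in-sentence"]
tests = ["pre-test", "post-test"]
subjects = range(8)      # 4 trained + 4 control subjects

# Every combination is one utterance that each judge had to identify.
production_utterances = list(product(words, repetitions, contexts, tests, subjects))
print(len(production_utterances))  # 21 * 3 * 2 * 2 * 8 = 2016

# Perception test design: 30 words, each spoken by 2 speakers, in each of the four tests.
perception_per_test = len(list(product(range(30), range(2))))
print(perception_per_test)  # 30 * 2 = 60
```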
For the perception tests of words in isolation, subjects first clicked a play button on a computer screen to listen to each word spoken in isolation. Subjects then chose, by clicking, one of the nine pitch patterns (HL, LH, HLL, LHL, LHH, HLLL, LHLL, LHHL, LHHH; H = high, L = low) of prosody graphs that matched the word they had just heard. This method required the ability to identify not only the pitch but also the duration of the words. For example, for the word /kate:/ ‘family,’ the correct answer is LHH (HH representing a long vowel), and not LH (H representing a short vowel).

For the perception tests of words in sentences, a carrier sentence /soko wa ____ to jonde kudasai/ (‘Please read there as ___.’) was written on the computer screen. Subjects saw this carrier sentence, clicked the play button, and heard a sentence containing one of the target words, for example, /soko wa híju to jonde kudasai/. Then the subjects were asked to choose the pitch pattern that matched the target word. In the example given, the correct answer for /híju/ ‘metaphor’ was HL. Each of the pre- and the post-tests lasted about 30 min.

3.4. Analyses
The percent correct scores of the word-in-isolation and the word-in-sentence tests in pre- and post-tests were calculated. These perceptual scores and the production intelligibility scores (i.e., percent correct scores of the native Japanese judges’ identifications of the subjects’ production) were examined in a mixed-design analysis of variance (ANOVA). The factors were Group (trained vs. control), Test (pre-test vs. post-test), Context (word-in-isolation vs. word-in-sentence tests), and Domain (production vs. perception). Group was a between-subjects factor, and Test, Context, and Domain were within-subjects factors.
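A percent correct score for the nine-alternative perception task could be computed along the lines of the sketch below. Only the nine pitch patterns and the two answer keys (/kate:/ as LHH, /híju/ as HL) come from the text; the function name, the ASCII word labels, and the response data are hypothetical and serve purely as illustration.

```python
# The nine response alternatives offered in the perception tests.
PATTERNS = ["HL", "LH", "HLL", "LHL", "LHH", "HLLL", "LHLL", "LHHL", "LHHH"]

def percent_correct(responses, answer_key):
    """Return the percentage of items whose chosen pattern equals the key."""
    for choice in responses.values():
        if choice not in PATTERNS:
            raise ValueError(f"{choice!r} is not one of the nine alternatives")
    hits = sum(1 for word, choice in responses.items() if answer_key[word] == choice)
    return 100.0 * hits / len(responses)

# Answer keys for the two examples given in the text (ASCII labels, accents omitted).
answer_key = {"kate:": "LHH", "hiju": "HL"}

# Hypothetical responses: the long vowel in /kate:/ is missed (LH instead of LHH).
responses = {"kate:": "LH", "hiju": "HL"}

score = percent_correct(responses, answer_key)
print(score)  # 50.0
```

Because a response counts as correct only when both pitch and length are right, choosing LH for a word whose key is LHH is scored as an error, which mirrors the requirement that subjects identify duration as well as pitch.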
4. RESULTS
4.1. Overall Effects of Training
The first question of interest was whether the trained group improved its overall test scores from the pre- to the post-tests significantly more than the control group. The trained group of subjects showed a larger improvement in their overall test scores (including production and perception tests) from the pre-test (52.4%) to the post-test (69.5%) than the control group did (pre-test: 52.3%, post-test: 57.3%). The four-factor ANOVA revealed a significant Group × Test interaction [F(1, 6) = 11.972, p = .013]. This indicated that the trained and the control groups showed significantly different amounts of improvement from the pre-test to the post-test. For the pre-test, the difference between the scores of the two groups was not significant [t(6) = 0.008, p = .190]. However, for the post-test, the difference between the two groups was significant [t(6) = 1.364, p = .003]. These results indicate that the training program was effective for improving the trained subjects’ overall ability to distinguish Japanese words contrasting in pitch and duration.

4.2. Training Effects on Word-in-Isolation Versus Word-in-Sentence Contexts
The next question evaluated was whether the trained subjects showed equal improvement for the word-in-isolation and word-in-sentence contexts. The four-factor Group × Test × Context × Domain ANOVA revealed only a marginal Group × Test × Context interaction [F(1, 6) = 5.156, p = .064]. Figure 2 shows the trained versus the control groups’ pre- and post-test scores in the two contexts separately. In this figure, the subjects’ production and perception scores in each context were combined and averaged. In the word-in-sentence context ((b) in Fig. 2), the trained group showed a significant improvement: the difference between the pre- and the post-tests for the trained group (21.4%) was significant [t(3) = 4.949, p = .016]. In contrast, the difference for the control group (0%) was non-significant [t(3) = 0.006, p = .996]. In the word-in-isolation context ((a) in Fig. 2), the difference between the two groups’ improvement was much less than that in the word-in-sentence context. However, the trained group’s improvement (12.8% from the pre- to the post-test) reached significance [t(3) = 3.326, p = .045] while the control group’s improvement (10%) did not [t(3) = 2.132, p = .123]. In summary, the trained group showed significant improvement in the two contexts, but it improved more in the word-in-sentence than in the word-in-isolation context. In contrast, the control group’s improvement was non-significant in both the word-in-isolation and the word-in-sentence contexts.

Fig. 2. Percent correct scores for trained and control groups at pre- and post-tests in word-in-isolation context (a) and word-in-sentence context (b). The scores include both production and perception scores, pooled for each context. The error bars represent one standard error from the mean.

4.3. Training Effects on Production and Perception
The above four-factor ANOVA revealed no interaction among Group, Test, and Domain [F(1, 6) = .034, p = .860]. This indicates that the Group × Test interactions in the two domains (production and perception) were comparable to the Group × Test interaction reported earlier. Figure 3 shows the percent correct scores of the pre- and the post-tests by the trained versus the control groups separately for perception and production, averaging over the word-in-isolation and the word-in-sentence tests.
For both perception and production, the trained group showed greater improvement from the pre- to the post-tests than the control group. For perception ((a) in Fig. 3), the difference between the pre-test scores of the two groups (0.9%) was not significant [t(6) = 0.084, p = .101]. However, the difference between the post-test scores of the two groups (12.1%) was significant [t(6) = 1.453, p = .004]. Comparable results were found for production ((b) in Fig. 3). In summary, the trained subjects significantly improved their ability not only to produce but also to perceive the pitch and durational contrasts.

Fig. 3. Percent correct scores of perception (a) and production (b) tests for trained and control groups at pre- and post-tests. The perception scores were calculated from the number of test items subjects correctly identified. The production scores were calculated from the number of native Japanese judges’ identifications of subjects’ production (out of 21 words) that matched the subjects’ intent (production intelligibility). The scores of word-in-isolation and word-in-sentence tests were pooled for each of perception and production. The error bars represent one standard error from the mean.

An additional analysis examined production scores obtained just for novel words (i.e., scores on the nine words that appeared in training were excluded from the overall production scores reported earlier). The production scores on the 12 novel words by the trained versus control groups were calculated for both the pre- and the post-tests. The results were comparable to those on all production words. At the pre-test, the score of the trained group did not differ from that of the control group (69.3% vs. 69.3%) [t(6) = 0, p = .413]. However, at the post-test, the score of the trained group was significantly higher than that of the control group (83.5% vs. 77.1%) [t(6) = .631, p = .040]. In summary, effects of training were found not only on the overall production scores, but also on the scores excluding the practiced production words.
Table 1. Individual Subjects’ Improvement on Production Versus Perception Test Scores.

Subjects     Production                         Perception
             Pre-test  Post-test  Difference    Pre-test  Post-test  Difference
Trained 1    71.0      89.7       18.7          46.7      55.8        9.1
Trained 2    64.7      75.4       10.7          33.3      53.3       20.0
Trained 3    51.6      81.0       29.4          36.7      52.5       15.8
Trained 4    80.2      87.7        7.5          35.0      60.8       25.8
Control 1    71.4      82.1       10.7          42.5      52.5       10.0
Control 2    41.7      47.2        5.5          19.2      32.5       13.3
Control 3    85.3      90.9        5.6          63.3      61.7       −1.7
Control 4    65.1      63.9       −1.2          30.0      27.5       −2.5

Note. The columns of “Pre-test” and “Post-test” indicate the percent correct scores of the respective tests. The scores of word-in-isolation and word-in-sentence tests were averaged in each of pre- and post-tests. The “Difference” score indicates the Post-test minus Pre-test score for each individual.
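As a consistency check on Table 1, the sketch below recomputes the overall group means reported in Section 4.1 and illustrates one way to look at the Group × Test interaction: in a balanced design with two groups and two test occasions, the interaction F(1, df) equals the square of an independent-samples t statistic computed on each subject’s gain score (post-test minus pre-test). Averaging production and perception with equal weight, and the helper names used here, are assumptions made for illustration. The analysis in the paper was presumably run on unrounded scores, so small rounding differences are to be expected.

```python
from statistics import mean, variance

# (production pre, production post, perception pre, perception post) from Table 1.
trained = [(71.0, 89.7, 46.7, 55.8), (64.7, 75.4, 33.3, 53.3),
           (51.6, 81.0, 36.7, 52.5), (80.2, 87.7, 35.0, 60.8)]
control = [(71.4, 82.1, 42.5, 52.5), (41.7, 47.2, 19.2, 32.5),
           (85.3, 90.9, 63.3, 61.7), (65.1, 63.9, 30.0, 27.5)]

def overall(group):
    """Per-subject overall pre and post scores (production and perception averaged)."""
    pre = [(prod_pre + perc_pre) / 2 for prod_pre, _, perc_pre, _ in group]
    post = [(prod_post + perc_post) / 2 for _, prod_post, _, perc_post in group]
    return pre, post

def gain_scores(group):
    """Per-subject improvement: overall post-test minus overall pre-test."""
    pre, post = overall(group)
    return [after - before for before, after in zip(pre, post)]

def pooled_t(x, y):
    """Independent-samples t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / (pooled * (1 / nx + 1 / ny)) ** 0.5

# Group means, for comparison with the percentages reported in Section 4.1.
for name, group in (("trained", trained), ("control", control)):
    pre, post = overall(group)
    print(name, round(mean(pre), 1), round(mean(post), 1))

# Squared t on gain scores, for comparison with the reported F(1, 6) = 11.972.
t = pooled_t(gain_scores(trained), gain_scores(control))
f_interaction = t ** 2
print(round(f_interaction, 2))
```

Working from the pre- and post-test columns rather than the printed Difference column keeps the sign of each subject’s change explicit, which matters for the two control subjects whose scores decreased.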
The scores of individual subjects were examined to evaluate changes in production and perception. Table 1 shows the individual subjects’ percent correct pre- and post-test scores for production and perception. The trained subjects’ improvement in production and perception was not consistently greater than the control subjects’ improvement. The magnitude of training effects, if any, depended on each subject. For trained subjects #2 and #4, the effect of training was found more clearly in perception than in production. For example, trained subject #2 and control subject #1 both showed the same amount of improvement in production (10.7%), but trained subject #2’s improvement was twice as great as control subject #1’s in perception (20.0% vs. 10.0%). Similarly, trained subject #4’s improvement in production was greater than that of control subjects #2 and #3 only by 1.9–2.0%. However, trained subject #4’s perception improvement (25.8%) was approximately twice as much as that of control subject #2 (13.3%), and contrasted even more with that of control subject #3 (−1.7%).

In contrast, the effect of training was greater for production than perception in the comparison of trained subject #1 and control subject #1. Although their improvement in perception was similar (9.1% vs. 10.0%), the improvement of trained subject #1 (18.7%) was superior to that of control subject #1 (10.7%) in production. Similarly, the production improvement of trained subject #3 (29.4%) was greater than that of control subject #2 (5.5%) (and greater than that of all control subjects), although the perception improvement of these two subjects was comparable (15.8% vs. 13.3%).

In summary, the group means indicated that the trained subjects improved both in production and perception. However, the comparison of individual subjects indicated that the degree of improvement in the trained subjects varied in production and perception. Two trained subjects (#2 and #4) showed greater perception improvement than the control subjects whose production improvement was comparable. The other trained subjects (#1 and #3), however, showed greater production improvement than the control subjects whose perception improvement was comparable.⁴
4 Trained subject #3 had studied Japanese in high school for 3 years, and this subject made the greatest overall improvement in perception and production. Control subjects #1 and #4 were the two who had studied Japanese in high school for 4 years, but they showed contrasting amounts of improvement in perception and production. In the present study, the individual differences observed for perception and production did not seem to be related to the subjects’ previous experience with Japanese. However, it would be interesting to examine systematically in the future whether the amount of improvement bears any relationship to the amount of exposure to Japanese the subjects previously had.
5. CONCLUSIONS This study yielded three major findings. First, the subjects who participated in the production training improved their overall test scores significantly more than the control subjects did. With the CSL-Pitch Program, the trained subjects’ attention was directed to both the pitch and the duration of the utterances they practiced producing. The focus of the training was for subjects to match the f0 contours of words, phrases, and sentences to those of the models. In 7 out of 10 sessions, minimal word pairs were used only implicitly in phrases and sentences, for example, /kawai/-/kawaí:/ and /káu/-/kau/ (Fig. 1). It is worth noting that the testing methods were quite different from the training method. In the training, subjects were never asked to identify the pitch patterns of words, as they were in the perception tests. Similarly, for production, their task in the training was simply to match the f0 contours of their production to those of the models, and they were never given feedback as to whether the words or the sentences they had produced could be perceived correctly by native Japanese listeners. The training resulted in the increased ability to perceive the pitch and durational
patterns of the words, and the increased intelligibility of the subjects’ production as evaluated by native Japanese judges. The trained group showed improvement on both the practiced and the novel production test materials, and also on the perception test materials, which were all novel. In conclusion, the subjects who participated in the present training while taking the regular Japanese language course acquired a generalized ability to produce and perceive Japanese words contrasting in pitch and in duration. In contrast, the control subjects, who attended only the Japanese language course and did not participate in the present training, did not significantly improve their production and perception.
The second finding concerns the subjects’ performance as tested in isolated word and sentence contexts. The present training method using words, phrases, and sentences was found to have an effect on the trained group’s performance in both word-in-isolation and word-in-sentence contexts, extending the finding by Hirata (1999b). The control group showed greater improvement in the word-in-isolation than in the word-in-sentence context, although neither improvement was significant. In contrast, the trained group made robust improvement in the word-in-sentence context, and also showed a significant but smaller improvement in the word-in-isolation context. These results contrast with those of Weiss (1992), in which trained L2 learners improved their scores for isolated words but not for sentences. An implication of the present results is that L2 learners can learn from sentences to distinguish minimal pair words, even though their perception and production of L2 contrasts are consistently less accurate in sentence than in word contexts. Developing abilities to produce and perceive L2 contrasts in sentences, instead of only in an isolated word context, is an important factor for L2 learners coping with real-world situations. 
They must eventually learn to pay attention to multiple aspects of speech, that is, not only how a word is distinguished from another in a minimal pair but also, for example, whether the word has a contrastive pitch in a sentence. (In reality, L2 learners must also integrate the acquired phonetic and phonological knowledge into the use of language for meaningful communication, which is beyond the scope of this paper.) It is traditional to present minimal pair words of L2 contrasts in isolation in order to direct learners’ attention to fine differences between those words. The finding that subjects could learn from sentences, however, suggests that presentation in isolation may not be necessary, at least for intermediate or advanced learners. Another implication of the present findings is that learning that takes place during training reflects the kinds of stimuli given in training. The training
included seven sessions of phrases and sentences, and only three sessions of isolated words. The proportion of the phrase/sentence versus the isolated word materials seems to be reflected in the trained group’s robust improvement in the word-in-sentence context versus less robust improvement in the word-in-isolation context.
The third finding was that training subjects in producing Japanese words and sentences affected not only their production but also their perception. The group means suggest that the trained group learned to both produce and perceive novel words correctly, and that this production training did indeed involve implicit perceptual learning. This result might not be surprising, since the training had both explicit production and implicit perception components. Subjects’ practice in producing utterances with visual feedback might have strengthened the link not only between production and relevant visual information (de Bot, 1983; de Bot & Mailfert, 1982; Dowd et al., 1997; Flege, 1988), but also between perception and visual information (Akahane-Yamada, Bradlow, Pisoni, & Tohkura, 1997; McGurk & MacDonald, 1976). The method used in the present training does not seem to have dissociated the link between production and perception that subjects initially had, as pointed out by Schneiderman et al. (1988). However, this finding must be interpreted with caution. Out of four trained subjects, two (subjects #2 and #4) showed greater improvement in perception than production when they were compared to comparable control subjects. The opposite pattern of improvement was found for the other two trained subjects (subjects #1 and #3). Bradlow et al. 
(1997) concluded from their perception training study that perceptual learning generally transfers to production, but ‘‘learning in the perceptual domain is not a necessary or sufficient condition for learning in the production domain: the processes of learning in the two domains appear to be distinct within individual subjects’’ (p. 2307). Since the number of subjects in the present study was small, more data are necessary to determine whether production training also yields great individual variation in the production-perception improvement as in Bradlow et al. (1997).
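The contour-matching task described in the first finding can be made concrete with a small numerical sketch. In the study, subjects compared their own f0 contour with the model's contour visually on the screen; no numerical matching score is reported, so everything below is an illustrative assumption rather than a description of the CSL-Pitch Program. The function names (`to_semitones`, `resample`, `contour_distance`, `duration_ratio`), the semitone normalization, the 50-point time normalization, and the sample f0 values are all hypothetical.

```python
import math


def to_semitones(f0_hz):
    """Convert voiced f0 values (Hz) to semitones relative to the
    speaker's own geometric-mean f0 (assumed normalization, so that
    speakers with different pitch ranges can be compared)."""
    if not f0_hz or any(f <= 0 for f in f0_hz):
        raise ValueError("f0 values must be a non-empty list of positive numbers")
    log_mean = sum(math.log2(f) for f in f0_hz) / len(f0_hz)
    return [12.0 * (math.log2(f) - log_mean) for f in f0_hz]


def resample(values, n):
    """Linearly interpolate a sequence onto n evenly spaced points."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if len(values) == 1:
        return [values[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(values) - 1) / (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def contour_distance(model_hz, learner_hz, n=50):
    """Root-mean-square difference (in semitones) between two contours
    after time normalization. Smaller values mean more similar shapes."""
    m = resample(to_semitones(model_hz), n)
    s = resample(to_semitones(learner_hz), n)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m, s)) / n)


def duration_ratio(model_hz, learner_hz):
    """Learner-to-model length ratio, assuming both contours were
    sampled with the same frame step. Time normalization above discards
    duration, so this is reported separately."""
    return len(learner_hz) / len(model_hz)


# Made-up f0 tracks (Hz), for illustration only.
model = [220.0, 215.0, 180.0, 150.0]          # falling contour
same_shape_lower_voice = [f / 2 for f in model]
flat = [120.0, 120.0, 120.0, 120.0]

print(contour_distance(model, same_shape_lower_voice))
print(contour_distance(model, flat))
print(duration_ratio(model, flat))
```

Under these assumptions, a learner who reproduces the model's pitch shape in a lower voice scores close to zero, while a flat contour scores higher. Because the study's contrasts involve duration as well as pitch, a shape score of this kind would need to be read together with a duration measure such as the ratio above.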
6. DIRECTIONS FOR FUTURE RESEARCH The present experiment was limited in that only a small number of subjects was tested. Nevertheless, the three major findings above motivate further studies using this training method to address different theoretical issues
specifically. Testing native English learners of Japanese at several different institutions who are matched in their age, duration of study of Japanese, and amount of exposure to spoken Japanese should add to the generalizability of the present findings. Future research should also examine long-term effects of this training, that is, whether the trained subjects retain their test scores several weeks or months after the completion of the training. Further, it would be of interest to examine whether other types of training methods, for example, audio tape listening, can yield the same results as those found in this study. This examination will clarify whether the improvement subjects showed in the present study is due simply to the extra time they spent in training or to the specific effects of the present training method. In conclusion, the findings in the present experiment suggest that the production training method developed and tested in this study was effective in enabling native English speakers to produce and perceive Japanese pitch and durational contrasts. This exploratory experiment, however, motivates further studies to address various theoretical and pedagogical issues on the production and perception of L2 contrasts.
ACKNOWLEDGMENTS This experiment was conducted using the facilities of the Language Laboratories and Archives of the University of Chicago, supported by the Consortium for Language Teaching and Learning. I thank Karen L. Landahl, Barbara Need, and Jack Dovidio for their assistance and encouragement, and Jon Bernard for creating and implementing the Tcl/Tk program for the experiment. I also thank James E. Flege, Satomi Imai, Kimiko Tsukada, Katsura Aoyama, and Tetsuo Harada for their help in improving this manuscript.
REFERENCES
Akahane-Yamada, R., Bradlow, A.R., Pisoni, D.B., & Tohkura, Y. (1997). Effects of audiovisual training on the identification of English /r/ and /l/ by Japanese speakers. Journal of the Acoustical Society of America, 102–105(2), 3137.
Anderson-Hsieh, J. (1994). Interpreting visual feedback on suprasegmentals in computer assisted pronunciation instruction. CALICO Journal, 11(4), 5–22.
Bradlow, A.R., Pisoni, D.B., Yamada, R.A., & Tohkura, Y. (1997). Training Japanese listeners to identify English /r/-/l/. IV. Some effects of perceptual learning on speech production. Journal of the Acoustical Society of America, 101, 2299–2310.
Catford, J.C., & Pisoni, D.B. (1970). Auditory vs. articulatory training in exotic sounds. Modern Language Journal, LIV(7), 477–481.
Chun, D.M. (1998). Signal analysis software for teaching discourse intonation. Language Learning and Technology, 2(1), 61–77.
de Bot, K. (1983). Visual feedback of intonation. I. Effectiveness and induced practice behavior. Language and Speech, 26(4), 331–350.
de Bot, K., & Mailfert, K. (1982). The teaching of intonation: Fundamental research and classroom applications. TESOL Quarterly, 16, 71–77.
Dowd, A., Smith, J., & Wolfe, J. (1997). Learning to pronounce vowel sounds in a foreign language using acoustic measurements of the vocal tract as feedback in real time. Language and Speech, 41(1), 1–20.
Ehsani, F., & Knodt, E. (1998). Speech technology in computer-aided language learning: Strengths and limitations of a new CALL paradigm. Language Learning and Technology, 2(1), 45–60.
Flege, J.E. (1988). Using visual information to train foreign-language vowel production. Language Learning, 38(3), 365–407.
Greenspan, S.L., Nusbaum, H.C., & Pisoni, D.B. (1988). Perceptual learning of synthetic speech produced by rule. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(3), 421–433.
Han, M.S. (1992). The timing control of geminate and single stop consonants in Japanese: A challenge for nonnative speakers. Phonetica, 49, 102–127.
Hasegawa, Y., & Hata, K. (1992). Fundamental frequency as an acoustic cue to accent perception. Language and Speech, 35(1, 2), 87–98.
Hirata, Y. (1990a). Perception of geminated stops in Japanese word and sentence levels. The Phonetic Society of Japan, 194, 23–28.
Hirata, Y. (1990b). Perception of geminated stops in Japanese word and sentence levels by English-speaking learners of Japanese language. The Phonetic Society of Japan, 195, 4–10.
Hirata, Y. (1999a). Acquisition of Japanese rhythm and pitch accent by English native speakers. Doctoral dissertation, University of Chicago. Ann Arbor: Bell & Howell.
Hirata, Y. (1999b). Effects of sentence- vs. word-level perceptual training on the acquisition of Japanese durational contrasts by English native speakers. In Proceedings of the 14th International Congress of Phonetic Sciences (Vol. 2, pp. 1413–1416). San Francisco, CA.
Kawai, G., & Hirose, K. (2000). Teaching the pronunciation of Japanese double-mora phonemes using speech recognition technology. Speech Communication, 30, 131–143.
Lambacher, S., Martens, W., & Kakehi, K. (2002). The influence of identification training on identification and production of the American English mid and low vowels by native speakers of Japanese. In Proceedings of the International Conference on Spoken Language Processing (pp. 245–248). Denver, CO.
Landahl, K.L., Ziolkowski, M., Usami, M., & Tunnock, B. (1992). Interactive articulation: Improving accent through visual feedback. In Proceedings of the Second International Conference on Foreign Language Education and Technology (pp. 283–292).
Landahl, K.L., & Ziolkowski, M.S. (1995). Discovering phonetics units: Is a picture worth a thousand words? In A. Dainora, R. Hemphill, B. Luka, B. Need, & S. Pargman (Eds.), CLS 31: Papers from the 31st Regional Meeting of the Chicago Linguistic Society, Volume 1: The Main Session (pp. 294–316). Chicago: The Chicago Linguistic Society.
Leather, J. (1990). Perceptual and productive learning of Chinese lexical tone by Dutch and English speakers. In J. Leather & A. James (Eds.), New Sounds 90: Proceedings of the 1990 Amsterdam Symposium on the Acquisition of Second-Language Speech (pp. 72–97). University of Amsterdam.
Leather, J., & James, A. (1991). The acquisition of second language speech. Studies in Second Language Acquisition, 13, 305–341.
Matsuzaki, H., Kushida, M., Tsukiji, Y., & Kawano, T. (1997). Inritsu shidoo no koomoku no teijijun ni tsuite. Presented at Nihongo kyooiku gakkai dai 12 kai kenkyuu shuukai.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Minagawa, Y., & Kiritani, S. (1996). Discrimination of the single and geminate stop contrast in Japanese by five different language groups. Annual Bulletin of Research Institute of Logopedics and Phoniatrics, 30, 23–28.
Neri, A., Cucchiarini, C., Strik, H., & Boves, L. (2002). The pedagogy–technology interface in computer assisted pronunciation training. Computer Assisted Language Learning, 15(5), 441–467.
Oguma, R. (2000). Perception of Japanese long vowels and short vowels by English-speaking learners. Sekai no Nihongo Kyoiku, 10, 43–55.
Pennington, M.C. (1999). Computer-aided pronunciation pedagogy: Promise, limitations, directions. Computer Assisted Language Learning, 12(5), 427–440.
Rochet, B.L., & Chen, F. (1992). Acquisition of the French VOT contrasts by adult speakers of Mandarin Chinese. In Proceedings of International Conference on Spoken Language Processing (Vol. 1, pp. 273–276). Banff, Canada.
Schmidt, A.M., & Meyers, K.A. (1995). Traditional and phonological treatment for teaching English fricatives and affricates to Koreans. Journal of Speech and Hearing Research, 38, 828–838.
Schneiderman, E., Bourdages, J., & Champagne, C. (1988). Second-language accent: Relationship between discrimination and perception in acquisition. Language Learning, 38, 1–19.
Strange, W., Akahane-Yamada, R., Kubo, R., Trent, S.A., Nishi, K., & Jenkins, J.J. (1998). Perceptual assimilation of American English vowels by Japanese listeners. Journal of Phonetics, 26, 311–344.
Sugitoo, M. (2000). Onsee-Hakei wa Kataru. Osaka: Izumi Shoin.
Wang, Y. (2001). Perception, production, and neurophysiological processing of lexical tone by native and non-native speakers. Doctoral dissertation, Cornell University.
Weiss, W. (1992). Perception and production in accent training. Revue de phonétique appliquée, 102, 69–81.
Yamada, T., Yamada, R.A., & Strange, W. (1995). Perceptual learning of Japanese mora syllables by native speakers of American English: Effects of training stimulus sets and initial states. In Proceedings of the 13th International Congress of Phonetic Sciences (Vol. 1, pp. 322–325).