Explicit Pronunciation Training Using Automatic Speech Recognition Technology

Jonathan Dalby, Communication Disorders Technology, Inc.
Diane Kewley-Port, Indiana University and Communication Disorders Technology, Inc.

ABSTRACT

A system is described, provisionally named Pronto, which uses automatic speech recognition (ASR) for training pronunciation of second languages in adult learners. The first version of Pronto was developed for native speakers of American English learning Spanish and for Mandarin Chinese speakers learning English. Pronto grows out of work in the Indiana Speech Training Aid (ISTRA) research program, which has demonstrated significant improvement in the pronunciation of hearing-impaired and normal-hearing but misarticulating children through the use of ASR-derived feedback. This feedback has also been shown to improve pronunciation in adult learners of a second language. Methods are described for developing training in Pronto, and results are presented from evaluating classes of speech recognizers for use in different aspects of pronunciation training.
KEYWORDS Pronunciation Training, Speech Recognition, Speech Training Aids, Evaluation, User Tests, Minimal Pairs
INTRODUCTION

One of the most difficult tasks associated with learning a second language as an adult is mastering the phonological and phonetic systems of the new language. The twenty- or thirty- or even forty-year immigrant who still speaks with a “heavy, thick, or strong” accent is a common anecdotal character. In this paper we will describe a system, provisionally named Pronto, that uses automatic speech recognition (ASR) technology for explicit training of second language pronunciation in adult learners. The first version of Pronto was developed for native speakers of American English who wish to learn Spanish, and for native speakers of Mandarin Chinese learning English. We use the term “explicit training” to describe the system because it includes curricula for specific pairs of native/target language sounds for which empirical methods have been applied to discover typical segmental (phoneme) errors of learners.

Assumptions that have motivated the design features of Pronto include the following:

1) Segment-level (phoneme) errors seriously degrade the intelligibility of speech by nonnatives (Rogers & Dalby, 1996);

2) Typical pronunciation difficulties for a given target language will differ for speakers of different native languages (Kenworthy, 1987; Flege & Wang, 1989; Rochet, 1995);

3) Second language learners may have both a production and a perception intelligibility deficit for certain phonological contrasts of the target language (Strange, 1995);

4) Explicit production and perception training on difficult segmental contrasts in the second language can improve the intelligibility of nonnative speakers (Rogers, Dalby, & DeVane, 1994);

5) Feedback derived from automatic speech recognition technology should be similar to speech quality judgments made by humans (Anderson & Kewley-Port, 1995; Kewley-Port et al., 1991; Watson et al., 1989).

Methods will be described for developing structured curricula for second language intelligibility training in Pronto, as well as techniques for evaluating different ASR technologies to support these curricula. Finally, methods will be discussed for evaluating the effectiveness of the intelligibility training developed in Pronto.
PRIOR WORK IN NATIVE LANGUAGE PRONUNCIATION: SPEECH-INTERACTIVE TRAINING AIDS FOR USE IN SPEECH THERAPY

The work in Pronto stems from the Indiana Speech Training Aid (ISTRA) research program, which seeks to develop training aids for speech-language pathologists and educators of the deaf to use in providing pronunciation training to clients with a variety of speech disorders. One focus is on hearing-impaired and normal-hearing but misarticulating children. ISTRA prototypes have led to demonstrable improvement in the quality of pronouncing target words, with some generalization to nontarget words, based on independent evaluation by juries of listeners (Kewley-Port et al., 1991).

ISTRA employs a template-based, speaker-dependent isolated word recognizer (also known as discrete ASR). This technology is combined with speech drills in graphical game-like formats, with names such as BASEBALL, MOONRIDE, and BOWLING. These formats provide appealing environments for prompting articulation and giving pronunciation feedback. Figure 1 shows the graphical interface for the BOWLING game, where the word to be pronounced is displayed on screen and feedback on pronunciation quality takes the form of the number of pins knocked down.

Figure 1
Graphical Interface from a Speech Drill in ISTRA

Note: The interface is modeled on a bowling game. The learner is prompted to say the word displayed (“rum”), and a pronunciation score appears as the number of pins knocked down.

Feedback on pronunciation quality is determined by the speech recognizer as follows. In an ISTRA application, templates are made using four tokens of a client’s current best productions of a target word as judged by the speech clinician. The clinician elicits these good productions by using traditional articulation training methods. Once a template for an improved target is made, the client can practice the word on the computer without supervision. Feedback is derived from the distance metric of the recognizer, comparing a current production with the template. This ASR-derived metric has been shown to correlate well with speech quality judgments made by human listeners (Watson et al., 1989; Anderson & Kewley-Port, 1995). To inform learners of the goodness of their pronunciation over a series of attempts for a set of words, the system uses graphical displays such as bargraphs. Studies of ISTRA show that children as young as four years can understand and benefit from bargraph feedback (Kewley-Port et al., 1991).
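The ISTRA papers do not publish the recognizer’s distance metric, but the mechanism can be illustrated. The sketch below (in Python, with invented names such as dtw_distance and pins_knocked_down, and hypothetical calibration constants good_dist and poor_dist) assumes a dynamic-time-warping distance over acoustic feature frames and maps the best distance to any of the four stored templates onto a 0-10 “pins” score:

```python
# A minimal sketch of ISTRA-style feedback, assuming a dynamic-time-warping
# (DTW) distance over acoustic feature frames; the published papers describe
# a template distance metric, not this exact algorithm or these constants.
import numpy as np

def dtw_distance(a, b):
    """Length-normalized DTW distance between two feature sequences
    (each an array of frames x coefficients)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

def pins_knocked_down(attempt, templates, good_dist=1.0, poor_dist=4.0):
    """Map the best distance to any stored template onto a 0-10 'pins' score.
    good_dist and poor_dist are hypothetical calibration constants that would
    be tuned so the score tracks human quality judgments."""
    d = min(dtw_distance(attempt, t) for t in templates)
    score = (poor_dist - d) / (poor_dist - good_dist)  # 1.0 good, 0.0 poor
    return int(round(10 * min(max(score, 0.0), 1.0)))

# Toy usage: random arrays stand in for MFCC frames of the client's four best
# productions (the templates) and a new attempt at the same word.
rng = np.random.default_rng(0)
templates = [rng.normal(size=(40, 12)) for _ in range(4)]
attempt = templates[0] + rng.normal(scale=0.3, size=(40, 12))
print(pins_knocked_down(attempt, templates))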
NATURE OF THE SECOND LANGUAGE PRONUNCIATION PROBLEM: HOW NATIVE LANGUAGE PHONOLOGY AND PHONETICS INTERFERE

Phonemes Versus Phonetic Features
A key distinction in our discussion is between phonology and phonetics. Whereas articulation of phonemes is called segmental and use of prosody is suprasegmental (involving intonation and stress), phonetic features might be called subsegmental—that is, they concern articulation differences within the phoneme (or within-segment deviations), such as the time of voice onset in the stops /b/ and /d/, discussed below. Both levels, phonological and phonetic, play a role in interference from the native language to the language being learned.
Sources of Interference
Recent research has shown that the difficulty adult learners have in mastering the phonological and phonetic systems of a new language occurs in both the speech production and speech perception domains (Strange, 1995). In a general way this difficulty is caused by differences in structure of the phonological and phonetic systems between the target and the native languages. However, the manner in which the structure of a first language (L1) may interfere with learning the sound system of a second language (L2) turns out to be complex.
PHONOLOGICAL INTERFERENCE
Certain types of L1 interference in L2 phonological learning are reasonably obvious and might even be predicted a priori from knowledge of the phonological structures of the two languages. For example, it is not very surprising that Chinese learners of English have difficulty producing and perceiving English syllable-final consonants and consonant clusters, because few such sequences occur in (the major dialects of) Chinese (Kenworthy, 1987; Flege & Wang, 1989; Anderson, 1983). Nor is it very surprising that native Spanish speakers of English tend to transfer Spanish phonological rules to English words inappropriately. One such Spanish phonological rule realizes voiced stop consonants (e.g., the /d/ sound in “dog”) as fricatives when they occur between vowels (cada ‘each’). English words such as “ladder” thus tend to be pronounced “lather” by these speakers (Flege & Davidian, 1984).

American English learners of Spanish have a similar kind of interference from the English rule that “flaps” intervocalic apical stops (i.e., /t, d/) in English (as in “ladder”). When learners inappropriately apply that rule in speaking Spanish, they produce a sound like the Spanish tapped /r/. As a result, American English pronunciations of, for example, Spanish cada (phonetically /kada/) sound like cara ‘face’ (phonetically /kara/) to Spanish-speaking listeners.
PHONETIC INTERFERENCE
It is less obvious that phonetic similarity can also contribute to the difficulty of L2 learning. Phonetic distinctions among vowels may be reflected in the acoustic signal by small differences in formants. Formants are resonance or vibration bands in the frequency spectrum that determine the quality of a vowel. Flege (1987) demonstrated that American English learners of French were more accurate in terms of formant frequency values in their productions of the French high front rounded vowel /y/ (as in tu ‘you’), which does not occur in English, than they were in their productions of French /u/ (as in tous ‘all’), which has a close but not identical counterpart in English (the /u/ in “to”). He hypothesizes that native English speakers fail to learn to produce the phone that is closer in formant space to an English phone due to “equivalence classification.” That is, the /u/ of English is perceptually close enough to French /u/ that learners use the English phone rather than learn a new sound. The notorious difficulty Japanese learners of English have with the /r, l/ distinction may also be due to the existence of similar sounds in Japanese. In this case the two sounds are allophonic variants of a single phoneme in Japanese, rather than separate phonemes as they are in English (Miyawaki et al., 1975).

Many of the characteristics of foreign-accented speech can be properly characterized only at the phonetic level of analysis, the level at which the acoustic cues to phonemic identity are produced and perceived (see Eskenazi, this issue). The acoustic-phonetic details of the encoding of the “same” phonological contrasts can vary greatly from language to language. The voicing contrast between Spanish and English in syllable-initial stops, such as /d/ in “dog,” provides an example. In this position the distinction between /p, t, k/ and /b, d, g/ is largely cued acoustically by differences in voice onset time (VOT) relative to the release of the stop closure (Lisker & Abramson, 1964). This is true in both Spanish and English, but the boundary between the two classes of sounds is very different in the two languages. English contrasts “long lag” voiceless stops with voiced stops that have short lag or short lead, while Spanish contrasts short lag voiceless and long lead voiced stops in this position.1 An English-accented Spanish /b/ may well be heard as /p/ by native Spanish listeners, and a Spanish-accented English /p/ can be confused with /b/ by native English listeners (e.g., the Spanish speaker’s “pat” heard as “bat”) (Williams, 1979). Fine-grained articulatory and perceptual patterns such as VOT tend to be transferred from L1 to L2 (Port & Mitleb, 1983). Because these patterns involve complex articulatory and perceptual habits, they can be very difficult to modify. It has been demonstrated that subsegmental deviations from native norms such as these play a role in native listeners’ perception of foreign accent (Flege, 1984; see also Eskenazi, this issue), and it is likely that they also have important consequences for the intelligibility of nonnative speech.

Rochet (1995) describes a case of phonological interference that illustrates the need for detailed language-specific analysis of speech production errors. The facts in the case are the more pointed for being, on the surface, quite counter-intuitive. English and Brazilian Portuguese each have two high vowels, /i/ and /u/. French has three: /i/, /u/, and the high front rounded /y/. English-speaking learners of French tend to substitute English /u/ for French /y/, whereas Brazilian Portuguese learners of French tend to substitute their /i/ for this novel phoneme. These facts pose something of a conundrum for those who would like to base predictions of learning difficulty strictly on phonological analyses. In terms of phonemes, English and Brazilian Portuguese have what appears to be the same difference with respect to French—namely, two high vowels as opposed to three. Only by examining differences in the category boundaries in the three languages—that is, phonetic differences—can the difference in perception be understood. Using synthetic stimuli that varied the second formant in a continuum from /i/ to /u/, Rochet (1995) showed that English and Portuguese speakers divide this perceptual space differently. Portuguese speakers identified more of the stimuli as /i/ than as /u/, and English speakers did the opposite. Stimuli in the middle of the continuum (which French speakers identified as /y/) were thus classified differently by English and Portuguese speakers. In summary, as Rochet (1995) points out, these examples underscore the need for detailed acoustic-phonetic analysis if the effects of a given L1 phonology on the learning of a given L2 are to be understood.
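To make the VOT discussion above concrete, the toy classifier below shows how a single acoustic value can fall into different phoneme categories depending on a language’s boundary. The boundary values are illustrative assumptions in the spirit of Lisker and Abramson (1964), not measurements reported in this article:

```python
# Illustrative only: the same voice onset time (VOT, in ms) maps to different
# phoneme categories in English and Spanish. The boundary values are rough
# assumptions, not measurements.
VOT_BOUNDARY_MS = {"english": 25.0, "spanish": 0.0}

def classify_bilabial_stop(vot_ms, language):
    """Return /p/ if the VOT falls on the voiceless side of the boundary."""
    return "/p/" if vot_ms > VOT_BOUNDARY_MS[language] else "/b/"

# A Spanish short-lag /p/ (VOT around +15 ms) sits on the voiced side of the
# English boundary, so English listeners may hear the speaker's "pat" as "bat."
print(classify_bilabial_stop(15.0, "spanish"))  # /p/
print(classify_bilabial_stop(15.0, "english"))  # /b/
```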
THE RELATION OF SPEECH PERCEPTION AND SPEECH PRODUCTION TRAINING

Perception Precedes Production?
The relation of speech production ability to speech perception ability in second language learning is not well understood at present. While second language teachers often assume that students must be able to perceive an L2 contrast before they can learn to produce it, this is not always the case. Goto (1971) and later Sheldon and Strange (1982) showed that some Japanese learners of English were able to produce the English /r, l/ contrast more reliably than they could perceive it.
Learning to Perceive
Speech perception training for second languages has been well studied in recent years. This research has yielded several interesting facts that have led us to include perception training as an important component of the Pronto system. Logan, Lively, and Pisoni (1991) showed that perception training with natural speech tokens of American English /l/ and /r/ in several phonological environments spoken by multiple talkers was effective in training Japanese learners to perceive this novel (and difficult) contrast. The subjects in this study showed not only improved identification (and lowered response latencies) for the words actually trained, but also generalization of training to new words containing these sounds, spoken by new talkers. The generalization of training effect in this study contrasts with the later finding of no generalization for subjects trained on a single talker (Lively, Logan, & Pisoni, 1993). Furthermore, Lively et al. (1994) showed that training of this sort can result in changes in adults’ second language perception that persist over time. Subjects trained using this paradigm who were retested after three months showed that they had retained their improved ability to correctly identify words containing these sounds.
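The structure of one trial in this high-variability paradigm can be sketched as follows. The word pairs, talker labels, and the stubbed playback, response, and feedback functions are all hypothetical stand-ins for a real training application with recorded audio:

```python
# A sketch of one identification trial in the style of Logan, Lively, and
# Pisoni (1991): minimal-pair tokens from many talkers, two-alternative
# identification, immediate feedback. Audio playback and the learner's
# response are stubbed out so the sketch runs standalone.
import random

PAIRS = [("rock", "lock"), ("read", "lead"), ("grass", "glass")]  # hypothetical
TALKERS = ["talker1", "talker2", "talker3", "talker4", "talker5"]

def run_trial(rng):
    pair = rng.choice(PAIRS)
    word = rng.choice(pair)        # the token actually presented
    talker = rng.choice(TALKERS)   # talker varies from trial to trial
    # play_audio(word, talker)     # stub: present the recorded token
    response = rng.choice(pair)    # stub: simulated learner response
    # show_feedback(response == word)  # stub: immediate right/wrong feedback
    return response == word

rng = random.Random(1)
hits = sum(run_trial(rng) for _ in range(200))
print(f"simulated identification accuracy: {hits / 200:.2f}")
```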
Perception as a Route to Production
Rochet (1995) cites a study showing that perception training can actually improve speech production skills (see also Bradlow et al., 1996). Native Mandarin Chinese-speaking subjects in this experiment were trained to modify their French VOT boundary for /b/ and /p/ toward native French values using synthetic speech stimuli in a “bu/pu” continuum. This perception training by itself was shown to improve the subjects’ correct productions of words containing these phonemes, as measured by native French listeners’ classification of the words in tests given before and after training. Subjects’ perception of the French category boundary was also improved, and this result generalized to /b/ and /p/ before different untrained vowels as well as to new voiced/voiceless consonant pairs /g, k/ and /t, d/ before /u/. Importantly, this training did not generalize to different word positions for these phonemes—for example, syllable-final or intervocalic positions of /b/ and /p/. This fact emphasizes the need to train L2 learners with words containing target contrasts in as many word positions as possible (Rochet, 1995; Lively, Logan, & Pisoni, 1993).

These facts underlie two principles that have guided the development of the Pronto training modules. First, segmental errors that are typical in speakers of one language when learning a second language are not predictable from theory (at least, not from any currently developed theory) and must be determined empirically for specific L1/L2 pairs (Munro, 1991). Second, speech perception training using native productions by multiple talkers should be combined with speech production drills to have the best chance of improving the intelligibility of learners’ speech.
AN OVERVIEW OF SEGMENTAL INTELLIGIBILITY TRAINING AND ASSESSMENT

Why to Train: Cognitive Costs of Listening to Accented Speech

A third principle guiding curriculum development in the Pronto system is that the effects on intelligibility of specific nonnative pronunciation errors should be established empirically. It is possible that “accented but perfectly intelligible speech” (a subjective rating by a trained Foreign Service Institute evaluator cited in Yule, 1990) does exist. It is certainly the case that there are many factors besides the pronunciation of individual speech segments that influence speech comprehension. Syntactic and semantic context (Morton, 1979), grammaticality and familiarity of topic (Gass & Varonis, 1984), and familiarity of accent and speaker (Brodkey, 1972) are additional factors that influence how well listeners understand speech. But human speech understanding cannot operate completely “top-down.” Errors in the sensory input that result in bottom-up processing errors or uncertainties for listeners will reduce the comprehensibility of an utterance (Marslen-Wilson, 1985). Experiments with synthetic speech have shown that even when that speech is highly intelligible, it requires more cognitive effort to process (as estimated by response latencies in word/nonword classification tasks) than does natural (native) speech (Pisoni, Nusbaum, & Greene, 1985). This finding suggests that even “perfectly intelligible” foreign-accented speech may require more processing time than native speech and may in fact be less intelligible in suboptimal listening conditions such as over the telephone, in the presence of background noise, or under conditions of high cognitive load.
What to Train: Estimating Effects of Specific Pronunciation Errors on Intelligibility
In addition to determining empirically the kinds of pronunciation errors typical of a given L1/L2 pair, the effect of these errors on the intelligibility of individual words and of utterances longer than a word should be established. For developing an efficient pronunciation training curriculum, it would also be useful to know which segmental errors are the most detrimental to overall intelligibility in the target language. While these issues have been discussed from a theoretical perspective in the past (e.g., Brown, 1988), they have not received the empirical study they deserve.

Rogers and Dalby (1996) studied the effects of segmental errors in Mandarin Chinese-accented English. Segmental errors were evaluated by collecting native English listeners’ responses to Mandarin-accented productions of isolated words in a forced-choice, two-alternative identification task using minimal pairs (such as “bead, bid”). In this procedure the interpretation of the identity of the error is unambiguous, even in the presence of other possible production errors in the same word (Weismer & Martin, 1992). Sentence and passage errors were measured by counting the words correctly reported in native listeners’ orthographic transcriptions. The study showed that segmental error scores obtained from read productions of isolated words predict errors in reading whole sentences and passages reasonably well (see also Rogers, 1997). Furthermore, Rogers and Dalby showed that certain segmental error types had a greater effect on connected speech intelligibility than did others. Errors in vowel tenseness (the distinction between English /i/ and /I/ in “bead, bid,” for example) affected intelligibility more than other vowel or consonant errors. Among consonant errors, errors in voicing (/p/ versus /b/ in “pat, bat”) degraded intelligibility more than other types.

The establishment of a link between segmental errors in isolated words and the intelligibility (or comprehensibility) of larger utterances is important because it validates the use of minimal contrast drill (also called minimal pairs) in pronunciation training and makes it easier to evaluate the effectiveness of that training. Intelligibility training based on minimally contrasting words is typical of the speech training provided to hearing-impaired and normal-hearing misarticulating children by speech-language pathologists (Kewley-Port et al., 1991). It is also widely used in second language instruction (Kenworthy, 1987; see also Wachowicz & Scott, this issue) and has even proved effective in speech perception training (Logan et al., 1991).
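The forced-choice scoring procedure of Rogers and Dalby (1996) lends itself to a simple tabulation: each listener response either matches the word the talker intended or identifies the error. A minimal sketch, with invented response records:

```python
# A sketch of the forced-choice tabulation: each record is (contrast, word the
# talker intended, word the listener chose). The records are invented.
from collections import defaultdict

responses = [
    ("/i/-/I/", "bead", "bid"), ("/i/-/I/", "bead", "bead"),
    ("/i/-/I/", "bid", "bead"), ("/p/-/b/", "pat", "bat"),
    ("/p/-/b/", "pat", "pat"),  ("/p/-/b/", "bat", "bat"),
]

totals, errors = defaultdict(int), defaultdict(int)
for contrast, intended, heard in responses:
    totals[contrast] += 1
    if heard != intended:
        errors[contrast] += 1  # the listener heard the other pair member

for contrast in totals:
    print(f"{contrast}: {errors[contrast]}/{totals[contrast]} misidentified")
```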
How to Train: Effectiveness of Minimal Pair Training Using ASR
Rogers, Dalby, and DeVane (1994) established that speech drill using minimal pairs can result in improved productions of English words by native Mandarin Chinese speakers. They also showed that this kind of drill was effective when conducted using ASR technology. The computer-based training used by Rogers et al. employed ISTRA’s template-based, speaker-dependent word recognizer. While ISTRA was not developed for second language training, Rogers et al. (1994) used it in this preliminary study to determine whether feedback derived from a speech recognition score could improve the intelligibility of L2 speech. Pre- and posttest intelligibility ratings by a jury of native listeners showed that the training was effective. Not only did both vowel (the /i/ vs. /I/ contrast) and consonant (/th/ vs. /s/) productions improve, but the study also showed that this ASR-trained intelligibility improvement generalized to untrained words containing these contrasts. Though modest in scope, this study appears to be one of the few to show experimentally that speech production skills can improve with training that uses ASR technology. The importance of such studies in establishing the validity of this kind of training cannot be overstated.
PRONUNCIATION TRAINING IN THE PRONTO SYSTEM

Methods for Developing Curricula for Segmental Intelligibility Training
The following sections describe the methods used for developing pronunciation training modules for specific L1/L2 pairs for Pronto. We believe the development path described here is unique in that it relies on linguistic analysis to establish, to the greatest extent possible, the specifications of a suitable ASR technology. This contrasts with approaches in which the capabilities of the selected speech recognizer determine what will be trained.

Step 1. Creating the Segmental Inventory Test: The first step in the empirical discovery of L2 pronunciation difficulties for modules of the Pronto system is to create word lists that contain all the vowel, diphthong, and consonant phonemes of the target language in as many syllable positions as possible. Ideally this would be an exhaustive inventory. With a language like Spanish it is practical to approximate this comprehensive coverage more closely than it is for English, since the number of initial, medial, and final consonant clusters in Spanish is relatively small. With American English as the target language, however, it is impractical to include all the medial clusters of the language in the list: American English has 67 possible syllable-initial clusters, about 173 possible syllable-final clusters, and very many more possible word-medial clusters. The necessary economy of sampling medial clusters is not ideal, but it is at least defensible because every medial cluster in English is composed of a possible syllable-final cluster followed by a possible syllable-initial cluster. To date we have versions of a Segmental Inventory Test for General American English and for Spanish. The American English list contains about 360 words and the Spanish list contains about 230. Both lists also contain several polysyllabic words designed to elicit typical errors in stress placement.

Step 2. Analyzing Errors from the Segmental Inventory Test: The second development step for each training module is to perform an error analysis of accented speech. Digital audio recordings are collected of native L1 speakers reading the L2 Segmental Inventory Test. The speakers are selected to represent different levels of ability in L2. Phoneticians then carefully transcribe these recordings, using a consensus method among multiple transcribers. The errors produced in the readings are collated and counted, yielding a subset of the Segmental Inventory Test containing the words in which segmental errors were actually made by the talkers. To help ensure that the errors are representative of the group of speakers, errors that occur only once are eliminated. The error analysis process is labor intensive, but one can be reasonably confident that it yields an error list that is typical of segmental pronunciation difficulties for a specific group of speakers. Error analysis of Mandarin Chinese-accented English yielded an inventory of 45 consonantal and 13 vocalic errors (Rogers, 1997). Analysis of Spanish-accented English produced 60 consonantal and 19 vocalic errors, while an analysis of English-accented Spanish showed 54 consonantal errors and 55 vocalic errors. In addition to segmental errors, word-level stress placement errors are also transcribed and tabulated.
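The collation in Step 2 amounts to counting error tokens across speakers and discarding one-off errors. A minimal sketch, with invented transcription records:

```python
# A sketch of the Step 2 collation: transcribed error tokens are counted
# across all speakers and one-off errors are dropped. Records are invented.
from collections import Counter

# Each record: (target phoneme, produced phoneme, environment).
transcribed = [
    ("i", "I", "initial"), ("i", "I", "initial"),
    ("d", "r", "intervocalic"), ("d", "r", "intervocalic"),
    ("d", "r", "intervocalic"),
    ("th", "s", "initial"),  # occurs only once, so it is dropped below
]

counts = Counter(transcribed)
inventory = {err: n for err, n in counts.items() if n > 1}
for (target, produced, env), n in sorted(inventory.items()):
    print(f"/{target}/ -> /{produced}/ ({env}): {n} tokens")
```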
Step 3. Developing the Pronunciation Training Sequence: Motivated by the documented success of phonological contrast training, lists of minimal pairs of words are created containing the segmental contrasts derived from the error analysis. To speed this process, software has been developed to search machine-readable dictionaries that include phonemic transcriptions. This software allows the user to specify a contrast and a phonological environment; it then searches the lexicon exhaustively for word pairs containing the contrast (a sketch of such a search appears below, after Step 6). In the English/Spanish module, for example, one error from the Segmental Inventory Test by native speakers of English was pronouncing intervocalic Spanish /d/ as Spanish /r/, based on interference from the English rule that flaps /d/ between vowels, discussed earlier. The rule defining that error would be input as follows: V d V —> V r V (i.e., /d/ is pronounced /r/ between vowels). Given this rule, the program produces (eventually) a list of Spanish word pairs such as acabada/acabara ‘finished/it finished,’ afectada/afectara ‘affected/it affected,’ and so on. For the above rule, the program produced 363 pairs, to be used for training American speakers away from applying the flapping rule to Spanish words. The software is currently not sensitive to word frequency and sometimes produces pairs containing extremely infrequent words, which have to be eliminated by hand by native speakers of the target language.

Step 4. Ranking the Importance of Errors: The next step is to determine the importance of each error in the module’s error list, as estimated by phoneme frequency and the relative number of minimal pairs representing the error found in the lexicon. A formula provides a measure of importance that is used in editing the lists of minimal pairs to derive a set of training words for the speech production and perception drills. As a result of this editing, the minimal pairs represent the errors in the error list in numbers roughly proportional to each error’s estimated importance. Assuming that students will have equal difficulty in learning each of the contrasts, this distribution of training pairs allocates more training time to the more important contrasts.

Step 5. Recording Native Speakers: The edited error lists are given to native speakers to read and record. The digitized waveforms are used in perception training as well as in training and evaluating the speech recognizer used in the Pronto system.

Step 6. Ordering by Complexity: While the Pronto system does not rank single contrasts relative to other single contrasts in terms of expected learning difficulty, it does assume that words containing more than one error environment are more difficult for learners to produce and perceive than those containing a single error environment. Thus, early training is conducted using word pairs with only one error environment. Words containing multiple environments are introduced only after improvement has been shown for each of the individual errors.
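As promised under Step 3, here is a sketch of the rule-driven minimal-pair search. The toy lexicon, its phonemic notation, the encoding of the V d V —> V r V rule as a regular expression, and the rank_importance weight for Step 4 are all assumptions; the paper publishes neither its dictionary format nor its importance formula:

```python
# A sketch of the Step 3 search for minimal pairs under a contrast rule such
# as "V d V -> V r V". Lexicon, notation, and rank_importance are assumptions.
import re

LEXICON = {  # orthographic word -> phonemic transcription (toy Spanish entries)
    "acabada": "akabada", "acabara": "akabara",
    "afectada": "afektada", "afectara": "afektara",
    "cada": "kada", "cara": "kara",
}
VOWELS = "aeiou"

def minimal_pairs(target, error):
    """Yield word pairs whose transcriptions differ only by substituting the
    error phone for the target phone in intervocalic position."""
    rule = re.compile(f"(?<=[{VOWELS}]){target}(?=[{VOWELS}])")
    by_phones = {phones: word for word, phones in LEXICON.items()}
    for word, phones in LEXICON.items():
        substituted = rule.sub(error, phones)
        if substituted != phones and substituted in by_phones:
            yield word, by_phones[substituted]

print(list(minimal_pairs("d", "r")))
# -> [('acabada', 'acabara'), ('afectada', 'afectara'), ('cada', 'cara')]

def rank_importance(phoneme_frequency, n_pairs_for_error, n_pairs_total):
    """One plausible Step 4 weight: phoneme frequency scaled by how well the
    error is represented among the module's minimal pairs (an assumption)."""
    return phoneme_frequency * (n_pairs_for_error / n_pairs_total)
```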
Learner Tasks, Feedback, and Adaptive Sequencing in Pronto

Not all typical L2 errors present equal learning difficulty. However, aside from such notable errors as English /l, r/ for Japanese and speakers of certain other Asian languages, or the difficulty native French speakers have with English /dh/ and /th/, not enough is currently known to anticipate relative learning difficulties with confidence in curriculum design. In the Pronto system we address this uncertainty with a system of scoring designed to adapt automatically to differences in learning rate for the contrasts in each module. This is done by keeping a record of student performance for each contrast on each of three tasks:

1) Word identification—Students listen to words from multiple talkers presented aurally and indicate which word they hear via mouse or keyboard;

2) Word imitation—Students repeat words presented aurally, and their responses are evaluated by the recognizer;

3) Word production—Students respond to visually presented prompts by speaking the word, with no immediate auditory model.

The system evaluates and records performance on each of these tasks continuously. Feedback is displayed to student and instructor in a barchart, illustrated in Figure 2, which summarizes skill level by contrast pair and by task (perception or production).
Figure 2
View of the Pronto System User Interface

Note: Phonological contrast (minimal) pairs are listed in order of importance from top to bottom. Light outer portions of the horizontal bars show the student’s current skill level in speech perception (left) and speech production (right). The dark inner portions of the bars indicate the student’s current “intelligibility gap” for these skills. (Graphic design by Roy Sillings.)

In addition to keeping current scores for each task for each contrast in the curriculum, the system keeps a global score that is a weighted sum of the scores by task. By maintaining this global training score, the system can adapt automatically to differential learning difficulty for the various contrasts. This unique aspect of Pronto training provides a mechanism derived from linguistic principles for steering the student efficiently through the curriculum. Users may choose which contrast and which types of drills they wish to practice, but their overall intelligibility profile will improve more if they show improvement on the phonetic contrasts that are more highly valued by the global training score.
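The adaptive mechanism can be sketched as follows. The task weights, the importance values, and the policy of steering practice toward the largest weighted skill gap are hypothetical; the article describes the global score only as a weighted sum of scores by task:

```python
# A minimal sketch of Pronto-style adaptive sequencing. TASK_WEIGHTS,
# IMPORTANCE, and the drill-selection policy are hypothetical.
TASK_WEIGHTS = {"identification": 0.3, "imitation": 0.3, "production": 0.4}
IMPORTANCE = {"/i/-/I/": 0.5, "/p/-/b/": 0.3, "/th/-/s/": 0.2}

# scores[contrast][task] = running proportion correct, updated after each drill.
scores = {c: {t: 0.0 for t in TASK_WEIGHTS} for c in IMPORTANCE}

def global_score():
    """Importance-weighted sum of each contrast's task-weighted skill."""
    return sum(IMPORTANCE[c] * sum(TASK_WEIGHTS[t] * s for t, s in by_task.items())
               for c, by_task in scores.items())

def next_drill():
    """Steer practice to the contrast/task with the largest weighted skill
    gap, i.e. where the 'intelligibility gap' of Figure 2 is widest."""
    return max(((c, t) for c in IMPORTANCE for t in TASK_WEIGHTS),
               key=lambda ct: IMPORTANCE[ct[0]] * TASK_WEIGHTS[ct[1]]
               * (1.0 - scores[ct[0]][ct[1]]))

scores["/p/-/b/"]["production"] = 0.9   # learner improves on one drill type
print(next_drill())                     # -> ('/i/-/I/', 'production')
print(f"global training score: {global_score():.2f}")
```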
SPEECH RECOGNIZER EVALUATION: WHICH TYPE FOR WHICH PURPOSE?

A key thrust of the ISTRA program has been to identify a speech recognition engine that can be configured so that its output, a recognition score, provides a measure of speech quality that is useful as feedback to the learner. To be most useful in intelligibility training of the sort described here, the speech recognizer should meet two requirements:

1) It should be capable of highly accurate recognition of word pairs containing the target phoneme contrasts when spoken by native speakers. This establishes the baseline recognition accuracy of the system. If the recognizer cannot reliably discriminate between “thick” and “sick,” for example, when these words are produced by native speakers, it certainly will not be able to do so when presented with speech that is accented or disordered in some other way.

2) It should produce recognition scores that can be used to provide valid evaluative feedback for speech training drills.

In this section we will discuss methods that we and our co-investigators have developed for assessing how well a proposed recognition technology meets these two requirements of identification accuracy and evaluative validity (Anderson & Kewley-Port, 1995; Watson et al., 1989).
HMM Versus Template-Based Recognizers
There are two main classes of recognizers: (a) Hidden Markov Model (HMM) systems, based on nondeterministic stochastic modeling, and (b) template-based systems, which perform pattern matching using dynamic programming or other time normalization techniques. HMM systems underlie many of the language tutors described in this volume (e.g., the Entropic HTK recognizer used in Subarashii, the SRI Nuance recognizer used in VILTS, and the Carnegie Mellon University Sphinx recognizer used in work reported by Aist and Mostow and by Eskenazi). These are commonly used to support continuous speech recognition. Template-based systems include the well known Scott Instruments Model SIR, no longer marketed but key to early ASR-based language tutors (e.g., Auralog’s earlier AuraLang products; see also LaRocca, 1994). A current example is Motorola’s Clamor speech recognizer. These are commonly used to support discrete speech recognition. Whereas research in speech recognition now tends to rely on the stochastic modeling of HMMs, earlier successes with pattern-matching systems have shown them to be useful for speech training aids, as detailed below.
Tests of Speech Recognizers on Minimal Pairs
Many commercially available recognition engines in both classes have undergone extensive testing and can boast impressively high recognition accuracy for typical voice input tasks. However, the phonetic discriminations these tasks require are typically not as challenging as those of intelligibility training employing minimally contrasting word pairs. Using tests with minimal pairs, we have conducted our own evaluations of various commercial and experimental systems representing both classes of technology. HMM recognizers used in past evaluations include DragonWriter from Dragon Systems (available from their Voice Tools toolkit product) and VoiceType Application Factory from IBM. Template-based recognizers have included Micro IntroVoice from Voice Connexion and the Scott Instruments Model SIR, later implemented on the Aria chip set from Sierra Semiconductor.
Tests in ISTRA
METHOD: In the course of evaluating three speaker-dependent recognizers as candidates for use in the ISTRA system, Anderson and Kewley-Port (1995) created a database of normal and misarticulated English speech. The normal speech was from adult speakers and included minimal pairs containing the twenty-five most common substitution errors of misarticulating children. The database contained multiple repetitions of word pairs such as “some/thumb, red/wed, then/den, thin/fin,” and so on. Tokens of disordered speech were collected from children undergoing speech therapy. Misarticulated tokens were recorded at the beginning of their training, and improved productions were elicited at the end. The basic recognition accuracy of one HMM and two template-based recognizers was evaluated using twenty test tokens per contrast from normal adult speech. All recognizers were optimized for performance prior to testing.

RESULTS—ACCURACY OF WORD IDENTIFICATION: For this data set, the HMM recognizer had the best overall accuracy with 90% correct. The better of the two template recognizers was correct on 86% of the trials, while the second was correct on only 79%. The HMM recognizer did not perform better on all the contrasts in the database, however. On the manner of articulation distinction in the pair “shore/chore,” and the manner and place contrast in the pair “gem/them,” for example, the HMM performed worse than at least one of the template recognizers, and the pattern of errors was also very different between the two template-based systems. Since the methods of signal processing and evaluation used in each of the systems differ greatly, this result is not surprising. However, it does reveal how important it is to specify the details of the recognition task when evaluating a recognizer. These data showed that all three recognizers had strengths and weaknesses that are not obvious from overall recognition accuracy alone.

RESULTS—VALIDITY OF EVALUATIVE SCORES: Speech from misarticulating children was used by Anderson and Kewley-Port (1995) to evaluate the recognizers for their capability to distinguish between tokens of words rated as “correct” versus “incorrect” by trained human listeners, as well as for their potential in deriving measures of evaluative feedback to be used in speech drill. Again, the HMM recognizer was better than the template recognizers in distinguishing between the two categories of tokens. However, it was much worse than either template recognizer at providing an evaluative score for samples of disordered speech. In this test a jury of five trained listeners rated multiple tokens of words from misarticulating children on a seven-point scale, where one equals “very poor articulation” and seven equals “normal.” The recognizers were trained using the three best of these productions, and the recognition scores for the other tokens in the set were recorded. The recognition scores from the three recognizers were then correlated with the averaged listener ratings. This analysis showed that the correlations for the template-based recognizers were quite good, quite similar in fact to interrater correlations, but that the “confidence” score returned by the HMM recognizer was not. Subsequent study of the more standard log likelihood ratios of another HMM recognizer also showed low correlation with human ratings.

In summary, prior research in the ISTRA program has found HMM-based ASR more accurate for identifying which word was said and template-based ASR better for measuring quality of pronunciation, based on its distance metric. In addition, the two classes of ASR differ in terms of which contrasts each handles better. To date, the decision in ISTRA has been to rely on template-matching algorithms.
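The validity analysis reduces to correlating each recognizer’s scores with the jury’s mean ratings. A minimal sketch, with invented numbers chosen to mimic the reported pattern (template-derived scores tracking listener ratings, HMM confidence scores not):

```python
# A sketch of the validity test in Anderson and Kewley-Port (1995): correlate
# recognizer scores for a set of tokens with the mean of five listeners'
# seven-point quality ratings. All numbers below are invented for illustration.
import numpy as np

mean_listener_rating = np.array([1.8, 2.4, 3.0, 3.6, 4.2, 5.0, 5.8, 6.4])

recognizer_scores = {
    # A score derived from a template distance that tracks quality (hypothetical)...
    "template": np.array([0.21, 0.30, 0.38, 0.47, 0.55, 0.66, 0.74, 0.85]),
    # ...and an HMM "confidence" score that does not (also hypothetical).
    "hmm": np.array([0.52, 0.49, 0.61, 0.40, 0.58, 0.45, 0.63, 0.50]),
}

for name, s in recognizer_scores.items():
    r = np.corrcoef(s, mean_listener_rating)[0, 1]
    print(f"{name}: r = {r:+.2f} against mean listener rating")
```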
Tests in the Pronto System
Recognizer evaluation for the Pronto system has been conducted following procedures similar to those in Anderson and Kewley-Port (1995). A database of digitized speech was collected from fourteen male and fourteen female speakers of American English for the minimal pair drill derived from the error analysis of Mandarin-accented English. Baseline speaker-independent recognition rates on a subset of these data show a similar pattern of relative performance for a template-based recognizer and an HMM recognizer. While the overall recognition accuracy for the HMM recognizer is higher, its scores are lower than those of the template recognizer for vowel and nasal contrasts. To date, neither recognizer has produced evaluative scores with acceptably high correlations with human judgments of speech quality when tested in speaker-independent mode. While research continues on deriving a valid intelligibility rating measure from a speaker-independent recognition technology, the first version of the Pronto system has been implemented using the HMM recognizer. Because pronunciation training is conducted using the minimal pairs target/error paradigm, with both templates active at the time of recognition, the dichotomous (“hit/miss”) feedback of the system is valid. Experiments are planned to assess the effectiveness of Pronto training in improving the intelligibility of learners speaking second-language words.
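The dichotomous feedback paradigm can be sketched as a two-alternative forced choice between models of the target word and the typical-error word. The scoring function below is a stand-in; a real system would compare HMM likelihoods or template distances:

```python
# A sketch of Pronto's dichotomous minimal-pair feedback: both the target word
# and the typical-error word are active at recognition time, and the learner
# scores a "hit" when the recognizer prefers the target.
import numpy as np

rng = np.random.default_rng(2)

def model_score(utterance, word_model):
    """Stand-in scorer: higher is better (negative mean squared distance)."""
    return -float(np.mean((utterance - word_model) ** 2))

target_model = rng.normal(size=(30, 12))  # hypothetical model for "bead"
error_model = rng.normal(size=(30, 12))   # hypothetical model for "bid"
attempt = target_model + rng.normal(scale=0.5, size=(30, 12))

hit = model_score(attempt, target_model) > model_score(attempt, error_model)
print("hit" if hit else "miss")
```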
ACKNOWLEDGEMENT

The authors would like to thank William Mills for his contributions to the development of the ISTRA and Pronto systems. This work was supported by the National Institutes of Health’s National Institute on Deafness and Other Communication Disorders, SBIR grant DC02213, and by Army Research Institute contract DASW01-96-C-044 to Communication Disorders Technology, Bloomington, IN.

NOTE

1 While the phonemic boundary for English voiced/voiceless stops (/b/ vs. /p/) is at the “very short lag” position, variants of voiced stops with long lags occasionally occur in English.
REFERENCES

Anderson, J. I. (1983). The difficulties of English syllable structure for Chinese ESL learners. Language Learning and Communication, 2 (1), 53-61.

Anderson, S., & Kewley-Port, D. (1995). Evaluation of speech recognizers for speech training applications. IEEE Transactions on Speech and Audio Processing, 3 (4), 229-241.

Bradlow, A., Akahane-Yamada, R., Pisoni, D. B., & Tohkura, Y. (1996). Three converging tests of improvement in speech production after perceptual identification training on a non-native phonetic contrast. Journal of the Acoustical Society of America, 100 (4), Pt. 2, 2725 (A).

Brodkey, D. (1972). Dictation as a measure of mutual intelligibility: A pilot study. Language Learning, 22 (2), 203-217.

Brown, A. (1988). Functional load and the teaching of pronunciation. TESOL Quarterly, 22, 593-606.

Flege, J. E. (1984). The detection of French accent by American listeners. Journal of the Acoustical Society of America, 76, 692-707.

Flege, J. E. (1987). The production of “new” and “similar” phones in a foreign language: Evidence for the effect of equivalence classification. Journal of Phonetics, 15, 47-65.

Flege, J. E., & Davidian, R. D. (1984). Transfer and developmental processes in adult foreign language speech production. Applied Psycholinguistics, 5, 323-347.

Flege, J. E., & Wang, C. (1989). Native-language phonotactic constraints affect how well Chinese subjects perceive the word-final /t/-/d/ contrast. Journal of Phonetics, 17, 299-315.

Gass, S., & Varonis, E. M. (1984). The effect of familiarity on the comprehensibility of non-native speech. Language Learning, 34, 65-90.

Goto, H. (1971). Auditory perception by normal Japanese adults of the sounds “l” and “r.” Neuropsychologia, 9, 317-323.

Kenworthy, J. (1987). Teaching English pronunciation. New York: Longman.

Kewley-Port, D., Watson, C. S., Elbert, M., Maki, D., & Reed, D. (1991). The Indiana Speech Training Aid (ISTRA) II: Training curriculum and selected case studies. Clinical Linguistics and Phonetics, 5, 13-38.

LaRocca, S. (1994). Exploiting strengths and avoiding weaknesses in the use of speech recognition for language learning. CALICO Journal, 12 (1), 102-105.

Lisker, L., & Abramson, A. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20, 384-422.

Lively, S. E., Logan, J. S., & Pisoni, D. B. (1993). Training Japanese listeners to identify English /r/ and /l/ II: The role of phonetic environment and talker variability in learning new perceptual categories. Journal of the Acoustical Society of America, 94, 1242-1255.

Lively, S. E., Pisoni, D. B., Yamada, R. A., Tohkura, Y., & Yamada, T. (1994). Training Japanese listeners to identify English /r/ and /l/ III: Long-term retention of new phonetic categories. Journal of the Acoustical Society of America, 96, 2076-2087.

Logan, J. S., Lively, S. E., & Pisoni, D. B. (1991). Training Japanese listeners to identify English /r/ and /l/: A first report. Journal of the Acoustical Society of America, 89, 874-886.

Marslen-Wilson, W. D. (1985). Aspects of human speech understanding. In F. Fallside & W. A. Woods (Eds.), Computer speech processing. Englewood Cliffs, NJ: Prentice Hall.

Miyawaki, K., Strange, W., Verbrugge, R., Liberman, A., Jenkins, J., & Fujimura, O. (1975). An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of Japanese and English. Perception and Psychophysics, 18 (5), 331-340.

Morton, J. (1979). Word recognition structure and process. In J. Morton & J. Marshall (Eds.), Structure and process. Cambridge, MA: MIT Press.

Munro, M. (1991). Perception and production of English vowels by native speakers of Arabic (Doctoral dissertation, University of Alberta, 1991).

Pisoni, D. B., Nusbaum, H., & Greene, B. (1985). Perception of synthetic speech generated by rule. Proceedings of the IEEE, 73, 1665-1676.

Port, R., & Mitleb, F. (1983). Segmental features and implementation in acquisition of English by Arabic speakers. Journal of Phonetics, 11, 219-229.

Rochet, B. L. (1995). Perception and production of second-language speech sounds by adults. In W. Strange (Ed.), Speech perception and linguistic experience. Timonium, MD: York Press.

Rogers, C. L. (1997). Segmental intelligibility assessment for Chinese-accented English (Doctoral dissertation, Indiana University, 1997).

Rogers, C. L., & Dalby, J. M. (1996). Prediction of foreign-accented speech intelligibility from segmental contrast measures. Journal of the Acoustical Society of America, 100 (4), Pt. 2, 2725 (A).

Rogers, C. L., Dalby, J. M., & DeVane, G. (1994). Intelligibility training for foreign-accented speech: A preliminary study. Journal of the Acoustical Society of America, 96 (5), Pt. 2, 3348 (A).

Sheldon, A., & Strange, W. (1982). The acquisition of /r/ and /l/ by Japanese learners of English: Evidence that speech production can precede speech perception. Applied Psycholinguistics, 3, 243-261.

Strange, W. (1995). Cross-language studies of speech perception: A historical review. In W. Strange (Ed.), Speech perception and linguistic experience. Timonium, MD: York Press.

Watson, C. S., Reed, D., Kewley-Port, D., & Maki, D. (1989). The Indiana Speech Training Aid (ISTRA) I: Comparisons between human and computer-based evaluation of speech quality. Journal of Speech and Hearing Research, 32, 245-251.

Weismer, G., & Martin, R. (1992). Acoustic and perceptual approaches to the study of intelligibility. In R. D. Kent (Ed.), Intelligibility in speech disorders: Theory, measurement and management. Amsterdam: John Benjamins.

Williams, L. (1979). The modification of speech perception and production in second-language learning. Perception and Psychophysics, 26 (2), 95-104.

Yule, G. (1990). The spoken language. Annual Review of Applied Linguistics, 10, 163-172.
AUTHORS’ BIODATA

Jonathan Dalby is Senior Scientist at Communication Disorders Technology (CDT), Inc., where he conducts research and development of speech training systems that employ automatic speech recognition technology. Before joining CDT, he served as Research Associate at the Centre for Speech Technology Research at the University of Edinburgh, Scotland, and participated in the development of a large-vocabulary continuous speech recognition system. Previously, he studied speech production and speech perception in the Phonetics Laboratory at Indiana University, and he taught English as a Second Language for several years both overseas and in the United States. His Ph.D. in linguistics is from Indiana University.

Diane Kewley-Port is Associate Professor in the Department of Speech and Hearing Sciences at Indiana University as well as cofounder and Executive Vice President of Communication Disorders Technology, Inc. She has studied the use of automatic speech recognition in speech training systems since 1987. Previously, she conducted research in speech signal processing and speech perception for several years at Haskins Laboratories and at Bell Laboratories. A Fellow of the Acoustical Society of America, she is past associate editor of topics in speech processing and communication systems for the Journal of the Acoustical Society of America. Her Ph.D. in speech sciences is from City University of New York. She won the Edward Sapir Award for best dissertation in linguistics and, as a University of Michigan student, the Sarah Parker Memorial Award as outstanding woman engineer.
AUTHORS’ ADDRESSES

Jonathan Dalby
Communication Disorders Technology, Inc.
501 North Morton Street #215
Bloomington, IN 47404
Phone: 812/336-1766
E-Mail: [email protected]

Professor Diane Kewley-Port
Department of Speech and Hearing Sciences
Indiana University
Bloomington, IN 47405
Phone: 812/855-5103
E-Mail: [email protected]