AUTOMATICALLY GENERATED WORD PRONUNCIATIONS FROM PHONEME CLASSIFIER OUTPUT

Philipp Schmid, Ronald Cole, Mark Fanty
Center for Spoken Language Understanding
Oregon Graduate Institute of Science & Technology
Beaverton, Oregon 97006-1999 USA
ABSTRACT
We describe an automatic procedure for modeling alternate pronunciations of words produced by different talkers. The research compared recognition performance on forty city and state names using three different representations of each word. In the first case, the expected pronunciation(s) of each word was produced by an expert. In the second case, a dynamic programming algorithm was used to create a pronunciation network for each word by combining phonetic transcriptions from ten utterances of the word produced by human labelers. The third case was identical to the second, except that the phonetic labels were provided automatically by a phonetic recognition algorithm. On a test set of words produced by new speakers, equivalent recognition performance was observed for the pronunciation networks derived from human and machine labels, and both produced superior performance to that obtained with the pronunciations produced by the expert.
1. INTRODUCTION
There is considerable variation in the pronunciation of words caused by phonological effects such as palatalization ("whacha doing?"), social background, and other factors. This is a major problem in computer speech recognition. To recognize words produced by different speakers or by the same speaker in different contexts, the system must model alternate pronunciations.

Several researchers have applied phonological rules to phonemic baseform pronunciations to create word pronunciation networks that represent alternate pronunciations of a word. Rules, for example, may account for deletion of word-final stop consonants or flapping of word-medial /t/ between stressed and unstressed syllables. The use of rules to model pronunciation variability has led to performance improvements in the MIT SUMMIT system and the SRI DECIPHER system [1, 2]. In addition, SUMMIT uses automatic procedures to adjust the transition weights in the pronunciation networks. Riley and Ljolje [3] use a statistical procedure to determine the phonetic realization of phonemes in word baseforms (the "expected" pronunciation). Taking account of lexical stress and word boundary information, they generated statistics for phonemes in word baseforms from a phonetically labeled speech corpus (TIMIT). The estimates derived from TIMIT were then used to generate pronunciation networks from word baseforms in the DARPA Resource Management (RM) task. A significant improvement in recognition accuracy was obtained on the RM task using the pronunciation networks thus derived, relative to the baseform pronunciations.

The approaches described above model variability introduced by the speaker. It might also be helpful to model the variability introduced by the recognizer. Both humans and machines must cope with recognition errors at the phonological level. Phonetic classification is errorful, and becomes more errorful in conditions of noise or limited channel bandwidth. (These misperceptions often go unnoticed in fluent speech, since the perceptual system is able to use redundant information to correctly recognize the intended word.)

The approach described here models word-level pronunciation variability directly from machine-generated subword units, as proposed by Murveit et al. [4] and Rulot et al. [5]. This approach offers the potential to model variability contributed by both the speaker and the recognizer. For segment-based recognition systems, this requires modeling the insertions, deletions and substitutions produced by the phonetic recognizer.

We note that no amount of modeling can fully compensate for phonetic recognition errors in some situations. For example, when recognizing isolated English letters, misclassification of /b/ as /p/ will cause "B" to be misclassified as "P." There is, however, still benefit to be obtained from modeling the frequency of occurrence of these substitutions; if the recognizer often misclassifies /b/ as /p/ in the training set, "B" will receive a high score when /b/ is misclassified as /p/, since the word model for B will expect /b/-/p/ substitutions. Having reasonable scores for misrecognized phonemes is an important benefit of this approach, since speech systems must combine scores across segments during word recognition.

To test the viability of this approach, we present here an initial experiment in which word models are created automatically using machine-generated phonetic labels from up to ten examples each of forty different words. Starting with a dictionary pronunciation reduced to broad phonetic category labels, a dynamic programming algorithm incrementally builds a pronunciation network from the phonetic string produced by the recognizer by inserting the sequence of phonetic segments into the existing pronunciation network. We compare recognition performance on test utterances using pronunciation networks based on machine-generated phonetic labels, phonetic labels provided by humans, and a baseform pronunciation for each word provided by an expert.
2. THE ALGORITHM
The algorithm incrementally builds the pronunciation network for a given word from phonetic transcriptions produced by a phonetic classifier (see section 3 for details). The objective is to model all observed pronunciations while keeping the network as small as possible and free from cycles. A dynamic programming algorithm computes the best alignment of a string $\{s_j\}$ (a single pronunciation) with a graph $\{g_i\}$ (the pronunciation network). The following cost matrix is computed:

$$m_{i,j} = \min_{0 \le k \le M} \left\{ m_{k,j-1} + \mathrm{arc}(k,i) + \mathrm{sub}(i,j) \right\},$$

where $m_{i,j}$ represents the cost of aligning $s_j$ with graph node $g_i$, $M$ is the current number of states in the graph $\{g_i\}$, and

$$\mathrm{arc}(k,i) = \begin{cases} 0 & \text{if there is an arc from node $k$ to node $i$} \\ \infty & \text{if node $i$ is closer to the root than node $k$ (prevents cycles)} \\ 1 & \text{otherwise (an arc would have to be inserted)} \end{cases}$$

$$\mathrm{sub}(i,j) = \begin{cases} 0 & \text{if the label of vertex $g_i$ equals the label of $s_j$} \\ 1 & \text{if $g_i$ and $s_j$ are in the same broad phonetic category} \\ 2 & \text{if both are vowels or both are consonants} \\ 3 & \text{otherwise} \end{cases}$$

The cost of aligning $s_j$ is computed recursively from the cost of aligning all $s_r$ for $r < j$. There is a penalty for not following an existing arc or for creating a new node, so the graph will grow only when necessary. At each entry in the matrix, the value of $k$ that produced the minimal cost is stored in order to reconstruct the alignment path once the matrix is computed. Special measures (we use a distance criterion) have to be taken to assure that no cycles are introduced into the graph.

In practice we found it useful to initialize the graph with the broad phonetic structure of the word derived from a dictionary lookup. This helps preserve the broad phonetic structure during the alignment process. These initial broad phonetic nodes were removed once the word model had been built (so the recognizer does not recognize a word only on the basis of its broad phonetic structure). A count is kept for each arc of the number of traversals during graph building. These counts are used to generate transition probabilities in the final network. The minimum and maximum duration of each state is recorded and used to constrain the alignment during recognition.
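To make the alignment concrete, here is a minimal Python sketch of the cost computation, assuming a simple representation in which each node stores a label, its successor set, and its distance from the root. All names (align, arc, sub, BROAD_CLASS) and the handling of the first symbol are our illustrative assumptions, not the authors' implementation.

```python
import math

BROAD_CLASS = {  # hypothetical broad phonetic categories, for illustration
    "p": "stop", "b": "stop", "t": "stop", "d": "stop",
    "ah": "vowel", "ao": "vowel",
    "l": "liquid", "n": "nasal", "s": "fricative",
}
VOWELS = {"ah", "ao"}

def sub(label_i, label_j):
    """Substitution cost: 0 exact match, 1 same broad category,
    2 both vowels or both consonants, 3 otherwise."""
    if label_i == label_j:
        return 0
    if BROAD_CLASS.get(label_i) == BROAD_CLASS.get(label_j):
        return 1
    if (label_i in VOWELS) == (label_j in VOWELS):
        return 2
    return 3

def arc(k, i, succ, depth):
    """Arc cost: 0 if the arc k->i exists, infinite if node i lies closer
    to the root than node k (following it could close a cycle), 1 otherwise."""
    if i in succ[k]:
        return 0
    if depth[i] < depth[k]:
        return math.inf
    return 1

def align(symbols, labels, succ, depth):
    """Fill m[i][j], the cost of aligning symbol s_j with node g_i,
    and record backpointers for recovering the best path."""
    M, J = len(labels), len(symbols)
    m = [[math.inf] * J for _ in range(M)]
    back = [[None] * J for _ in range(M)]
    for j, s in enumerate(symbols):
        for i in range(M):
            if j == 0:
                # Simplifying assumption: the first symbol starts at the
                # first node for free, anywhere else for an arc penalty.
                m[i][j] = (0 if i == 0 else 1) + sub(labels[i], s)
                continue
            for k in range(M):
                cost = m[k][j - 1] + arc(k, i, succ, depth) + sub(labels[i], s)
                if cost < m[i][j]:
                    m[i][j], back[i][j] = cost, k
    return m, back

# Toy usage: a network holding one pronunciation p-ao-l-ah-n,
# aligned against a new machine transcription b-ao-l-ah-n.
labels = ["p", "ao", "l", "ah", "n"]
succ = {0: {1}, 1: {2}, 2: {3}, 3: {4}, 4: set()}
depth = {i: i for i in range(len(labels))}  # distance from the root
m, back = align(["b", "ao", "l", "ah", "n"], labels, succ, depth)
```

The graph-growth step is omitted for brevity: after backtracing, unmatched symbols would be spliced in as new nodes and arcs, arc traversal counts incremented, and state duration bounds updated, as described above.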
3. GENERATING MACHINE LABELS
In this section we describe the procedure for generating the phoneme sequence $\{s_i\}$ automatically using our general purpose phonetic frontend. To generate a pronunciation string for each word, a neural network first assigns a score for each of 39 phonemes to each 6 msec frame in the word. The details of the network are given in [6] in these proceedings. A Viterbi search then finds the best scoring sequence of broad phonetic category segments, subject to minimal and maximal duration constraints as well as ordering constraints (e.g., no two stop segments in a row). The broad category scores are obtained by summing the outputs of the individual phonetic categories; e.g., /b/ + /d/ + /g/ + /p/ + /t/ + /k/ = STOP. Each broad category interval is assigned the phonetic or broad category label with the highest average score. This procedure results in fewer spurious segments compared to straight phoneme recognition.

Browsing the generated pronunciation sequences, we found that in regions where the phonetic frontend was uncertain (almost equal scores for several phonemes), the label assigned for that region by the Viterbi search is more likely to be wrong. Since the frontend is uncertain, we reasoned that it might be better not to include the phoneme which happened to have the highest score in the pronunciation network. Since the segment cannot be skipped, we enhanced our generating mechanism to include broad phonetic labels in regions of uncertainty. The score for a given broad phonetic category $bc$ was scaled by the following rule:
$$bc' = f \cdot (1 - (\mathit{best} - \mathit{second})) \cdot bc, \qquad (1)$$
where best and second are the highest and second highest scores of the frame, and $bc'$ is the scaled score used by the search. If there is a large difference between the best and second best score, we conclude that the network is certain in its decision and therefore downscale the broad category score. If in turn both scores are close together, the scores for the broad categories are increased and thus favored by the search. The scaling factor $f$ is used to control the ratio between fine and broad phonetic labels.
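A minimal sketch of rule (1) in Python, assuming per-frame phoneme scores arrive as a dictionary; the grouping of phonemes into broad categories and all names here are illustrative assumptions rather than the authors' exact tables:

```python
def scale_broad_score(frame_scores, broad_members, f=1.0):
    """Apply rule (1): bc' = f * (1 - (best - second)) * bc.

    frame_scores  -- dict mapping phoneme label -> classifier score, one frame
    broad_members -- phonemes whose scores are summed into the broad category
    f             -- scaling factor trading off fine vs. broad labels
    """
    ranked = sorted(frame_scores.values(), reverse=True)
    best, second = ranked[0], ranked[1]
    bc = sum(frame_scores[p] for p in broad_members)  # e.g. /b/+/d/+/g/+... = STOP
    # Confident frame (large best-second gap) -> downscale the broad score;
    # uncertain frame (small gap) -> favor the broad category label.
    return f * (1.0 - (best - second)) * bc

# Toy usage: an uncertain frame where /p/ and /b/ nearly tie.
scores = {"p": 0.31, "b": 0.29, "t": 0.2, "ah": 0.1, "l": 0.1}
stop_score = scale_broad_score(scores, ["p", "b", "t"], f=1.0)
```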
Figure 1 shows the pronunciation network for "Portland" using 10 examples and setting the scaling factor to 1.0. The word-initial /l/ and the phoneme /m/ following the closure are examples of substitution errors generated by the phonetic frontend. It can be seen that the classifier provides several choices for the (second) unstressed syllable. The word-final segments /ah/, /n/, and /s/ correspond to the breath release following the word.

[Figure 1. Pronunciation network for the word "Portland" based on 10 machine-labeled examples.]
4. RECOGNITION
The recognition system used is described in [6] in these proceedings. A neural network (the same used to generate the pronunciations as described above) generates scores for each of 39 phonemes every 6 msec. A Viterbi search then finds the best-scoring word sequence (for this paper, the best-scoring word) by trying all pronunciations of every word as specified by the pronunciation networks. Every state of a word model corresponds to an output (phoneme) of the neural network. The score for a word is the product, over frames, of the output scores of the aligned phonemes. In addition, every state transition has a probability as determined by the counts collected when the pronunciation network was built. The states are constrained to have durations between the minimum and maximum seen during construction of the pronunciation network.
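A simplified Python sketch of this scoring, restricted to one linear state sequence rather than the full network search; the log-domain formulation and all names are our assumptions, not the authors' implementation:

```python
import math

def score_word(frame_scores, states, min_dur, max_dur, trans_prob):
    """Best log score aligning all frames to one state sequence, with
    per-state duration bounds and transition probabilities.

    frame_scores -- list of dicts, per-frame score for each phoneme label
    states       -- phoneme label of each word-model state, in order
    min_dur, max_dur -- per-state duration bounds in frames
    trans_prob   -- probability of the transition into each state
    """
    T, S = len(frame_scores), len(states)
    NEG = -math.inf
    # best[s][t]: best log score of states 0..s covering frames 0..t-1
    best = [[NEG] * (T + 1) for _ in range(S)]
    for s in range(S):
        for t in range(1, T + 1):
            for d in range(min_dur[s], max_dur[s] + 1):
                start = t - d
                if start < 0 or (s == 0 and start != 0):
                    continue  # the first state must begin at frame 0
                prev = 0.0 if s == 0 else best[s - 1][start]
                if prev == NEG:
                    continue
                emit = sum(math.log(frame_scores[u][states[s]])
                           for u in range(start, t))
                best[s][t] = max(best[s][t],
                                 prev + math.log(trans_prob[s]) + emit)
    return best[S - 1][T]  # the last state must consume the final frame

# Toy usage: a two-state model /p/-/ah/ scored over five 6-msec frames.
frames = [{"p": 0.8, "ah": 0.1}, {"p": 0.7, "ah": 0.2},
          {"p": 0.2, "ah": 0.7}, {"p": 0.1, "ah": 0.8}, {"p": 0.1, "ah": 0.8}]
logscore = score_word(frames, ["p", "ah"], [1, 1], [3, 4], [1.0, 0.9])
```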
5. EXPERIMENTS
The speech data used in the experiments consisted of utterances of forty common city and state names taken from the telephone speech corpus described in [7]. (The corpus is available for a nominal charge to university researchers; contact the second author.) These names were chosen so there would be sufficient training and test data. The utterances were given by callers in response to questions about their home town and the city they were calling from. Every utterance in the training, development and final test sets is from a different speaker. For the work reported here, the speech corresponding to the target name was extracted by hand when the original response had more than one word.

The training set consisted of ten utterances of each of the 40 names. The development test set consisted of four additional utterances of each name. The final test set consisted of between 1 and 7 additional utterances of each name, for a total of 204 words. Preliminary experiments were performed with the training and development test sets to determine the best value of the scaling factor f in equation 1, and the number of utterances to be used to produce the pronunciation networks.

The phonetic frontend was trained on a large number of city, state and surnames from the same speech corpus. There was a partial overlap in the training sets used to train the phonetic frontend and the pronunciation networks. We compared three pronunciation models for each word: (a) a model produced by an expert; (b) a model generated automatically from the ten utterances of each word in the training set, using hand labels provided by humans for each utterance; and (c) a model generated automatically from the ten utterances of each word in the training set, using machine-generated labels, as described above. The pronunciation model produced by the expert contained the expected phonemic pronunciation of the word, common alternate pronunciations (such as word-final stop deletion and flapping), and syllable stress. The phonetic labels were provided by full-time speech corpus development staff at the Center for Spoken Language Understanding. The machine labels were generated from the ten utterances of each word in the training set, as described above.

The recognition system described in [6] was used throughout the experiments. On the 160 words in the development test set, the system recognized 85.0% of the words using the models generated by an expert, and 87.5% using the models generated from the hand labeled utterances. To build word models based on automatically generated pronunciations (machine labels), we first examined the influence of the scaling factor f in equation 1 on the recognition accuracy for the development test set. The results are shown in table 1. Using broad categories in the final pronunciation networks (f = 0.1) resulted in a slight improvement on the development test.
Factor     1.0     0.5     0.3     0.1     0.0
Accuracy   81.25   86.88   88.13   90.0    88.13

Table 1. Recognition performance on the development test set

Next we examined performance on the development test set as a function of the number of example utterances used to build the word models. As can be seen from table 2, using more examples per word (up to 10, anyway) increases the recognition accuracy.
Examples per word   4      6      8      10
Accuracy            86.9   88.1   89.4   90.0
Table 2. Recognition performance on the development test set

Table 3 shows the average number of states and the branching factor for each of the three conditions. As might be expected, the size of the network increases from the human-entered pronunciations to those generated automatically from hand labels, and again to those generated automatically from machine labels.
                 Baseline   Hand labels   Machine labels
No. States       7.09       12.35         24.05
Branching Fac.   1.09       1.10          1.72
Table 3. Structure of the pronunciation nets showing average number of states and branching factor for the three conditions

The final test set consisted of 204 utterances. Each word occurred between one and seven times. As can be seen from table 4, the performance of the automatically generated word models remains about the same as for the development test set, where each word occurred four times. In contrast, the performance of the baseline system drops to 79%. The McNemar test indicates that the observed performance difference between the systems where the pronunciation models were generated automatically (hand labels or machine-generated labels) and the baseline system is significant for the final test set (with a 1% chance of error). The difference between systems using networks derived from hand labels and those using machine labels is not significant.
System           Dev. Test Set   Final Test Set
Baseline         85.0            79.4
Hand labels      87.5            87.5
Machine labels   90.0            90.7

Table 4. Recognition performance on development and final test set (f = 0.1)
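For reference, the McNemar test reported above can be computed from per-utterance correctness alone; a small sketch, assuming an exact binomial variant (the paper does not specify which variant was used, and the names here are ours):

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test from per-utterance correctness flags.

    correct_a, correct_b -- parallel lists of booleans, one per test
    utterance, for the two recognition systems being compared.
    """
    # Discordant pairs: one system right where the other is wrong.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis the discordant pairs split Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```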
6. DISCUSSION
Our research suggests that it is feasible to model word-level variability in speech by creating word models from the output of a phonetic frontend. We observed that word models constructed from machine-labeled utterances produced recognition performance equivalent to that obtained with word models constructed from hand-labeled utterances, and that both produced superior recognition performance to that obtained with baseform pronunciations. Although word models derived from machine labels do not outperform those derived from hand labels, considerably less human effort is required.

The major limitation of the current approach is the need for sufficient training examples of each word. It should, however, be possible to extend this approach to natural continuous speech when large speech corpora are available with word-level transcriptions and baseform pronunciation dictionaries. Given these resources, models can be created for phonemes (or other subword units) in context from the output of a phonetic frontend. The phoneme models can then be concatenated to produce any word. To give an example, consider the construction of a phoneme model for /b/ in initial position in a stressed syllable. Each occurrence of /b/ in this context can be located in our (imaginary) corpus from the word transcriptions and pronunciation dictionary. A forced alignment of the phonetic frontend to the words in each utterance is then performed using dictionary pronunciations; this produces a time-aligned phonetic transcription with the expected pronunciation of each word. A second, unconstrained pass with the phonetic frontend produces a new time-aligned transcription. The unconstrained transcription is then mapped to the baseform transcription. This mapping provides the data that can be used to create phoneme models, such as the expected number of correct recognitions, substitutions, deletions and insertions of syllable-initial /b/ before a stressed vowel. The definition of "phoneme in context" is an interesting question for future work. It will also be interesting to learn if this approach scales to larger vocabularies, where there would be a greater chance of false matches to the bushier networks created automatically.
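A sketch of the proposed mapping step, using a standard edit-distance alignment to tally correct, substituted, deleted and inserted phonemes; the paper does not specify an alignment procedure for this extension, so the procedure and all names here are illustrative assumptions:

```python
def align_counts(baseform, observed):
    """Levenshtein-align two phoneme sequences and tally, for each
    baseform phoneme, whether it was correct, substituted, or deleted,
    plus insertions of observed phonemes with no baseform counterpart."""
    B, O = len(baseform), len(observed)
    # d[i][j]: edit distance between baseform[:i] and observed[:j]
    d = [[0] * (O + 1) for _ in range(B + 1)]
    for i in range(B + 1):
        d[i][0] = i
    for j in range(O + 1):
        d[0][j] = j
    for i in range(1, B + 1):
        for j in range(1, O + 1):
            sub_cost = 0 if baseform[i - 1] == observed[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub_cost,  # match/substitute
                          d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1)             # insertion
    # Backtrace, counting each kind of event.
    counts = {"correct": 0, "substitution": 0, "deletion": 0, "insertion": 0}
    i, j = B, O
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (0 if baseform[i - 1] == observed[j - 1] else 1)):
            counts["correct" if baseform[i - 1] == observed[j - 1] else "substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts

# Toy usage: /b ao s t ax n/ recognized as /p ao s ax n/.
print(align_counts(["b", "ao", "s", "t", "ax", "n"],
                   ["p", "ao", "s", "ax", "n"]))
```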
7. ACKNOWLEDGEMENTS

Research supported by US West, the Office of Naval Research, and the National Science Foundation.

REFERENCES

[1] V. Zue, J. Glass, M. Phillips, and S. Seneff. The MIT SUMMIT speech recognition system: a progress report. In Proceedings of the DARPA Speech and Natural Language Workshop, 1989.
[2] M. Cohen. Phonological Structures for Speech Recognition. PhD thesis, Department of EE and CS, University of California, Berkeley, CA, 1989.
[3] M. Riley and A. Ljolje. Recognizing phonemes vs. recognizing phones: a comparison. In Proceedings of ICSLP, 1992.
[4] H. Murveit, M. Weintraub, M. Cohen, and J. Bernstein. Lexical access with lattice input. In Proceedings of the DARPA Speech and Natural Language Workshop, 1987.
[5] H. Rulot, N. Prieto, and E. Vidal. Learning accurate finite-state structural models of words through the ECGI algorithm. In Proceedings of ICASSP, 1989.
[6] M. Fanty, P. Schmid, and R. Cole. City name recognition over the telephone. In Proceedings of ICASSP, 1993.
[7] R. Cole, K. Roginski, and M. Fanty. A telephone speech database of spelled and spoken names. In Proceedings of ICSLP, 1992.