Speech Communication 46 (2005) 171–188 www.elsevier.com/locate/specom
Implicit modelling of pronunciation variation in automatic speech recognition

Thomas Hain
Engineering Department, Cambridge University, Trumpington Street, Cambridge CB2 1PZ, UK

Received 23 February 2003; received in revised form 9 September 2004; accepted 9 September 2004
Abstract

Modelling of pronunciation variability is an important task for the acoustic model of an automatic speech recognition system. Good pronunciation models contribute to the robustness and generic applicability of a speech recogniser. Usually pronunciation modelling is associated with a lexicon that allows explicit control over the selection of appropriate HMMs for a particular word. However, the use of data-driven clustering techniques or specific parameter tying techniques has considerable impact on this form of model selection and on the construction of a task-optimal dictionary. Most large vocabulary speech recognition systems make use of a dictionary with multiple possible pronunciation variants per word. By manual addition of pronunciation variants, explicit human knowledge is used in the recognition process. For reasons of complexity the optimisation of manual entries for performance is often not feasible. In this paper a method for the stepwise reduction of the number of pronunciation variants per word to one is described. By doing so in a way consistent with the classification procedure, pronunciation variation is modelled implicitly. It is shown that the use of single pronunciation dictionaries provides similar or better word error rate performance on both Wall Street Journal and Switchboard data. The use of single pronunciation dictionaries in conjunction with hidden model sequence models, as an example of an implicit pronunciation modelling technique, shows further improvements.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Automatic speech recognition; Pronunciation modelling; Acoustic modelling; Hidden Markov models; Pronunciation dictionaries; Single pronunciations; Parameter tying; Phonetic decision trees; State clustering; Conversational speech recognition; Hidden model sequence models
1. Introduction
E-mail address: [email protected]
State-of-the-art automatic speech recognition (ASR) systems are required to operate in complex environments, both acoustically and linguistically. In order to obtain reasonable performance
statistical pattern recognition approaches have dominated research in this field over recent decades. The tolerance of stochastic models to misrepresentation has proven to be vital in the development of large scale systems capable of operating in diverse frameworks. The complexity of the speech recognition task makes a separation into knowledge sources such as acoustic, pronunciation and language models a necessity. For the purpose of system development these sources are often assumed to be independent. However, the system complexity causes unexpected interactions between techniques, and consequently interest in improved integration of all knowledge sources in the development of ASR systems has increased.

Ideally the acoustic models would be split into parts that allow independent modelling of all specific speech-relevant factors. In practice this separation is not straightforward or is tied to a certain modelling technique. Only the use of words as recognition units is relatively general and common to very different approaches. In practice the need for models of continuous speech, together with the desire to cover a large variety of words, makes a solely word-based acoustic modelling approach infeasible, mostly due to lack of training data and computational cost. Normally words are represented by an abstract representation of the acoustic realisation, the pronunciation string. State-of-the-art medium or large vocabulary ASR systems use pronunciation dictionaries as knowledge sources to translate individual words into model structures. This is especially the case within a hidden Markov model (HMM) framework, where the construction of word HMMs from phone models is straightforward.

The use of individual HMMs for modelling of phonemes does not provide enough flexibility to model coarticulatory, allophonic effects. Phonemic context can be provided in the form of triphone models, while model or state clustering allows speech models to be grouped into atomic speech units. The techniques to capture phone variability are similar in most ASR systems. They allow the construction of powerful classifiers that are very flexible and automatically adjust to provide more detailed modelling in regions of increased
complexity. One effect of this strategy is that modification of information on a higher level in the modelling hierarchy, for example the change of a particular phoneme in one pronunciation of a word, may have only a minor effect on the actual word models. Most schemes in this area are based on a maximum likelihood framework (see for example Young et al., 1994). In that case the construction of models is strongly influenced by the relative frequency of occurrence of a symbol in the training data. Frequent occurrence of a word in the training corpus implicitly allows the, though fully automatic, construction of almost word-specific models. This, however, dissociates the individual triphone models from the underlying phonetic interpretation encoded in symbolic form. It is difficult to associate a linguistic or phonetic interpretation with individual HMM states. Normally the left-to-right modelling approach allows a segmentation of the speech signal in time. In the case of quasi-word models, however, the state boundaries, and in consequence the phone model boundaries, do not necessarily reflect the phonetic segmentation of the signal. It follows that good matching properties of triphone models are not necessarily tied to good quality of signal segmentation. Nevertheless, phonetic interpretation and segmentation are often used as arguments for enriched pronunciation dictionaries. The dissociation of phoneme models from phones is of lesser concern in modelling of clean read speech, but is clearly evident in the highly variable acoustics present in spontaneous or conversational speech (Greenberg, 1996).

The above properties suggest that the modelling of pronunciation variation on a symbolic level is problematic, as it may be significantly altered by the flexible mappings used on lower levels. When choosing a particular pronunciation for a word, three important decisions are made: the determination of the durational range; the determination of the number of pseudo-stationary states in the word; and the assignment of these states to states of other words that may be similar in nature. The durational constraint has potentially only a minor effect on performance, while the state selection is also addressed by lower level clustering stages. This paper will focus on the third aspect:
the self-consistency of a dictionary and its consistency with a particular recognition task. The rest of this paper is organised as follows: The next section provides a brief discussion of pronunciation modelling techniques with special focus on implicit modelling techniques. In order to avoid the situation where two techniques try to capture the same information, canonical pronunciations are desirable. Section 3 describes a strategy for the selection of pronunciation variants to form a single pronunciation dictionary. Section 4 gives details on experiments using single pronunciation dictionaries performed on the Wall Street Journal (WSJ) and Switchboard (Swbd) corpora.
2. Modelling of pronunciation variation

The factors influencing the realisation, transmission and perception of a speech signal are manifold. In most circumstances a separation of the speech signal from acoustic channel effects is only possible to a limited degree. Even though it is difficult to draw strict boundaries between pronunciation modelling and general acoustic modelling, the term pronunciation modelling is most often concerned with the definition, selection and use of symbols required to describe the acoustic realisation of an utterance. Representation on a symbolic level is accessible to human analysis and allows explicit rules to influence the speech model. The explicit representation of phonological and linguistic effects is accessible to data-driven learning approaches, but normally relies heavily on knowledge-driven constraints. This stands in contrast to implicit models, which elude strict interpretation and consequently knowledge-based analysis. ASR systems are normally based on a statistical pattern classification paradigm. Explicit approaches to modelling of pronunciation variation often use a combination of human knowledge and statistical models to allow better integration. For a detailed discussion of pronunciation modelling techniques the interested reader is referred to (Strik and Cucchiarini, 1999) or (Humphries, 1997). Explicit encoding has the advantage that certain speech effects can be targeted explicitly, for
example a certain accent, sometimes without the need for new and expensive data collection. Most ASR systems use multiple pronunciations per word in training and test dictionaries to model pronunciation variation. However, more elaborate techniques are rarely used in large scale systems. In recent years the progress in automatic transcription of spontaneous or conversational speech has triggered increased interest in pronunciation modelling (see for example Fosler et al., 1996; Byrne et al., 1998; Ostendorf, 1999; Saraçlar et al., 2000; Bates and Ostendorf, 2002). The reason for the special attention is the astonishingly poor performance of speech recognisers when dealing with conversational data. Saraçlar et al. (2000) have shown that this can be attributed to a significant degree to a substantial increase in pronunciation variability. In an experiment conducted by Weintraub et al. (1996) and later repeated by Saraçlar et al. (2000), conversational speech was recorded and transcribed. Second and third recordings of the same text from the same speakers were obtained, however in read and imitating styles. Speech recognisers were trained and tested on each set of recordings. The error rate for the original conversational speech was more than 50% relative higher than for the two other cases. As this effect clearly must be attributed to a massive increase in pronunciation variability, extensive research in pronunciation modelling with this type of data was conducted. However, up to now the performance gains from modelling of pronunciation variation have fallen far behind the gains anticipated from Weintraub's experiments. Most importantly, improvements were achieved with schemes that incorporated the acoustic models (Riley et al., 1999).

In the following we distinguish between pronunciation modelling approaches that benefit from knowledge in explicit, human-interpretable, and mostly symbolic form and those that are purely data driven and thus normally elude a strict linguistic interpretation.

2.1. Explicit models

In order to describe the pronunciation of a word in the form of symbols, only a finite length sequence is sensible. Thus the use of multiple
pronunciation variants per word can be implemented in two forms: as a simple list of independent phoneme strings or in the form of a tree-shaped pronunciation network (Wooters and Stolcke, 1994). Both representations follow a strictly left-to-right order. Unless a trivial network topology is chosen, the representation in network form is more compact and may allow a more natural implementation of pronunciation rules (Cremelie and Martens, 1997). For the purpose of Viterbi-style recognition the two forms are theoretically equivalent.1 Usually pronunciation modelling in training and testing is not identical. In the test case a fixed set of variants is used, whereas in the training case the pronunciation variant associated with a particular word utterance is assumed to be known. In practice the variant selection is based on forced alignment with previously trained acoustic models and thus has a tendency to enforce preferences for particular variants. Since the standard technique for recognition uses hard selection, the effect of this procedure in training is similar to that used for recognition. It has to be noted that the hard selection of paths in Viterbi-based recognition is problematic for pronunciation modelling on a symbolic basis. Whereas a pronunciation network may be used to represent ambiguity rather than choice, this is ignored in the standard training and recognition frameworks. In any case the use of multiple representations for a particular word increases the confusability with other words, as the distance between words becomes smaller. Thus the benefit from adding new variants has to be balanced against the added confusability. In practice the difference between pronunciations is difficult to assess. The effect on recognition performance can only be assessed in conjunction with both acoustic and language models. Optimisation or even measurement of confusability is non-trivial and can be explored in different contexts, for example in the context of language modelling (Printz and Olsen, 2000). One way to control the relationship between pronunciation variants is the use of pronunciation probabilities.
1 A practical difference may arise when approximations such as beam pruning introduce undesired effects.
The probabilities are used in addition to language model and acoustic scores in recognition. It is sensible to obtain the pronunciation probabilities from real data, and in order to obtain reasonable estimates a large number of instances of each word in the training set would be required. Since this is impractical and does not generalise to words unseen in the training data, other approaches using sub-word unit information are normally employed. For example, the Expectation–Maximisation algorithm can be used to train stochastic context-free grammars which allow probabilities to be assigned to phoneme sequences (e.g. Cremelie and Martens, 1997). Another very simple but effective approach is based on pronunciation variant frequencies in the training data (Hain et al., 2001). Due to the problems outlined above, smoothing of the probability estimates is required.

Standard approaches normally do not make use of cross-word information to alter pronunciations. An important example of extending the notion of words in a recognition dictionary is the use of multi-words (e.g. Finke and Waibel, 1997), which explicitly model phone or even syllable reductions present in spontaneous speech (Greenberg, 1998). A simple example of this reduction is the use of a single entry going_to instead of two separate dictionary entries. Work presented by Stolcke et al. (2000) shows that substantial improvements can be made by incorporating a relatively large number of multi-words into both language and acoustic models. However, experimental results on the use of multi-words have not been consistent among different research groups (e.g. Ma et al., 1998; Nock and Young, 1998). Most pronunciation dictionaries used in large vocabulary ASR systems are generated by automatic rule-based systems and corrected manually. This process is very expensive since the manual effort has to be accompanied by extensive recognition experiments to tune performance.
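As an illustration of the frequency-based approach, the sketch below estimates pronunciation probabilities from forced-alignment counts with simple add-one style smoothing. The function names, data and the smoothing constant are illustrative assumptions; the paper does not prescribe a particular smoothing scheme.

```python
from collections import Counter, defaultdict

def pron_probabilities(aligned_tokens, floor_count=1.0):
    """Estimate P(pronunciation | word) from forced-alignment counts.

    aligned_tokens: iterable of (word, pronunciation) pairs, one per
    word token in the training data, as produced by Viterbi alignment.
    floor_count: add-one style smoothing constant (an assumption).
    """
    counts = defaultdict(Counter)
    for word, pron in aligned_tokens:
        counts[word][pron] += 1

    probs = {}
    for word, variant_counts in counts.items():
        total = sum(variant_counts.values()) + floor_count * len(variant_counts)
        probs[word] = {pron: (c + floor_count) / total
                       for pron, c in variant_counts.items()}
    return probs

# Invented example: "either" aligned 7 times to one variant, 3 to another.
tokens = [("either", "iy dh er")] * 7 + [("either", "ay dh er")] * 3
print(pron_probabilities(tokens)["either"])   # {'iy dh er': 0.667, 'ay dh er': 0.333}
```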
2.2. Implicit modelling

Explicit modelling of pronunciation variability has several disadvantages. Knowledge-based approaches are usually expensive to develop and do not always translate well to new domains. Secondly, explicit information is necessarily only available on a high level, which implies a coarse influence on the model structure. In conjunction with data-driven methods, data sparsity is a likely problem due to the low symbolic rate. If the targeted pronunciation effects are not of primary importance, even explicit methods are difficult to assess. Another approach is to model pronunciation variability with implicit statistical models. In theory this removes the necessity of high-level and symbolic approaches, and, if integrated with the general acoustic modelling approach, joint training can yield improved performance. Obviously the picture is not as clear as this argument might suggest. The disadvantage of such schemes in general is that specific targeting of pronunciation variations is difficult and that other acoustic effects can have an impact on performance. The necessarily incurred increase in model complexity is undesirable, and the sets of latent variables and their dependencies are sometimes left to speculation.

HMMs usually make use of mixtures of distributions of the exponential family. Assuming that pronunciation changes are sufficiently modelled by a substitution of quasi-stationary segments in the speech signal, the use of mixture distributions would suffice to reflect pronunciation variation. Such a model integrates variability from a huge variety of different variability sources and automatically adjusts to effects of greater importance. This is problematic on two accounts: first, the multiple sources of variability require training data that covers the range for each source not only individually but also in combination; and secondly, state or model interdependence undoubtedly exists but is only weakly modelled in HMMs as used for ASR.2 Data sparsity and the existence of longer-span contextual effects may make the use of higher level information desirable.
2 The use of state clustering conditioned on phone context exerts the same condition on two neighbouring states.
The following briefly discusses some areas in HMM-based acoustic modelling where pronunciation variation is modelled implicitly. It is not the intention of this paper to give an exhaustive overview of all techniques that have been investigated. Instead the following will focus on some methods that utilise the considerable modelling power of Gaussian mixtures. A general discussion of parameter tying techniques is followed by a short description of hidden model sequence models.

2.2.1. Parameter tying

If pronunciation effects are representable by symbolic replacements, Gaussian mixture models in conjunction with automatic parameter tying techniques should be capable of modelling and capturing pronunciation variability. The parameter clustering can be performed at the model, state, or mixture component level. One particular advantage of this approach is that maximum likelihood or discriminative criteria can be used to assess the quality of clusters. Examples of algorithms employing the tying of states or models can be found in (Hwang et al., 1993; Young et al., 1994). Both methods use phonetic decision trees in the clustering process. Examples of Gaussian mixture sharing are the semi-continuous HMMs as proposed by Huang and Jack (1989) or, in a more general framework, the tied mixture models presented by Bellegarda and Nahamoo (1990). Another data-driven parameter tying scheme is the use of fenones as phone model building blocks (see Bahl et al., 1991), which has a more direct relationship to standard pronunciation strings. More recently Luo and Jelinek (1999) have introduced a method for the soft-tying of states, which was used by Saraçlar et al. (2000) to model pronunciation variability in spontaneous speech. It is important to note that parameter tying schemes are usually capable of modelling substitution effects, but maintain the temporal structure of the encapsulating HMMs. Riley et al. (1999) showed on a conversational speech task that more flexibility and better modelling can be achieved by the use of optional deletions and insertions of phonemes.

2.2.2. Hidden model sequence HMMs

Hidden model sequence HMMs (HMS-HMMs), introduced in (Hain and Woodland, 1999), can be
interpreted as a hierarchical parameter tying scheme. The basic idea is to replace the deterministic mapping from phoneme to HMM or phoneme to HMM-state, given by phonetic decision trees, with a stochastic model. The model provides the mapping between a sequence of phonemes and a particular realisation in the form of an HMM. Given a sentence hypothesis W, the likelihood of an observation sequence O is described by

P(O|W) = \sum_{M \in \Omega(R)} P(M|R) P(O|M)    (1)
where M denotes a sequence of models, R represents the phoneme sequence associated with W, Ω(R) the set of all possible model sequences given R, and P(M|R) represents the model sequence model (MSM). It can be shown that under mild constraints the parameters of the MSM can be jointly optimised with the HMM parameters within an Expectation–Maximisation framework. Practical implementation requires constraining the interdependence of models and phonemic symbols. One particular realisation of an HMS-HMM is designed to model only substitution effects:

P(M|R) = \prod_{t=1}^{L} P(m_t | r_{t-1}, r_t, r_{t+1})    (2)
Eq. (2) describes a situation where the number of symbols in both sequences is identical (L in this case). The choice of a particular HMM m_t at position t is conditioned on the local tri-phoneme context (r_{t-1}, r_t, r_{t+1}) that is given by the pronunciation string. The set of possible HMMs M(r_t) for this context depends on the centre phone r_t, that is:

P(m_t | r_{t-1}, r_t, r_{t+1}) = 0   if m_t \notin M(r_t)    (3)
The initial models in the set M can be obtained from state-clustered HMMs. A model described in this way is similar in nature to clustered HMMs; the hard decision for a particular model is replaced by a statistical model. Viewed in terms of parameter usage this can be interpreted as hierarchical tying, since parameters are shared between different tri-phonemes. Formulation in this form allows a simple network interpretation. Fig. 1 shows an example of a network structure for the word elicit. The phonemic transcription of the word contains three instances of the phoneme /ih/. In each instance the same set of potential models M is used; however, depending on the context, many of the model probabilities are zero or close to zero. This introduces a context-dependent network topology that is obtained automatically in the training process. From the figure it is evident that this framework can be used to automatically detect the existence of substitution pronunciation effects and derive appropriate models. The topology of the HMMs used in HMS-HMMs is not constrained.
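To make the substitution-only model of Eqs. (2) and (3) concrete, the sketch below scores a model sequence against a phoneme sequence using a tri-phoneme context table. The context table, model names and probabilities are invented for illustration; a real system would estimate them jointly with the HMM parameters as described above.

```python
import math

# Model sequence model P(m | r_{t-1}, r_t, r_{t+1}), Eq. (2);
# entries and probabilities are invented for illustration.
MSM = {
    ("sil", "ih", "t"): {"m2": 0.9, "m3": 0.1},
    ("ih", "t", "sil"): {"m7": 1.0},
}

def model_sequence_logprob(models, phonemes):
    """Return log P(M|R) for equal-length model and phoneme sequences."""
    assert len(models) == len(phonemes)
    padded = ["sil"] + phonemes + ["sil"]      # sentence-boundary context
    logp = 0.0
    for t, m in enumerate(models):
        context = tuple(padded[t:t + 3])       # (r_{t-1}, r_t, r_{t+1})
        p = MSM.get(context, {}).get(m, 0.0)   # Eq. (3): zero outside M(r_t)
        if p == 0.0:
            return float("-inf")
        logp += math.log(p)
    return logp

print(model_sequence_logprob(["m2", "m7"], ["ih", "t"]))   # log(0.9 * 1.0)
```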
Fig. 1. HMS-HMM structure for simple modelling of substitution effects for the word "elicit". Each numbered node in the network represents one specific HMM. The HMM itself has arbitrary internal topology. Note that the effective set of models associated with phoneme /ih/ varies with phonemic context.
A case of particular importance are single-state HMMs. If we add a dependency on a particular position k within a phoneme to Eq. (3) and constrain the set of possible models to the position within a phone, i.e.

P(m_t | k, r_{t-1}, r_t, r_{t+1}) = 0   if m_t \notin M(k, r_t)    (4)
we arrive at a formulation that describes a "soft" version of HMM state-tying (Young and Woodland, 1994). The practical use of HMS-HMMs has to provide means to deal with phoneme contexts not observed in the training data but induced by a recognition dictionary and cross-word modelling. The use of probability distributions over models again allows a "soft" approach well known, for example, from language modelling: discounting and backing off to less refined model probability estimates. It has to be emphasised that the case outlined above ignores the probabilistic dependency on neighbouring models and solely addresses coarse substitution effects. A more detailed modelling of substitutions can be achieved by additional controlled sharing of models between phonemes. The modelling of insertion and deletion effects requires more extensive models. For a detailed discussion of HMS-HMMs the interested reader is referred to (Hain, 2001).
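A minimal sketch of one possible discount-and-back-off scheme for model probabilities in unseen tri-phoneme contexts follows. The exact formulation used with HMS-HMMs is not given here, so the discounting constant and the back-off to a centre-phone distribution are assumptions in the spirit of language-model back-off.

```python
def model_prob(msm_tri, msm_mono, context, model, discount=0.1):
    """P(m | r_{t-1}, r_t, r_{t+1}) with discounting and back-off.

    msm_tri:  {tri-phoneme context: {model: probability}}
    msm_mono: {centre phone: {model: probability}}, the coarser
              distribution backed off to (an assumed choice)
    """
    r_centre = context[1]
    tri = msm_tri.get(context)
    if tri is None:
        # context never observed in training: back off entirely
        return msm_mono.get(r_centre, {}).get(model, 0.0)
    if model in tri:
        # discount observed models to reserve mass for unseen ones
        return (1.0 - discount) * tri[model]
    # spread the reserved mass over models unseen in this context,
    # proportionally to the backed-off centre-phone distribution
    unseen = {m: p for m, p in msm_mono.get(r_centre, {}).items() if m not in tri}
    mass = sum(unseen.values())
    return discount * unseen[model] / mass if model in unseen and mass > 0 else 0.0
```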
3. Single pronunciation dictionaries

The modelling of pronunciation variability in HMM-based speech recognisers cannot be separated from other acoustic modelling issues. The use of context, and more specifically the structural aspects of HMM sets such as parameter tying or mixture modelling, has an impact on the performance of each individual component. The standard approach to pronunciation modelling is the use of pronunciation dictionaries with a fixed set of pronunciations for each word. The use of multiple pronunciation variants in dictionaries is commonly assessed on the basis of an existing single pronunciation dictionary. In these cases the addition of new and obvious variants usually brings improvements as long as the confusability is kept low, i.e. the number of added variants is small. The addition of variants is based
on the assumption that the starting point, the initial dictionary, was optimal in some sense. However, without knowledge of the task it is difficult to assess the between-word confusability a priori. The selection of one particular pronunciation variant has an influence on the models associated with all other words in the recognition dictionary. Consistency in phonemic representation may be of greater importance than an improved representation of the training utterances. Taking these facts into account, the inverse of the standard situation is of interest: Given a suitable and well performing multiple pronunciation dictionary, is it possible to derive a consistent and well performing single pronunciation dictionary? What impact does this have on the performance of other pronunciation modelling techniques?

The first question was addressed in a set of experiments using dictionaries derived from the LIMSI 1993 WSJ multiple pronunciation (MPron) dictionary (Gauvain et al., 1994). Pronunciations for words not contained in the original dictionary were added manually. The dictionary is the result of careful manual construction for the purpose of use in ASR systems. Using the assumption that dictionaries are effectively task dependent, single pronunciation (SPron) dictionaries are specifically constructed for each task under investigation. An automated method for dictionary construction was derived. The algorithm obtains pronunciation information from the acoustic training data to train simple statistical models that allow the selection of pronunciation variants. Since the list of words used in training usually differs from that used in recognition, the algorithm also allows the selection of pronunciations for words not observed in training.

For the purpose of further discussion a joint categorisation of words and pronunciations is shown in Fig. 2, first in terms of the association with training and test dictionaries, and secondly with respect to the relationships between pronunciations for a particular word. Substitutions denote the case where one or more phonemes are changed (e.g. /dh eh r/ versus /dh ey r/) while other changes are described by insertions or deletions (e.g. /dh eh r/ versus /dh axr/). The categories
Fig. 2. A joint view of the list of words used in training and test. A word belongs either to the training or the test dictionary or both. The categories G, H and I denote all words with only one pronunciation, the categories D, E and F represent all pronunciations associated with the three word classes that can be described by phoneme substitutions. The remaining categories A, B and C describe words that can only be described in terms of phoneme deletions or insertions.
G, H and I are of no concern here as they already represent words with only one pronunciation. In the case of the WSJ setup (see Section 4.1) the percentages of pronunciations associated with the categories (A, B, C, D, E, F, G, H, I) were (0.1%, 2.8%, 4.6%, 0.1%, 3.3%, 8.6%, 0.4%, 15.2%, 64.7%), respectively. The treatment of pronunciations in the categories A, B, D, E is discussed in the following section, followed by a description of the processing of words in the categories C and F.

3.1. Pronunciation selection for words observed in training

The following requires a set of word level transcripts of the training data, the associated acoustic data, an MPron dictionary and an HMM set trained using that dictionary. These are used to obtain statistics from the training data as follows:

(1) Pronunciation variant frequency: Viterbi alignment is used to obtain a phoneme level transcription of the training data. The frequency of each pronunciation for each word in the training dictionary is obtained.

(2) Frequency-based variant pruning: The pronunciations for each word in the baseline MPron dictionary are sorted according to frequency of occurrence in the training data. If a word was observed in the training data, any associated unseen pronunciation variants are deleted. All words in the dictionary not observed in training are left untouched.

(3) Merging of phoneme substitutions: For a given word each pair of pronunciation variants is aligned using dynamic programming, starting with the variant with the highest frequency. If for a given variant pair only substitutions of phonemes are observed, the variant with the higher frequency of occurrence is retained and the frequency of the second variant is added to it. If the frequencies of both variants are identical, a random selection is made. This procedure is applied to each word observed in training.

In step (3) variants associated with categories D and E were removed. A solution for the variants associated with categories A, B, C, and F still needs to be found. Note that Fig. 2 does not reflect the true sizes of the categories. The categories A and D are normally relatively small compared to
the size of B, C, E and F, and in general considerably fewer words are associated with the categories B and C than with E and F. After the above stages, for words observed in training, only variants remain that cannot be solely described by phoneme substitutions (categories A and B). In this case the pronunciation with the highest frequency is chosen:

(4) Selection of pronunciation variants: For each word the variant with the highest frequency of occurrence in the training data is retained. In the case of identical frequencies a random selection is made.

A sketch of steps (1)-(4) is given below. It is essential to derive models from the training data that allow processing of the words not seen in the training data, i.e. the categories C and F in Fig. 2. For these cases a statistical model can be trained on the decisions made on pronunciations associated with the categories A, B, D and E.
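The following sketch compresses steps (2)-(4) into a single function, assuming variant frequencies from Viterbi alignment (step (1)) are already available. A pair of variants can be described by substitutions alone exactly when the two phoneme strings have equal length, which is used here in place of a full dynamic-programming alignment; ties fall to sort order instead of random choice, for reproducibility.

```python
from collections import Counter

def select_single_prons(variant_freqs):
    """Steps (2)-(4): reduce each word to a single pronunciation.

    variant_freqs: {word: Counter mapping phoneme tuples to alignment
    frequencies}; unseen variants of seen words are assumed pruned
    already (step (2)). Returns the SPron dictionary and the
    (source, target) pairs that feed the model of step (5).
    """
    spron, subst_pairs = {}, []
    for word, freqs in variant_freqs.items():
        freqs = Counter(freqs)
        # step (3): merge substitution-only pairs (equal length) into
        # the more frequent variant, accumulating their frequencies
        for v in sorted(freqs, key=freqs.get, reverse=True):
            if v not in freqs:
                continue                       # already merged away
            for other in [o for o in freqs if o != v and len(o) == len(v)]:
                subst_pairs.append((v, other))     # v acts as the source
                freqs[v] += freqs.pop(other)
        # step (4): among the remaining variants (insertion/deletion
        # relations only) keep the highest accumulated frequency
        spron[word] = max(freqs, key=freqs.get)
    return spron, subst_pairs

# Invented alignment counts:
counts = {"either": Counter({("iy", "dh", "er"): 7, ("ay", "dh", "er"): 3})}
spron, pairs = select_single_prons(counts)     # spron["either"] == ("iy", "dh", "er")
```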
3.2. Pronunciation selection for unseen words

In order to assess the importance of the selection process for unseen words, two different approaches, further denoted as methods F and P, have been implemented. In method P the decision process uses a statistical model and decisions are made on the basis of probability estimates. For the following we assume that for each word in the dictionary there exists a canonical pronunciation, further named the source, that can be used to systematically derive all other pronunciations for that word, the targets. Given two phoneme strings a and b, we would like to determine whether a is a realisation of a source pronunciation s and b is the realisation of a derivable target pronunciation t. In particular we want to know whether a is the source or b, or in other words, how predictable the string b is from a and vice versa. The decision can be made by comparing the joint probabilities of the source–target association events:

P(s = a, t = b) \gtrless P(s = b, t = a)    (5)

This equation can be simplified using Bayes' rule:

P(t = b | s = a) P(s = a) \gtrless P(t = a | s = b) P(s = b)    (6)

Under the assumption that the priors for the source transcription are equal, i.e. P(s = a) = P(s = b) for arbitrary phoneme strings a, b, the priors can be discarded in the decision process. The use of the chain rule provides an estimate for \hat{P}(t | s):

\hat{P}(t | s) = \prod_{i=1}^{M} P(t_i | t_1, t_2, ..., t_{i-1}, s)    (7)

where M denotes the number of phonemes of the target pronunciation, and t_i is the ith phoneme. Assuming that the sequences under investigation are aligned using a dynamic programming procedure, a simple model for computing Eq. (7) is given by

\hat{P}(t | s) = \prod_{i=1}^{M} \hat{P}(t_i | s_i) = \prod_{i=1}^{M} N(t_i; s_i) / N(s_i)    (8)

where N(t_i; s_i) is the frequency with which phoneme s_i occurs in the source when phoneme t_i occurs in the target. Note that the source–target frequencies are not symmetric, i.e. N(t_i; s_i) ≠ N(s_i; t_i). The counts can be obtained from the words associated with the categories A, B, D and E (see Fig. 2). For these sets the variant with the higher frequency was retained in steps (3) and (4), and for the purpose of obtaining the counts this variant is assumed to be the source. Note that the model described in Eqs. (5) and (8) does not take the pronunciation variant frequencies in the training data into account. The final steps of method P are:

(5) Model estimation: Obtain the frequencies of symbol substitution between source and target as required for Eq. (8), based on the source pronunciations identified in steps (3) and (4). In order to smooth the probability estimates, one is added to all counts.

(6P) Variant selection: For the remaining words (categories C and F in Fig. 2) select variants on the basis of Eq. (5). Note that dealing with insertions and deletions simply involves the introduction of an additional symbol to the standard set of phonemes.
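A sketch of steps (5) and (6P) follows. As noted above, the phone inventory is assumed to include an extra gap symbol ("-") so that insertions and deletions reduce to substitutions against the gap; all names and data are illustrative.

```python
import math
from collections import Counter

def train_subst_model(pairs, phones):
    """Step (5): substitution counts N(t_i; s_i) from aligned
    (source, target) pairs, with one added to every count."""
    n_ts = Counter({(t, s): 1 for t in phones for s in phones})
    for source, target in pairs:
        for s, t in zip(source, target):
            n_ts[(t, s)] += 1
    n_s = Counter()
    for (t, s), c in n_ts.items():
        n_s[s] += c                            # N(s_i), the source marginal
    return n_ts, n_s

def log_p_target_given_source(s, t, n_ts, n_s):
    """log of Eq. (8) for DP-aligned, equal-length strings s and t."""
    return sum(math.log(n_ts[(ti, si)] / n_s[si]) for si, ti in zip(s, t))

def pick_source(a, b, n_ts, n_s):
    """Eq. (5) with equal priors: return whichever of a, b better
    predicts the other as its target (step (6P))."""
    if log_p_target_given_source(a, b, n_ts, n_s) >= \
       log_p_target_given_source(b, a, n_ts, n_s):
        return a
    return b

# Invented training pair and phone set, "-" marking an aligned gap:
pairs = [(("iy", "dh", "er"), ("ay", "dh", "er"))]
n_ts, n_s = train_subst_model(pairs, ["iy", "ay", "dh", "er", "-"])
print(pick_source(("iy", "dh", "er"), ("ay", "dh", "er"), n_ts, n_s))
```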
An even simpler approach is to use an approximate solution of Eq. (8). Since the basic decision rule is unaltered when using log-probabilities, Eq. (8) can be modified accordingly. Given two specific sequences a, b, the decision rule Eq. (5) can be rewritten as

C_a + \sum_{i=1}^{M} \log N(a_i; b_i) \gtrless C_b + \sum_{i=1}^{M} \log N(b_i; a_i)

where M is the length of the DP-aligned sequences, (a_i, b_i) is a pair of target–source phonemes associated in the alignment, and C_a and C_b are constants representing the sum of the logs of the frequencies of the phonemes in the training data. If we assume that C_a ≈ C_b and use log x ≈ 1 + x, a direct comparison of the counts obtained in step (5) can be made:

\sum_{i=1}^{M} N(a_i; b_i) \gtrless \sum_{i=1}^{M} N(b_i; a_i)    (9)

Method F uses Eq. (9) for substitution-only pronunciation pairs (category F). Since Eq. (9) does not handle unseen events well and the number of words in category C can be very small, only random selection is used in this case:

(6F) Variant selection: If the pronunciations associated with a word not observed in the training data can be aligned with substitutions only, Eq. (9) is used for the decision.

(7F) Random selection: Any remaining pronunciation variants are either selected by frequency, if available, or on a random basis.

Even though crude assumptions have been made to derive method F, its use is feasible if the test vocabulary is well covered by the training vocabulary or if the number of pronunciation variants associated with unseen words is small. The latter is of importance since variants are more likely to be used for words that occur frequently, which in turn are likely to be covered in the training data.

4. Experiments

State-of-the-art ASR systems should ideally be capable of performing well in a diverse set of conditions. Over recent years research interest has shifted from the transcription of read speech to recognition of spontaneous or conversational speech obtained from a variety of acoustic channels. Word error rates (WERs) on conversational speech are significantly higher than those obtained on read speech. Work on read speech may appear an ideal test-bed for pronunciation modelling as it is normally recorded in well-controlled conditions and lacks acoustic distortions. However, the natural environment of, for example, a telephone conversation allows for a much greater variability in pronunciation, which is substantially more difficult to model. The following sections describe experiments conducted on WSJ as a sample of a read speech corpus, and on Switchboard as an example of transcription of conversational telephone speech.
4.1. Wall Street Journal

The ARPA 1994 Hub1 unlimited vocabulary NAB News development and evaluation test sets (Pallett et al., 1995) were used for experiments on this corpus. Triphone mixture-of-Gaussian tied-state HMMs were used, with 12 mixture components for each speech state. The acoustic models are similar to the ones used in (Woodland et al., 1995). The speech signal is represented by a stream of perceptual linear prediction cepstral coefficients derived from a Mel-scale filter-bank (MF-PLPs). A total of 13 coefficients, including c0, and their first and second order derivatives were used. This yields in total a 39 dimensional feature vector. Cepstral mean normalisation was performed on a per sentence basis. The acoustic training data consisted of 36 493 sentences from the SI-284 WSJ0 and WSJ1 sets. All experiments make use of trigram lattices in decoding. The dictionaries used in training and test are based on the 1993 LIMSI WSJ lexicon (Gauvain et al., 1994), which contains multiple pronunciation variants per word. The dictionary uses 46 phonemes and includes a small number of
multi-words. Pronunciation variants are partly rule-generated but mostly manually optimised for use on WSJ. For training of the baseline model set an MPron dictionary containing 13 665 words was used. On average that dictionary contained 1.18 pronunciation variants per word, and the maximum number of variants per word was 8. The MPron dictionary used for recognition tests was considerably larger, with a total of 65 466 entries and an average of 1.11 pronunciations per word. For this dictionary a maximum of 12 variants per word was used. However, alternative pronunciations are included for only 6779 words, and the majority of these have a single alternative. The difference in the average number of variants per word between training and test dictionaries illustrates the fact that more variants are used for words that occur frequently.

In the experiments multiple SPron dictionaries were constructed using different strategies: For the construction of the SPron1 dictionary, pronunciation variant frequencies obtained from both the WSJ training set and the Switchboard training corpus (see Section 4.2) were used, together with method P for variant selection; dictionary SPron2 was constructed using frequencies obtained from the WSJ training data only, again with method P for pronunciation variant selection; for the third dictionary, SPron3, a purely random variant selection process was adopted.

Fig. 3(a) shows the number of pronunciations in the dictionary as a function of pronunciation variant length, for both the MPron and SPron1 dictionaries. Note that after automatic variant selection the pronunciation length distributions for the two dictionaries remain similar. In Fig. 3(b) the variant length frequencies are weighted using a unigram language model and subsequently normalised. The visible increase in the number of pronunciations of length 3 would suggest a shorter effective variant length. However, only a marginal difference in terms of average variant length can be observed. This contradicts the expectation that shorter variants are generally preferred.
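The unigram weighting behind Fig. 3(b) can be reproduced in a few lines. How exactly the weights were normalised in the paper is not stated, so sharing a word's unigram mass equally across its variants, as below, is an assumption.

```python
from collections import Counter

def variant_length_distribution(dictionary, unigram=None):
    """Relative frequency of pronunciation variant lengths.

    dictionary: {word: list of pronunciations (phoneme tuples)}
    unigram:    {word: P(w)} for Fig. 3(b)-style weighting;
                unweighted dictionary counts (Fig. 3(a) style) if None
    """
    hist = Counter()
    for word, prons in dictionary.items():
        for pron in prons:
            if unigram is None:
                hist[len(pron)] += 1           # one count per dictionary entry
            else:
                # assumed: split the word's unigram mass across variants
                hist[len(pron)] += unigram.get(word, 0.0) / len(prons)
    total = sum(hist.values())
    return {length: count / total for length, count in sorted(hist.items())}
```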
4.1.1. MPron versus SPron

The first set of experiments investigated the basic performance of the proposed methods. The third SPron dictionary, SPron3, built by purely random selection, served as a contrast to the proposed selection methods. Table 1 shows results obtained using HMM sets trained with four different dictionaries. In all cases model training involved the rebuilding of phonetic decision trees as well as successive mixture splitting. The parameters in state clustering were chosen to obtain models with a comparable number of model parameters. Not surprisingly, the poorest word error rate performance is obtained using the SPron3 dictionary, with a significant degradation in performance. In comparison the performance of the SPron2 dictionary is considerably improved. The difference between using the SPron1 and SPron2 dictionaries is minor. Overall the performance is only slightly poorer than the MPron baseline.

4.1.2. Regeneration of phonetic decision trees

A second set of experiments was conducted to explore the interaction between the model structure, represented in the form of phonetic decision trees, and the pronunciation dictionary. Table 2 shows word error rates obtained by rescoring trigram lattices for various combinations of model sets and dictionaries. The Baseline model set is trained using the MPron dictionary, while ReEst denotes four iterations of Baum-Welch re-estimation of the Baseline models using the SPron1 dictionary in training. Note that as the phonetic decision tree is unchanged, the number of parameters is identical for all model sets. The use of the SPron1 dictionary in testing only yields a severe degradation in performance of more than 23% relative, which can only be partially regained by re-estimation of model parameters with the SPron1 training dictionary, resulting overall in a 5% relative WER degradation. Interestingly, the use of the MPron dictionary with the SPron1 re-estimated model set still gives the same performance. This highlights the importance of matching decision trees to dictionaries. In that case even a mismatch in the training data can be overcome.
Fig. 3. Pronunciation variant length frequency as a function of pronunciation variant length. The graphs show the distributions for the MPron (x) and SPron (s) dictionaries. Figures (a) and (c) show absolute numbers, (b) and (d) relative frequencies. In (b) and (d) a word unigram distribution was used to weight the importance of a particular variant. (a) WSJ number of occurrences in dictionary, (b) WSJ relative occurrence in corpus, (c) Swbd number of occurrences in dictionary, (d) Swbd relative occurrence in corpus.
Table 1
%WER results on the WSJ H1 Dev and Eval test sets using different dictionaries for both training and test

Dictionary   Number of states   H1 Dev   H1 Eval   Average %WER
MPron        6447               8.97     9.65      9.33
SPron1       6419               9.05     9.95      9.53
SPron2       6425               9.33     9.93      9.64
SPron3       6486               9.65     10.95     10.24

Number of states denotes the total number of clustered states in the model set.

Table 2
%WERs on the WSJ 1994 H1 development and evaluation test sets

Model training   Test dictionary   H1 Dev   H1 Eval   Average %WER
Baseline         MPron             8.97     9.65      9.33
Baseline         SPron1            10.95    11.97     11.48
ReEst            SPron1            9.37     10.31     9.86
ReEst            MPron             9.07     9.50      9.30

ReEst denotes further Baum-Welch re-estimation steps using the SPron1 dictionary.
4.1.3. HMS-HMMs

In the above experiments the importance of matching model selection and dictionaries was shown. However, overall the WER performance using single pronunciation dictionaries is slightly poorer than that of the baseline MPron dictionary. A further set of experiments addressed the question whether implicit modelling techniques can benefit from the reduction to single baseforms. As an example of an implicit modelling technique, the experiments used HMS-HMMs in the configuration outlined in Section 2.2.2. In that configuration HMS-HMMs allow improved modelling of substitution effects. Table 3 compares experimental results using HMS-HMMs with those obtained with standard models. Two experiments were conducted, one using HMS-HMMs in conjunction with the MPron dictionary, and the second with the SPron1 dictionary. All HMS-HMM sets were initialised with the corresponding standard HMM models and have the same number of HMM parameters. In both experiments the use of HMS-HMMs outperforms the standard setup. When using the MPron dictionary the gain is small compared to that obtained with the SPron dictionary. Overall the performance is similar in both cases and the originally poorer performance using SProns was recovered.

Table 3
%WER results on the WSJ H1 Dev and Eval test sets using models trained and tested with dictionaries containing one (SPron1) or multiple (MPron) pronunciations

Model set   Dictionary   H1 Dev   H1 Eval   Average %WER
HMM         MPron        8.97     9.65      9.33
HMS-HMM     MPron        9.08     9.15      9.12
HMM         SPron1       9.05     9.95      9.53
HMS-HMM     SPron1       8.65     9.43      9.06

4.2. Switchboard

The Switchboard corpus is a large collection of conversational telephone speech. Due to the nature of the data, pronunciation modelling on this task has attracted widespread attention (see for example Byrne et al., 1998). A detailed description of the techniques and models used in the transcription of Switchboard data would go beyond the scope of this paper. For a description of the basic techniques used in the following experiments, as well as detailed descriptions of training and test sets, the interested reader is referred to (Woodland et al., 2002; Hain et al., 2000, 1999).
Again the 1993 LIMSI WSJ lexicon served as the primary source of pronunciation strings. The test dictionary used contains 54 598 words, with an average of 1.10 pronunciations per word. The training dictionary consists of 34 651 words with an average of 1.14 variants per word. Compared with the WSJ dictionaries, the training and test dictionaries used in the experiments here have a considerably larger overlap in terms of words. The percentages of pronunciations associated with the categories (A, B, C, D, E, F, G, H, I) as shown in Fig. 2 are (0.3%, 4.6%, 1.4%, 1.0%, 6.6%, 3.1%, 11.5%, 49.4%, 22.2%), respectively. The sets C and F are small compared to the situation on WSJ, and the sets (A, B, D, E) are larger and provide more examples for obtaining counts. This allows the use of the simple method F for the construction of the SPron dictionaries, both in training and test.

Fig. 3(c) shows the number of pronunciations as a function of pronunciation variant length. Similar to the situation on WSJ, the overall shape of the curve is unaffected by the selection of only one variant. However, when a unigram language model is used to introduce importance weighting, some differences emerge. Fig. 3(d) gives some indication that shorter pronunciations are preferred, in contrast to the results on WSJ (see Fig. 3(b)). The relative weighted frequency of pronunciations of lengths 1, 3 and 5 increased, while lower occurrence was observed for lengths 2, 4 and 6.

4.2.1. MPron versus SPron

Initial experiments investigated basic performance on a 3 h test set. Table 5 shows word error rate results on the dev01sub test set. This test set covers data from three different subsets of the Switchboard corpus: Switchboard1 (Swbd1), Switchboard2-Phase3 (Swbd2) and data collected via mobile phones, Switchboard2-Phase4 (Swbd2cell). Note the different levels of difficulty for each subset. All models are trained from scratch using 287 h of training data and contain 28 mixture components per speech state. The reference MPron model set has 6165 speech states, and clustering for the SPron dictionary was performed to yield a similar number. Model training further uses heteroscedastic linear discriminant analysis
(HLDA) and vocal tract length normalisation (VTLN). As can be observed, the use of a SPron dictionary yields lower word error rates on all subsets, seemingly with better results on more difficult data. Fig. 4 provides an analysis of the change in error rates on a per speaker basis. The range of possible WER results is divided into three categories and the number of speakers that fall into each range is counted. Note that the use of the SPron dictionary degrades performance at low error rates, but helps in higher error rate regions, with more per speaker WERs falling into the 30% category.

Fig. 4. Number of speakers associated with word error rates around 10%, 30% and 50%, using MPron (left bar) and SPron (right bar) dictionaries in training and recognition.

4.2.2. Pronunciation probabilities

The hard decision on a single pronunciation can be compared to the soft decision using probabilities for each pronunciation. Both the MPron and SPron dictionaries are tested in conjunction with pronunciation probabilities (see Section 2.1). The use of pronunciation probabilities in HTK extends to the use of different silence models, which are included in the dictionary as part of the pronunciation. Three types of inter-word silence models are used: a long silence; a short pause; and no silence. Appending the silence versions to words together with probabilities effectively allows the use of penalties associated with the insertion of silence models on a per pronunciation basis.3 Even though only one pronunciation exists in the SPron case, each pronunciation can still occur with different silence variants and associated probabilities. The pronunciation probabilities used are frequency-based estimates obtained from the training set.

The effect of pronunciation probabilities can be assessed effectively by computing model-based entropy estimates. The associated perplexities can be interpreted as average numbers.

3 This is sometimes called a silence penalty.
Table 4
Perplexity values 2^H obtained from entropy estimates H for MPron and SPron dictionaries for use on Switchboard

                   Uniform                 Unigram
Entropy estimate   MPron      SPron       MPron     SPron
H(W)               54 598     54 598      2071.9    2071.9
H(W|R)             1.128      1.125       1.082     1.065
H(R)               85 417.0   85 369.2    3457.5    3201.2
H(R|W)             1.765      1.758       1.834     1.672

The estimates are weighted either with a uniform probability distribution over all words or with a unigram language model.
Of particular interest are: the conditional entropy H(W|R), which describes the effect of homophones; the conditional entropy H(R|W), which describes the added uncertainty due to pronunciation variants; and the entropy H(R) as an overall measure of uncertainty. Table 4 shows the perplexity values 2^H obtained on the basis of the individual entropy values H, estimated solely on the dictionaries or in conjunction with a unigram language model estimated on the training data. Under the assumption that all words are equally likely, the average numbers of words per pronunciation in the MPron and SPron cases are almost identical. However, in the unigram case a difference of almost 0.2 can be observed. Note that the average number of pronunciations per word is almost identical in the uniform case and, not unexpectedly, is higher when weighting with the unigram model. The pronunciation selection in the SPron dictionary yields a lower number. These effects contribute to a 7% decrease in pronunciation variant perplexity (using unigram weighting).
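As one concrete instance of the perplexities in Table 4, the sketch below computes 2^H(R|W), the average effective number of pronunciations per word, from per-word pronunciation probabilities. The data structures are illustrative.

```python
import math

def pron_variant_perplexity(pron_probs, unigram=None):
    """2**H(R|W) for a dictionary with pronunciation probabilities.

    pron_probs: {word: {pronunciation: P(r|w)}}
    unigram:    {word: P(w)}; uniform over words if None
    """
    words = list(pron_probs)
    entropy = 0.0
    for w in words:
        p_w = unigram.get(w, 0.0) if unigram else 1.0 / len(words)
        entropy -= p_w * sum(p * math.log2(p)
                             for p in pron_probs[w].values() if p > 0.0)
    return 2.0 ** entropy

# A word with a single pronunciation contributes zero entropy, so an
# SPron dictionary without silence variants would give perplexity 1.0.
```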
Table 5 shows a comparison of WER performance with the use of pronunciation probabilities. The overall performance of the SPron and MPron models is similar, apart from the more difficult Swbd2cell set. Overall the best performance is achieved using the SPron models with probabilities for the silence variants.

4.2.3. Discriminative training

So far all models were trained using the maximum likelihood (ML) criterion. In recent years considerable improvements were shown using discriminative training schemes such as maximum mutual information estimation or minimum phone error (MPE) training (see Woodland et al., 2002). As the change in the training criterion was found to have substantial effects on pronunciation modelling in the past, a set of experiments was conducted to investigate these effects. Table 6 shows results using configurations identical to those described above, apart from employing MPE training. It is clear that the overall benefit from using SProns is reduced. It is interesting to note that word error rates are poorer on the low error rate portions of the data, such as Swbd1, and significantly better on difficult data such as Swbd2cell.

4.2.4. HMS-HMMs

Another issue already addressed in experiments conducted on the WSJ corpus (see Section 4.1) is the use of SPron dictionaries in conjunction with implicit pronunciation modelling techniques such as HMS-HMMs. For this purpose a smaller part of the Switchboard 1 corpus, the 18 h MiniTrain set, was used for model training.
Table 5
%WERs on dev01sub for models trained using the ML training criterion

Dictionary   PrProb   Swbd1   Swbd2   Swbd2cell   Average %WER
MPron                 22.4    39.4    39.0        33.5
SPron                 21.6    37.9    37.8        32.3
MPron        yes      21.5    37.9    38.1        32.4
SPron        yes      21.3    37.7    37.4        32.0

PrProb denotes the use of pronunciation probabilities. In the case of a SPron dictionary this denotes probabilities for silence at word ends.

Table 6
%WERs on dev01sub for models trained using the MPE training criterion

Dictionary   PrProb   Swbd1   Swbd2   Swbd2cell   Average %WER
MPron                 19.3    35.6    35.8        30.1
SPron                 19.4    35.2    35.1        29.8
MPron        yes      19.1    35.0    35.6        29.8
SPron        yes      19.6    34.9    34.9        29.7

PrProb denotes the use of pronunciation probabilities. In the case of a SPron dictionary this denotes probabilities for silence at word ends.
Table 7
%WER results on Switchboard (MTtest + WS96DevSub) using models trained and tested with dictionaries containing one (SPron) or multiple (MPron) pronunciations

Model set   Dictionary   Average %WER
HMM         MPron        45.04
HMM         SPron        44.89
HMS-HMM     MPron        43.38
HMS-HMM     SPron        43.12
Tests were conducted on two 30 min Switchboard test sets (MTtest and WS96DevSub). The HMM baseline system has 2954 states and 12 mixture components, and the ML criterion was used in training. Table 7 shows WER results obtained by rescoring trigram lattices. Whereas the performance of the MPron and SPron models is very similar, a slight advantage is observed in the SPron case. This result is similar to the observations made on the WSJ corpus.
5. Conclusions

A method for constructing a dictionary with only one pronunciation entry per word from a good reference dictionary was presented. The pronunciation selection process operates on a frequency basis for words observed in the training data, and the decisions made in this process are used to derive models for words not observed in the training data. The performance of this dictionary in terms of word error rates was investigated both on well articulated read speech and on conversational speech. In both cases the use of the proposed dictionary gave comparable or better performance than the standard baseline system using multiple pronunciation variants. This suggests that implicit modelling of pronunciation variation can perform as well as or better than the use of explicit a priori knowledge, given a good starting point. A closer investigation revealed that in general better performance was obtained with the SPron dictionary on difficult data, while the MPron dictionary gave better performance on data with low error rates.
Interactions with other pronunciation modelling techniques were investigated. In particular, reduced performance gains were observed in conjunction with pronunciation probabilities. Despite the added uncertainty on the Gaussian mixture level, SPron dictionaries still give slightly better performance when the minimum phone error criterion is used in model training. It was found that the step toward a single pronunciation variant per word can improve the performance of implicit modelling techniques. This was shown for the case of HMS-HMMs on two corpora, Wall Street Journal and Switchboard. In both cases the final performance was slightly better for the SPron case despite the different starting points.

Multiple pronunciation variants are widely used in speech recognition systems. Nevertheless their use adds ambiguity that is often hard to quantify. The experimental evidence suggests that a dictionary with canonical pronunciations can yield equal or better performance than one that includes variants. This provides a vital simplification for investigations into automatic acoustic unit selection or discriminative dictionary construction. Both topics will play an essential role in a step toward automatic generation of task-optimised dictionaries from a generic representation. In this paper a symbolic distance was used to compare pronunciations, which depends heavily on the amount and type of data available. In order to allow steps toward discriminative selection, appropriate smoothing will be required, for example using difference metrics based on HMMs.

The method presented in this paper has been used successfully since 2002 in the Cambridge University HTK systems for transcription of conversational and broadcast news data (Woodland et al., 2002). Models based on SPron dictionaries give comparable or better word error rates, and the output of SPron-based systems was found to be very useful for system combination.
Acknowledgement

The author would like to thank BBN for providing the MiniTrain training and MTtest test set definitions and IBM for an SUR equipment
award. This work was in part supported by DARPA grant MDA972-02-1-0013. The paper does not necessarily reflect the position or the policy of the US Government and no official endorsement should be inferred.

References

Bahl, L.R., Bellegarda, J.R., de Souza, P.V., Gopalakrishnan, P.S., Nahamoo, D., Picheny, M.A., 1991. A new class of fenonic Markov models for large vocabulary continuous speech recognition. In: Proceedings of ICASSP91, Vol. 1, pp. 177–200.
Bates, R., Ostendorf, M., 2002. Modelling pronunciation variation in conversational speech using prosody. In: Proceedings of ITRW Pronunciation Modelling and Lexicon Adaptation Workshop, pp. 42–47.
Bellegarda, J.R., Nahamoo, D., 1990. Tied mixture continuous parameter modeling for speech recognition. IEEE Trans. ASSP 38 (12), 2033–2045.
Byrne, W., Finke, M., Khudanpur, S., McDonough, J., Nock, H.J., Riley, M., Saraçlar, M., Wooters, C., Zavaliagkos, G., 1998. Pronunciation modelling using a hand-labelled corpus for conversational speech recognition. In: Proceedings of ICASSP98, Vol. 1, pp. 313–316.
Cremelie, N., Martens, J.-P., 1997. Automatic rule-based generation of word pronunciation networks. In: Proceedings of EUROSPEECH97, pp. 2459–2462.
Finke, M., Waibel, A., 1997. Speaking mode dependent pronunciation modelling in large vocabulary continuous speech recognition. In: Proceedings of EUROSPEECH97, Vol. 5. Rhodes, pp. 2379–2382.
Fosler, E., Weintraub, M., Wegmann, S., Kao, Y.-H., Khudanpur, S., Galles, C., Saraçlar, M., 1996. Automatic learning of word pronunciation from data. In: Proceedings of ICSLP96.
Gauvain, J.-L., Lamel, L.F., Adda, G., Adda-Decker, M., March 1994. The LIMSI Nov93 WSJ system. In: Proceedings of 1994 ARPA Spoken Language Technology Workshop. Plainsboro, NJ, pp. 125–128.
Greenberg, S., 1996. The Switchboard transcription project. 1996 LVCSR summer workshop technical reports, Center for Language and Speech Processing, Johns Hopkins University.
Greenberg, S., 1998. Speaking in shorthand—a syllable-centric perspective for understanding pronunciation variation. In: Proceedings of ESCA Workshop on Modelling Pronunciation Variation for Automatic Speech Recognition. Kerkrade, Netherlands, pp. 47–56.
Hain, T., 2001. Hidden model sequence models for automatic speech recognition. Ph.D. thesis, Cambridge University.
Hain, T., Woodland, P.C., September 1999. Dynamic HMM selection for continuous speech recognition. In: Proceedings of EUROSPEECH99, Vol. 3, pp. 1327–1330.
Hain, T., Woodland, P.C., Niesler, T.R., Whittaker, E.W.D., April 1999. The 1998 HTK system for transcription of conversational telephone speech. In: Proceedings of ICASSP99, pp. 57–60.
Hain, T., Woodland, P.C., Evermann, G., Povey, D., May 2000. The CU HTK March 2000 Hub5 transcription system. In: Proceedings of 2000 NIST Speech Transcription Workshop. College Park, Maryland.
Hain, T., Woodland, P.C., Evermann, G., Povey, D., 2001. New features in the CU-HTK system for transcription of conversational telephone speech. In: Proceedings of ICASSP01, pp. 57–60.
Huang, X.D., Jack, M.A., 1989. Semi-continuous hidden Markov models for speech signals. Computer Speech and Language 3, 239–251.
Humphries, J.J., October 1997. Accent modelling and adaptation in automatic speech recognition. Ph.D. thesis, Cambridge University.
Hwang, M.-Y., Huang, X., Alleva, F., 1993. Predicting unseen triphones with senones. In: Proceedings of ICASSP93, Vol. 2, pp. 311–314.
Luo, X., Jelinek, F., April 1999. Probabilistic classification of HMM states for large vocabulary continuous speech recognition. In: Proceedings of ICASSP99, pp. 2044–2047.
Ma, K., Zavaliagkos, G., Iyer, R., 1998. BBN pronunciation modelling. Presented at the 9th Conversational Speech Recognition Workshop, MITAGS, Linthicum Heights, Maryland.
Nock, H.J., Young, S.J., 1998. Detecting and correcting poor pronunciations for multiword units. In: Proceedings of ESCA Workshop on Modelling Pronunciation Variation for Automatic Speech Recognition. Kerkrade, Netherlands, pp. 85–90.
Ostendorf, M., 1999. Moving beyond the "Beads-on-a-String" model of speech. In: Proceedings of 1999 IEEE Workshop on Automatic Speech Recognition and Understanding, Vol. 1, pp. 79–83.
Pallett, D., Fiscus, J.G., Fisher, W., Garofolo, J.S., Lund, B.A., Martin, A., 1995. 1994 benchmark tests for the ARPA spoken language program. In: Proceedings of ARPA Workshop on Spoken Language Systems Technology, pp. 3–5.
Printz, H., Olsen, P., 2000. Theory and practice of acoustic confusability. In: Proceedings of ISCA ITRW ASR2000 Workshop.
Riley, M., Byrne, W., Finke, M., Khudanpur, S., Ljolje, A., McDonough, J., Nock, H.J., Saraçlar, M., Wooters, C., Zavaliagkos, G., 1999. Stochastic pronunciation modelling from hand-labelled phonetic corpora. Speech Communication 29, 209–224.
Saraçlar, M., Nock, H.J., Khudanpur, S., 2000. Pronunciation modelling by sharing Gaussian densities across phonetic models. Computer Speech and Language 14, 137–160.
Stolcke, A., Bratt, H., Butzberger, J., Franco, H., Gadde, V.R.R., Plauche, M., Richey, C., Shriberg, E., Sonmez, K.,
Weng, F.-L., Zhen, J., May 2000. The SRI March 2000 Hub-5 conversational speech transcription system. In: Proceedings of 2000 NIST Speech Transcription Workshop. College Park, Maryland.
Strik, H., Cucchiarini, C., 1999. Modelling pronunciation variation for ASR: a survey of the literature. Speech Communication 29, 225–246.
Weintraub, M., Taussig, K., Hunicke-Smith, K., Snodgrass, A., 1996. Effect of speaking style on LVCSR performance. In: Proceedings of ICSLP96, pp. S16–S19.
Woodland, P.C., Odell, J.J., Valtchev, V., Young, S.J., 1995. The development of the 1994 HTK large vocabulary speech recognition system. In: Proceedings of ARPA Workshop on Spoken Language Systems Technology, pp. 104–105.
Woodland, P.C., Evermann, G., Gales, M.J., Hain, T., Liu, A., Moore, G., Povey, D., Wang, L., 2002. CU-HTK April 2002 Switchboard system. In: Proceedings of Rich Transcription Workshop. Vienna, VA.
Wooters, C., Stolcke, A., 1994. Multiple-pronunciation lexical modeling in a speaker-independent speech understanding system. In: Proceedings of ICSLP94, Vol. 3, pp. 1363–1367.
Young, S.J., Woodland, P.C., 1994. State clustering in hidden Markov model-based continuous speech recognition. Computer Speech and Language 8, 369–383.
Young, S.J., Odell, J.J., Woodland, P.C., 1994. Tree-based state tying for high accuracy acoustic modelling. In: Proceedings of 1994 ARPA Human Language Technology Workshop. Morgan Kaufmann, pp. 307–312.