IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 8, NO. 2, MARCH 2000
Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures Eduardo Lleida, Member, IEEE and Richard C. Rose, Member, IEEE
Abstract—This paper introduces a set of acoustic modeling and decoding techniques for utterance verification (UV) in hidden Markov model (HMM) based continuous speech recognition (CSR). Utterance verification in this work implies the ability to determine when portions of a hypothesized word string correspond to incorrectly decoded vocabulary words or out-of-vocabulary words that may appear in an utterance. This capability is implemented here as a likelihood ratio (LR) based hypothesis testing procedure for verifying individual words in a decoded string. Two UV techniques are presented here. The first is a procedure for estimating the parameters of UV models during training according to an optimization criterion which is directly related to the LR measure used in UV. The second technique is a speech recognition decoding procedure where the "best" decoded path is defined to be that which optimizes a LR criterion. These techniques were evaluated in terms of their ability to improve UV performance on a speech dialog task over the public switched telephone network. The results of an experimental study presented in the paper show that LR based parameter estimation results in a significant improvement in UV performance for this task. The study also found that the LR based decoding procedure, when used in conjunction with models trained using the LR criterion, can provide as much as an 11% improvement in UV performance when compared to existing UV procedures. Finally, it was also found that the performance of the LR decoder was highly dependent on the use of the LR criterion in training acoustic models. Several observations are made in the paper concerning the formation of confidence measures for UV and the interaction of these techniques with statistical language models used in ASR.

Index Terms—Acoustic modeling, confidence measures, discriminative training, large vocabulary continuous speech recognition, likelihood ratio, utterance verification.
I. INTRODUCTION
In automatic speech recognition applications it is often necessary to provide a mechanism for verifying the accuracy of portions of recognition hypotheses. This paper describes utterance verification (UV) procedures for hidden Markov model (HMM) based continuous speech recognition that are based on a likelihood ratio (LR) criterion. Utterance verification is most often considered as a hypothesis testing problem. Existing techniques rely on a speech recognizer to produce a hypothesized word or string of words along with hypothesized word boundaries obtained through Viterbi segmentation of the utterance. A measure of confidence is then assigned to the hypothesized string, and the hypothesized word labels are accepted or rejected by comparing the confidence measure to a decision threshold. LR based UV procedures generally form this confidence measure as the ratio of the target hypothesis HMM model likelihood to an alternate hypothesis model likelihood. It is therefore further assumed that there exists an alternate hypothesis model that is used in the likelihood ratio test. The definition of the alternate hypothesis model, how its parameters are trained, and what its role is in utterance verification have been investigated in many different contexts. The techniques described in this paper address several of the shortcomings associated with this general class of utterance verification techniques.

The first shortcoming that is addressed is the fact that UV is implemented as a "two-pass" procedure, where a maximum likelihood (ML) decoder produces a string hypothesis which is then verified by a LR based decision rule. It is argued in Section II that there is no reason why the decoder itself cannot be configured to produce a hypothesized string that optimizes a LR criterion. A "single-pass" approach to utterance verification is proposed [8]. In this approach, instead of finding an optimum sequence of HMM states under a maximum likelihood criterion, the speech recognition decoder is designed to identify the state sequence that directly optimizes a LR criterion. As a result, the decoder itself produces recognition hypotheses which optimize a LR based confidence measure, and the scores produced by the decoder can themselves be combined to form a confidence measure for use in the hypothesis test.

Manuscript received June 13, 1997; revised November 2, 1998. This work was supported in part under a personal grant from the DGICYT and under Grants TIC95-0884-C04-04 and TIC95-1022-C05-02 from CICYT, Spain. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Wu Chou. E. Lleida is with the Centro Politécnico Superior, University of Zaragoza, Zaragoza, Spain. R. C. Rose is with AT&T Laboratories—Research, Florham Park, NJ 07932 USA. Publisher Item Identifier S 1063-6676(00)01724-7.
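As a toy illustration of how the two criteria can disagree, consider two candidate decodings scored under both a maximum likelihood criterion and a likelihood ratio criterion. All scores below are invented for illustration and are not taken from the system described in this paper:

```python
# Toy scores for two candidate decodings of the same utterance:
# per-frame log-likelihoods under the hypothesized target models and
# under a shared alternate model. All numbers are invented.
log_like = {"path_A": [-1.0, -1.2, -1.1], "path_B": [-1.5, -1.4, -1.3]}
log_alt  = {"path_A": [-1.1, -1.2, -1.0], "path_B": [-3.0, -2.9, -3.1]}

# maximum likelihood criterion: best accumulated target log-likelihood
ml_best = max(log_like, key=lambda p: sum(log_like[p]))

# likelihood ratio criterion: best accumulated log likelihood ratio
lr_best = max(log_like,
              key=lambda p: sum(c - a for c, a in zip(log_like[p], log_alt[p])))

print(ml_best, lr_best)  # path_A path_B
```

Path A has the higher likelihood, but path B has the higher likelihood ratio because the alternate model explains it poorly; a decoder built on the LR criterion would therefore prefer B.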
It must be acknowledged, however, that it is difficult or impossible to make any general statements concerning the relationship between the decoded state sequence obtained from a LR based decoder and that obtained from a ML decoder. There are additional issues beyond the optimization criterion used in the decoder. The HMM acoustic model training procedures, the HMM model topology, and simplifying assumptions made in implementing the decoder can all have a significant impact on ASR performance. As a result, we are careful not to make any general claims as to the comparative ASR performance of the LR decoder and the ML decoder. The results presented in the paper are concerned primarily with utterance verification performance, in an attempt to demonstrate empirically that the algorithms and architectures presented here are genuinely appropriate for that purpose. Issues relating to the design of the LR decoder and assumptions made for improving its efficiency and robustness are discussed in Section II.

A second shortcoming is that hypothesis testing theory in general provides no guidance for either choosing the form of
the densities used in the LR test or for estimating their parameters. The theory assumes that these densities are known, and deals mainly with the problem of determining whether or not a given data set is well explained by the densities [6]. We attempt to overcome this lack of a criterion for specifying the model parameters used in the LR test by using a training procedure for estimating model parameters which also directly optimizes a likelihood ratio criterion [14]. There have been previous attempts at estimating model parameters for word spotting by maximizing a likelihood ratio criterion [13], [18]. There has also been more recent work in applying this class of techniques to more constrained tasks [12], [17]. The goal of the training procedure that is described here is to increase the average value of the LR for correctly hypothesized vocabulary words and to decrease it for false alarms; an iterative discriminative training algorithm is used for this purpose. Finally, the third shortcoming of UV techniques is their susceptibility to modeling inaccuracies that are associated with HMM acoustic models. It is often the case that local mismatches between the speech utterance and the HMM model can have a catastrophic effect on the accumulated score used for making a global decision. To mitigate these effects, several word level confidence measures corresponding to non-linear weightings of frame and segment level confidence measures are investigated.

In order to motivate the work described in this paper, it is important to provide some minimal understanding of the context in which utterance verification techniques are applied. It is often the case that human–machine interfaces are configured so that a large percentage of the input utterances are ill-formed.
This is the case for user-initiated human–machine dialog [8], [20], automation of telecommunications services [19], and machine interpretation of human–human dialog [5], [14]. Utterance verification in this context implies the ability to detect legitimate vocabulary words in an utterance that is assumed to contain words or phrases which are not explicitly modeled in the speech recognizer. However, even when input utterances tend to be well-formed and contain relatively few out-of-vocabulary (OOV) words, UV techniques can be applied to determine when decoded word hypotheses are correct. These procedures have been shown to improve performance in a number of applications where OOV utterances are relatively rare, including telephone based connected digit and command word recognition [1], [11], [16].

The UV techniques described in this paper are applied to a spontaneous speech database query task over the public switched telephone network. Callers can query the system for information concerning movies playing at local theaters. The spoken queries contained a mixture of well-formed and ill-formed utterances. UV performance as described here therefore reflects the ability of UV techniques not only to reject out-of-vocabulary speech but also to detect speech recognition errors.

The organization of the paper is as follows. Section II describes the utterance verification strategies and introduces the concept of a likelihood ratio decoder and a set of word level confidence measures. A training algorithm based on the likelihood
ratio decoder is presented in Section III. Section IV describes the telephone based speech dialog task and the associated speech corpus used in the experiments. Finally, Section V describes an experimental study of utterance verification performance when the techniques described here are applied to the telephone based speech dialog task.

II. UTTERANCE VERIFICATION

The purpose of this section is to describe a decoding algorithm which optimizes a likelihood ratio criterion. The decoding algorithm is designed to find the best path through a combined space of target and alternate hypothesis HMM model states, and takes the form of a modified Viterbi algorithm. The section has five parts. First, the utterance verification paradigm that is used in this work is described. This includes the definition of the LR test as applied in continuous speech recognition (CSR) and UV performance measures derived under the Neyman–Pearson hypothesis testing formalism. Second, the single-pass utterance verification procedure is introduced and contrasted with the two-pass approach. The third and largest part of this section derives the expression for the decoding algorithm in a manner similar to the derivation of the Viterbi algorithm for maximum likelihood HMM decoding. Fourth, more efficient versions of the decoding procedure are derived by imposing constraints on the relationship between the target hypothesis model and alternate hypothesis model state sequences. Finally, a set of word level confidence measures are defined.

A. Utterance Verification Paradigms

It is assumed that the input to the speech recognizer is a sequence of feature vectors $O = \{o_1, \ldots, o_T\}$ representing a speech utterance containing both within-vocabulary and out-of-vocabulary words.
Borrowing from the nomenclature of the signal detection literature, the within-vocabulary words will be referred to here as belonging to the class of null or "target" hypotheses, and the out-of-vocabulary words will be referred to as "imposters" or as belonging to the class of alternate hypotheses. Incorrectly decoded vocabulary words appearing as substitutions or insertions in the output string from the recognizer will also be referred to as belonging to the class of alternate hypotheses. It is also assumed that the output of the recognizer is a single word string hypothesis $W = \{w_1, \ldots, w_M\}$ of length $M$. Of course, all the discussion in this section can be easily generalized to the problem of verifying one of multiple complete or partial string hypotheses produced as part of an N-best list or word lattice as well.

Under the assumptions of the Neyman–Pearson hypothesis testing framework, both the target hypothesis and alternate hypothesis densities are assumed to be known. In the context of UV, it will be assumed that the target or correct hypothesis model $\lambda_c$ and the alternate model $\lambda_a$ corresponding to a hypothesized vocabulary word are both hidden Markov models. A LR based hypothesis test can then be defined in terms of the null hypothesis $H_0$, that $O$ was generated by the target model $\lambda_c$, and the alternative hypothesis $H_1$, that $O$ was generated by the alternative model $\lambda_a$:

$$LR(O; \lambda_c, \lambda_a) = \frac{P(O \mid \lambda_c)}{P(O \mid \lambda_a)} \;\mathop{\gtrless}^{H_0}_{H_1}\; \tau \quad (1)$$
Fig. 1. Two-pass utterance verification where a word string and associated segmentation boundaries that are hypothesized by a maximum likelihood CSR decoder are verified in a second stage using a likelihood ratio test.
where τ is a decision threshold. Given the target hypothesis probability $P(O \mid \lambda_c)$, which models correctly decoded hypotheses for a given recognition unit, and the alternate hypothesis probability $P(O \mid \lambda_a)$, which models the incorrectly decoded hypotheses, (1) describes a test which accepts or rejects the hypothesis that the observation sequence $O$ corresponds to a legitimate vocabulary word by comparing the LR to a threshold. As the model parameters for a particular unit are not known, they have to be estimated from the training data, assuming the form of the density is known.

While hypothesis testing theory provides guidance for choosing the value of the threshold τ and estimating the probability of error given the probability densities, it does not provide any guidance for dealing with a number of problems that are associated with speech recognition. For example, the observation vectors might be associated with a hypothesized word that is embedded in a string of words. In this case, it is not clear which observation vectors in the continuous utterance should be associated with the hypothesized word. This is a problem that is associated with two-pass utterance verification and is addressed by the decoding algorithm described in this section. An additional problem is the fact that language models are used in CSR as an additional knowledge source to constrain the network search. The effect of the language model is to reduce the average branching factor at each decision node in the search procedure. As a result, a word may be correctly decoded due to a low branching factor at a given decision node even if the acoustic probability $P(O \mid \lambda_c)$ is relatively weak. In those cases where a correct word hypothesis is decoded on the strength of the language model despite a weak acoustic model, the hypothesis test in (1) may prove misleading. This is an important problem which is not directly addressed by the work described in this paper.
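For concreteness, the decision rule in (1) can be sketched in the log domain. The per-frame log-likelihoods and the threshold below are illustrative stand-ins, not values from the system described in this paper:

```python
import math

def lr_test(log_like_target, log_like_alt, tau):
    """Word level likelihood ratio test of (1), in the log domain.

    log_like_target / log_like_alt: per-frame log-likelihoods of the
    hypothesized word under the target and alternate models (plain
    lists here; a real system would obtain them from a Viterbi
    alignment). Returns True (accept the null hypothesis) when the
    duration-normalized log likelihood ratio clears log(tau).
    """
    t = len(log_like_target)
    llr = sum(c - a for c, a in zip(log_like_target, log_like_alt)) / t
    return llr >= math.log(tau)

# toy frame scores: the target model fits this segment clearly better
print(lr_test([-2.0, -1.5, -2.2], [-3.0, -2.8, -3.5], tau=1.5))  # True
```

Accumulating the ratio in the log domain and normalizing by duration keeps the score's dynamic range manageable, a point the confidence measures of Section II-E take further.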
The use of word error rate or word accuracy for measuring performance in ASR assumes that all input words are from within the prespecified vocabulary and that all words are decoded by the recognizer with equal confidence. The measures that are used here to evaluate utterance verification procedures are borrowed from the signal detection field. The most familiar of these measures is the receiver operating characteristic (ROC) curve. There are two types of errors associated with utterance verification systems. Type I errors, or false rejections, correspond to words that are correctly decoded by the recognizer but rejected by the utterance verification process. Type II errors, or
false alarms, correspond to incorrectly decoded word insertions and substitutions which are generated by the recognizer and also accepted by the utterance verification system. Sweeping out a range of threshold settings in (1) results in a range of operating points, where each operating point corresponds to a different trade-off between type I and type II errors. The ROC corresponds to a plot of type I versus type II errors over all threshold settings τ. Another performance measure is used in Section V to describe the sensitivity of the utterance verification procedure with respect to the setting of the threshold τ in (1). This is a plot of the sum of type I and type II errors versus τ.

B. Utterance Verification Procedures

1) Two-Pass Procedure: Utterance verification can be applied as a procedure for verifying whether the observation vectors in an utterance associated with individual word hypotheses generated by a speech recognition decoder correspond to the hypothesized word label. In this case, speech recognition and utterance verification are considered to be two separate procedures. For a continuous utterance, maximum likelihood decoding relies on a set of "recognition" hidden Markov models to produce a sequence of hypothesized word labels and hypothesized word boundaries. Utterance verification is then applied as a second pass, applying a hypothesis test to the word hypotheses produced by the decoder. This procedure is summarized in Fig. 1 and will be referred to here as "two-pass" utterance verification. In general terms, the hypothesis test is based on the computation of the likelihood ratio measure given in (1); however, several empirically determined measures have been applied to improve utterance verification performance. Several of these "confidence measures" are described in Section II-E.
These confidence measures correspond to different ways of integrating local likelihood ratios and are applied in Section V to both the two-pass and the one-pass utterance verification strategies.

2) One-Pass Procedure: There is no fundamental reason why the processes of generating a hypothesized word string and generating confidence measures for verifying hypothesized words must be performed in two stages. Both processes could be performed in a single stage if the ASR decoder were designed to optimize a criterion related to the confidence measure used in UV. It will be shown in this section that one can easily derive a decoding algorithm that optimizes a LR criterion and, as a result, provides LR based word level confidence scores directly. Fig. 2
Fig. 2. One-pass utterance verification where the optimum decoded string is that which directly maximizes a likelihood ratio criterion.
shows a simple block diagram illustrating how the functions of decoding and verification are integrated into a single procedure.

C. Likelihood Ratio Decoder

Given an HMM model $\lambda$ and an observation sequence $O$, it is well known that the Viterbi algorithm can be used to solve the maximum likelihood decoding problem. This simply means that there is a straight-forward mechanism for obtaining the HMM state sequence $S = \{s_1, \ldots, s_T\}$ which maximizes the likelihood of the observations with respect to the model, or

$$S^{ML} = \arg\max_{S} P(O, S \mid \lambda). \quad (2)$$

There are many tutorial references on HMMs that can be consulted for a discussion of issues related to HMM decoding and training, including [10]. It is shown here that, using similar reasoning, one can derive a mechanism which represents a solution to the likelihood ratio decoding problem. If $\lambda_c$ and $\lambda_a$ in (1) are both HMMs, then the LR in (1) can be rewritten as

$$LR(O; \lambda_c, \lambda_a) = \frac{\sum_{S} P(O, S \mid \lambda_c)}{\sum_{Q} P(O, Q \mid \lambda_a)} \quad (3)$$

where $S$ and $Q$ denote state sequences through the target and alternate models, respectively. A discrete state sequence through the model space defined by $\lambda_c$ and $\lambda_a$ can be written as

$$c = \{(s_1, q_1), (s_2, q_2), \ldots, (s_T, q_T)\} \quad (4)$$

where $T$ in (4) is the number of observations in the utterance and $(s_t, q_t)$ is a pair of integer indices representing the target and alternate HMM states at time $t$. The likelihood ratio decoding problem thus corresponds to obtaining the state sequence $c$ that maximizes the likelihood ratio of the data with respect to $\lambda_c$ and $\lambda_a$, or

$$c^{*} = \arg\max_{c} \frac{P(O, S \mid \lambda_c)}{P(O, Q \mid \lambda_a)}. \quad (5)$$

Equation (6) expands the observation probability in terms of the HMM parameters

$$\frac{P(O, S \mid \lambda_c)}{P(O, Q \mid \lambda_a)} = \frac{\pi^{c}_{s_1} b^{c}_{s_1}(o_1) \prod_{t=2}^{T} a^{c}_{s_{t-1} s_t} b^{c}_{s_t}(o_t)}{\pi^{a}_{q_1} b^{a}_{q_1}(o_1) \prod_{t=2}^{T} a^{a}_{q_{t-1} q_t} b^{a}_{q_t}(o_t)}. \quad (6)$$

These parameters include the observation densities $b_j(o_t)$, which correspond to the probability of state $j$ emitting observation vector $o_t$ at time $t$. The HMM parameters also include the state transition probabilities $a_{ij}$ from HMM state $i$ to state $j$ and the initial state probabilities $\pi_i$.

The process of identifying the optimum state sequence in the combined target hypothesis model and alternate hypothesis model state space is illustrated by the diagram in Fig. 3. The optimal state sequence can correspond to an arbitrary path in this space under the constraints that it originates from one of a set of initial states at time $t = 1$ and terminates in a final state at time $t = T$. Fig. 3 also illustrates the case where the optimum target and alternate hypothesis model state sequences are identified separately, as is the case in a two-pass approach. In this case, two independent search procedures are performed through the two dimensional state space lattices along the $(s, t)$ and $(q, t)$ planes in Fig. 3. It is clear that likelihood ratio decoding provides a much larger space of possible solutions at the expense of an exponential increase in computational complexity.

Fig. 3. A possible three-dimensional HMM search space. An example of an optimum state sequence according to the criterion given in (5) is shown as a sequence of indices in the combined target model–alternate model state lattice.

A Viterbi-like decoding algorithm can be derived to solve for the state sequence in (6). This follows immediately after defining the following parameters from (6):

$$\tilde{\pi}_{(i,j)} = \frac{\pi^{c}_{i}}{\pi^{a}_{j}} \quad (7)$$

$$\tilde{a}_{(k,l)(i,j)} = \frac{a^{c}_{ki}}{a^{a}_{lj}} \quad (8)$$

$$\tilde{b}_{(i,j)}(o_t) = \frac{b^{c}_{i}(o_t)}{b^{a}_{j}(o_t)}. \quad (9)$$

The expression for finding the best path through the combined state space lattice shown in Fig. 3 can be obtained from (6) in terms of these new parameters as

$$c^{*} = \arg\max_{c} \; \tilde{\pi}_{(s_1, q_1)} \tilde{b}_{(s_1, q_1)}(o_1) \prod_{t=2}^{T} \tilde{a}_{(s_{t-1}, q_{t-1})(s_t, q_t)} \tilde{b}_{(s_t, q_t)}(o_t) \quad (10)$$

where $\tilde{b}_{(i,j)}(o_t)$ is referred to here as the frame likelihood ratio. Equation (10) provides a way to define a one-pass
search algorithm for recognition/verification. Following the same inductive logic that is used in the Viterbi algorithm, we can define the quantity $\delta_t(i, j)$ as the highest probability obtained for a single state sequence through our three-dimensional lattice over the first $t$ observations and terminating in target model state $i$ and alternate hypothesis model state $j$. Just as in the standard Viterbi algorithm, this probability can be calculated by induction

$$\delta_t(i, j) = \max_{k, l} \left[ \delta_{t-1}(k, l) \, \tilde{a}_{(k,l)(i,j)} \right] \tilde{b}_{(i,j)}(o_t), \qquad 1 \leq i \leq N_c, \; 1 \leq j \leq N_a \quad (11)$$

where $N_c$ and $N_a$ are the number of states in the target and alternate hypothesis models, respectively. This new criterion allows information concerning the alternative hypothesis to be introduced directly in the decoder. However, some constraints on the search space must be imposed in practice to achieve manageable computational complexity.

D. Efficient LR Decoder

There are two issues that must be addressed if the LR decoding algorithm of the previous section is to be applicable to actual speech recognition tasks. The first issue is computational complexity. The unconstrained search procedure specified by (11) implies a search through an $N_c \times N_a$ dimensional state space, representing an increase in complexity by a factor of $N_a$ over ML decoding. The following discussion suggests that constraints can be applied to the relationship between target hypothesis model and alternate hypothesis model state indices. The result of this discussion is the definition of a more efficient decoding algorithm which requires evaluating a far smaller number of state sequences than would be evaluated in (11).

The second issue is the definition of the alternate hypothesis model. There are many hazards associated with using a LR criterion in a decoder. We have to be concerned with the wide dynamic range that is generally associated with any likelihood ratio. It is also important that the alternate hypothesis model provide a good representation of imposter events in the decoder. These issues are discussed below.

1) LR Decoding Constraints: The implications of some practical decoding constraints for reducing the computational complexity of the likelihood ratio decoder are discussed here. The following are examples of simple constraints that can be applied:

Unit Level Constraint — For each unit corresponding to a sub-word or word model, the target and alternate model components of the three dimensional path can both be constrained to start at a common unit start time and end at a common unit end time.
This is equivalent to saying that the target model and alternate model must occupy their unit initial states and unit final states at the same time instants. With this constraint, the unconstrained state sequence illustrated in Fig. 3 can be segmented into concatenated unit level state sequences:

$$c = \{c_1, c_2, \ldots, c_U\} \quad (12)$$

$$c_u = \{(s_{t_u}, q_{t_u}), \ldots, (s_{t_{u+1}-1}, q_{t_{u+1}-1})\} \quad (13)$$

where $c_u$ corresponds to the state sequence for unit $u$.

State Level Constraint — State level constraints can also be imposed to further restrict the relationship between the target and alternate model state sequences for a given unit. This results in a further reduction in the number of candidate sequences that need be evaluated during search. In posing these constraints, it is assumed that the topology of the HMM models $\lambda_c$ and $\lambda_a$ remains fixed. However, the network search can be constrained so that a correspondence exists between the decoded target model and alternate model state indices, $s_t$ and $q_t$, at time $t$:

$$s_t - \gamma_1 \leq q_t \leq s_t + \gamma_2 \quad (14)$$

where $\gamma_1$ and $\gamma_2$ are constants which define the dimensions of the search window.

Fig. 4. An example of how state level constraints, which constrain the relationship between target and alternate hypothesis model state sequences, may be applied to reducing the complexity of the likelihood ratio decoder.

Fig. 4 illustrates how the state level constraints of (14) can reduce the complexity of the search procedure when $\gamma_1 = \gamma_2 = 1$. Equation (14) states that, given a time $t$ and a state index pair, the next state index in the $q$ plane must be equal to or greater than the previous one but not greater than the state in the $s$ plane plus an offset $\gamma_2$. The same condition holds for the next state in the $s$ plane but with an offset of $\gamma_1$. The vertices of the shaded rectangles in Fig. 4 show the allowed states in the three dimensional path when $\gamma_1 = \gamma_2 = 1$. Starting at time $t$ in the state (1,1), the allowed transitions are (1,1), (1,2), (2,1), and (2,2). Suppose that at time $t+1$ the transition is to the state (2,1); in this case the allowed transitions from this state, applying (14), are (2,1), (2,2), and (2,3). In this way, the maximum difference between state indices is one.

State-level constraints can be defined which result in minimal extra decoder complexity beyond that of the ML decoder. Suppose that $\lambda_c$ and $\lambda_a$ are HMMs with identical topologies, and the state constraint is defined so that $q_t = s_t$ for each $t$. In this way, a single optimum state sequence is decoded by applying the state constraint to (6)

$$c^{*} = \arg\max_{S} \; \tilde{\pi}_{(s_1, s_1)} \tilde{b}_{(s_1, s_1)}(o_1) \prod_{t=2}^{T} \tilde{a}_{(s_{t-1}, s_{t-1})(s_t, s_t)} \tilde{b}_{(s_t, s_t)}(o_t). \quad (15)$$

As a result, the process of identifying the optimum state sequence can take the form of a modified Viterbi algorithm where the recursion at the frame level is defined as

$$\delta_t(i) = \max_{k} \left[ \delta_{t-1}(k) \, \tilde{a}_{(k,k)(i,i)} \right] \tilde{b}_{(i,i)}(o_t). \quad (16)$$

The accumulated path score $\delta_T(i)$ obtained in the Viterbi algorithm corresponds to a measure of confidence in the path terminating in state $i$ at time $T$. Note that (16) is equivalent to the standard Viterbi algorithm applied to a modified HMM with transition probabilities $\tilde{a}_{(k,k)(i,i)}$, initial state probabilities $\tilde{\pi}_{(i,i)}$, and observation densities $\tilde{b}_{(i,i)}(o_t)$. The algorithm of (16) has been used in the experiments described in Section V for two purposes. The first is to obtain string hypotheses decoded according to a likelihood ratio criterion. The second purpose is to "rescore" utterance segmentations obtained from a maximum likelihood decoder for the purpose of providing improved confidence measures.

2) Definition of Alternative Models: The alternative hypothesis model has two roles in utterance verification. The first is to reduce the effect of sources of variability on the confidence measure. If the probabilities of the null hypothesis model and the alternative hypothesis model are similarly affected by some systematic variation in the observations, then forming the likelihood ratio should cancel the effects of the variability. The second role of the alternate model is to represent the incorrectly decoded hypotheses that are frequently confused with a given lexical item. The issues relating to how the structure of the alternate hypothesis model is defined and how the parameters of the model are estimated have yet to be addressed. The structure used for the alternate model in this work is motivated here, and the techniques used for estimating alternate model parameters are described in Section III. It is important to note that, since the single-pass decoding procedure produces an output string which maximizes a likelihood ratio, the alternate hypothesis model affects not only the hypothesis test performed as part of the utterance verification procedure, but also the accuracy of the decoded string.
If there is any portion of the acoustic space that the alternate model does not describe well, speech recognition performance can be reduced. As in any hypothesis testing problem, it is difficult to give precise guidelines for choosing the form of the alternate hypothesis model when the true distributions are unknown. However, one can say that the following should be true. First, the alternate
hypothesis distribution must somehow "cover" the entire space of out-of-vocabulary lexical units. Observation vectors corresponding to outliers of the alternate hypothesis distribution can result in dramatic swings of the LR, which could in turn result in decoding errors. Second, if OOV utterances that are easily confused with the vocabulary words are to be detected, the alternate hypothesis model must provide a more detailed representation of the utterances that are likely to be decoded as false alarms for individual vocabulary words.

One way to satisfy the conditions outlined above is to define the alternate hypothesis probability as a linear combination of two different models

$$b^{a}(o_t) = w_B \, b^{B}(o_t) + w_I \, b^{I}(o_t) \quad (17)$$

where $w_B$ and $w_I$ are linear weights. The purpose of $\lambda_B$, referred to here as the background alternative model, is to provide a broad representation of the feature space. This broad representation serves to reduce the dynamic range of the likelihood ratio and satisfies the condition mentioned above of "covering" the larger space of input utterances. In the experiments described in Section V, a single background alternate hypothesis model is shared amongst all "target" HMM models. The purpose of $\lambda_I$, referred to here as an imposter alternate hypothesis model, is to provide a detailed model of the decision regions associated with each individual HMM. Imposter models are trained to represent individual words or sub-word units, thus satisfying the above condition that $\lambda_I$ provide a detailed representation of errors associated with particular words.

E. Confidence Measures

The issue of modeling inaccuracies associated with HMM acoustic models was alluded to in Section I. It was suggested there that modeling errors may result in extreme values in local likelihood ratios which may exert undue influence at the word or phrase level.
In order to minimize these effects, we investigated several word level likelihood ratio based confidence measures that can be computed using a non-uniform weighting of sub-word level confidence measures. These word level measures can be applied to the likelihood ratio scores produced during the second pass of the two-pass UV procedure in Fig. 1 or to the scores obtained directly from the one-pass procedure illustrated in Fig. 2.

Let $CM(u_l)$ and $CM(w_j)$ be the confidence measures for phonetic unit $u_l$ and word $w_j$, respectively. A sentence $W$ is composed of a sequence of words $W = \{w_1, \ldots, w_M\}$, where $w_j$ is the $j$th word in $W$, and a word is composed of a phonetic baseform $w_j = \{u_1, \ldots, u_{L_j}\}$, where $u_l$ is the $l$th sub-word unit in $w_j$. A sub-word unit based confidence measure can be computed from the likelihood ratio between the correct and alternative hypothesis models. The unweighted unit level likelihood ratio score for a unit $u_l$ decoded over a segment $t_s, \ldots, t_e$ can be obtained directly from the LR decoder or computed from the second pass of the two pass decoder as
$$LR(u) = \frac{1}{t_e - t_s + 1} \log \frac{P\left( X_{t_s}^{t_e} \mid \lambda_c \right)}{P\left( X_{t_s}^{t_e} \mid \lambda_a \right)} \qquad (18)$$
where $t_e - t_s + 1$ is the number of frames in the decoded segment. However, as with any likelihood ratio, $LR(u)$ can exhibit a wide dynamic range, which is undesirable if $LR(u)$ is used to form a larger word level score. One way to limit the dynamic range of the sub-word confidence measure is to use a sigmoid function

$$c(u) = \frac{1}{1 + \exp\left( -\alpha_u \left[ LR(u) - \beta_u \right] \right)} \qquad (19)$$

where $\alpha_u$ defines the slope of the function and $\beta_u$ is a shift. Using the sigmoid, the range of the sub-word confidence measure is compressed to fall within the interval [0,1]. While the notation in (19) suggests that a separate function could be defined for each unit by specifying separate $\alpha_u$ and $\beta_u$, only a single sigmoid is used for the experiments described in Section V.

In this work, it is assumed that the system produces a decoded word string where a confidence measure is assigned by the system to each hypothesized word. There are many different ways that word level confidence measures can be formed by combining sub-word confidence scores. It is important to prevent extreme values of the sub-word level likelihood ratio scores from dominating the word level confidence measures. These extreme values can often occur when a good local match has occurred in the decoding process even when the overall utterance has been incorrectly recognized. As a result, the best word level confidence scores tend to assign reduced weight to those extremes in sub-word level scores. Word level confidence measures corresponding to the arithmetic and geometric means of both the unweighted unit level likelihood ratio scores and the sigmoid weighted unit level likelihood ratio scores are compared. The following measures are defined for a word $w$ composed of $N$ sub-word units:

$$C_A(w) = \frac{1}{N} \sum_{n=1}^{N} LR(u_n) \qquad (20)$$

$$C_G(w) = \left[ \prod_{n=1}^{N} LR(u_n) \right]^{1/N} \qquad (21)$$

$$C_A^s(w) = \frac{1}{N} \sum_{n=1}^{N} c(u_n) \qquad (22)$$

$$C_G^s(w) = \left[ \prod_{n=1}^{N} c(u_n) \right]^{1/N} \qquad (23)$$

where $C_A(w)$ and $C_G(w)$ are the arithmetic and geometric means of the unweighted sub-word level confidence scores, and $C_A^s(w)$ and $C_G^s(w)$ are the arithmetic and geometric means of the sigmoid weighted sub-word scores [9]. It was noted in anecdotal experiments that the geometric average tended to have a similar effect as the confidence measure

$$C_\eta(w) = \left[ \frac{1}{N} \sum_{n=1}^{N} c(u_n)^{-\eta} \right]^{-1/\eta} \qquad (24)$$

defined in [7]. This corresponds to a weighted average of the inverse of the sub-word confidence scores, which approximates a min for large positive values of $\eta$ and approximates a max for large negative values of $\eta$.

III. LIKELIHOOD RATIO-BASED TRAINING

A training technique is described for adjusting the parameters $\lambda_c$ and $\lambda_a$ in the likelihood ratio test to maximize a criterion that is directly related to (1). The notion behind the method is that adjusting the model parameters to increase this confidence measure on training data will provide a better measure of confidence for verifying word hypotheses during the normal course of operation for the service. The goal of the training procedure is to increase the average value of the likelihood ratio for correct hypotheses and to decrease its average value for false alarms. This section begins by deriving a log likelihood ratio based cost function for estimating model parameters. Update equations for the model parameters obtained by optimizing this cost function using a gradient descent procedure are then derived. Finally, issues relating to the implementation of the training procedure are discussed. In particular, since the estimated models will be used in the single pass decoding/verification strategy, we discuss how the estimation procedure is based on this one-pass strategy.

A. Likelihood Ratio Based Cost Function

LR based training is a discriminative training algorithm that is based on a cost function which approximates a log likelihood ratio. In practice, the distance measure which underlies the cost function is approximated for a hypothesized unit by first obtaining an optimum sequence of states using the single pass decoding strategy under the assumptions used in obtaining (16). A frame based distance defined for each state transition in the sequence is given by

$$d_t(u) = \log b_{q_t}^{c}(x_t) - \log b_{q_t}^{a}(x_t) \qquad (25)$$

where $b_{q_t}^{c}$ and $b_{q_t}^{a}$ are the target and alternate hypothesis observation densities in state $q_t$ of the optimum state sequence. The segment based distance is obtained by averaging the frame based distances as

$$D(u) = \frac{1}{t_e - t_s + 1} \sum_{t = t_s}^{t_e} d_t(u) \qquad (26)$$

where $t_s$ and $t_e$ are the initial and final frame of the speech segment decoded as unit $u$.

A gradient descent procedure is used to iteratively adjust the model parameters as new utterances are presented to the training procedure [14]. A smooth cost function corresponding to the unit level distance in (26) that is differentiable with respect to our model parameters is required [7]. Define the cost function for unit $u$ using a sigmoid function

$$\ell(u) = \frac{1}{1 + \exp\left( \epsilon(u)\, \gamma \left[ D(u) - \tau \right] \right)} \qquad (27)$$

where the indicator function $\epsilon(u)$, defined in (31), takes the value $+1$ when the decoded unit corresponds to an actual occurrence of the unit and $-1$ when it corresponds to a false alarm, $\gamma$ defines the slope of the function, and $\tau$ is a shift. The discriminative training procedure based on the cost function in
(27) is implemented in four steps. First, one or more unit hypotheses are generated along with the associated unit endpoints by the single-pass decoder. Second, the unit hypothesis decoded for an utterance is labeled as corresponding to an actual occurrence of a vocabulary word (true hit) or a false alarm. Third, the cost function given in (27) is computed using the probabilities estimated from the target and alternate hypothesis models. Finally, a gradient update is performed on the expected cost
$$\hat\lambda_k^{(n+1)} = \hat\lambda_k^{(n)} - \mu_n \left. \frac{\partial E[\ell(u)]}{\partial \lambda_k} \right|_{\lambda_k = \hat\lambda_k^{(n)}} \qquad (28)$$

where $\hat\lambda_k^{(n)}$ is the model computed at the $n$th update of the gradient descent procedure, and $\mu_n$ is a learning rate constant. The gradient in (28) is taken with respect to the model parameters that are to be re-estimated. In the experiments described in Section V, continuous mixture observation density HMMs are used in recognition. The set of re-estimated parameters includes Gaussian means, variances, and mixture weights for both the target and alternate hypothesis HMMs. The expectation in (28) is approximated as the average cost computed over all occurrences of the unit in the training data

$$E[\ell(u)] \approx \frac{1}{N_u} \sum_{i=1}^{N_u} \ell(u_i) \qquad (29)$$

where $N_u$ is the number of occurrences of the unit $u$ in the training set. The average cost function is a soft count of the number of type I and type II errors assuming that the decision threshold is $\tau$. Imposters with scores greater than $\tau$ (type II) and targets with scores lower than $\tau$ (type I) tend to increase the average cost function. Therefore, if we minimize this function, we can reduce the misclassification between targets and imposters. The average cost function is minimized by the training procedure, which means that the average sub-word confidence measure is increased for correct hypotheses and is decreased for false alarms of the unit $u$. It is well known that the gradient of the cost function with respect to the segment level score, $\partial \ell(u)/\partial D(u)$, is centered about $\tau$. As a result, only speech segments with segment level scores in the vicinity of $\tau$ will have a significant effect on updating parameters in the gradient update procedure. For segments that are very far from the decision boundary, the magnitude of the gradient will be very small, resulting in a negligible impact on the re-estimated parameters.

B. HMM Parameter Update Equations

The gradient update equations are derived for the likelihood ratio training procedure outlined in Section III-A. We begin by writing the expression for the gradient of the expected cost given in (29) with respect to a generic model parameter $\lambda_k^m$, where $k$ refers to the $k$th HMM parameter to be updated (mixture weights, means, and variances), and the index $m$ refers to the target or alternative hypothesis HMM models. Then, update equations will be derived for each individual model parameter that is associated with the observation densities for both the target and alternate hypothesis HMM models. Given a segment of speech, the gradient of the cost function with respect to an HMM parameter $\lambda_k^m$ can be written as

$$\frac{\partial \ell(u)}{\partial \lambda_k^m} = \frac{\partial \ell(u)}{\partial D(u)} \frac{\partial D(u)}{\partial \lambda_k^m} = -\epsilon(u)\, \gamma\, \ell(u) \left[ 1 - \ell(u) \right] \frac{\partial D(u)}{\partial \lambda_k^m} \qquad (30)$$

The indicator function $\epsilon(u)$ dictates the direction of the gradient, depending on whether the decoded observations were correctly or incorrectly decoded and whether the re-estimated parameter $\lambda_k^m$ is associated with the target or the alternate hypothesis model

$$\epsilon(u) = \begin{cases} +1, & u \text{ is a correctly decoded unit} \\ -1, & u \text{ is a false alarm.} \end{cases} \qquad (31)$$

The dependence on whether $\lambda_k^m$ belongs to the target or the alternate model enters through the sign of $\partial D(u)/\partial \lambda_k^m$, since $D(u)$ in (26) is the difference of the target and alternate log likelihoods. It will be assumed here that the target and alternate hypothesis models are both continuous mixture observation density HMMs. Hence, the observation densities entering (30) through $\partial D(u)/\partial \lambda_k^m$ represent the complete set of parameters to be re-estimated in the gradient update procedure

$$B^m = \left\{ b_j^m(x_t) \right\}_{j=1}^{N_s} \qquad (32)$$

where

$$b_j^m(x_t) = \sum_{i=1}^{I} c_{j,i}^m\, \mathcal{N}\!\left( x_t; \mu_{j,i}^m, \Sigma_{j,i}^m \right) \qquad (33)$$

is a mixture of Gaussian densities. In (32), $N_s$ is the number of states, and the $i$th Gaussian mixture component of state $j$ is characterized by a mean vector $\mu_{j,i}^m$, a diagonal covariance matrix $\Sigma_{j,i}^m$, and a mixture weight $c_{j,i}^m$.

There are a variety of gradient based HMM model reestimation procedures in the literature, and the reestimation equations that follow are similar to those derived as part of discriminative training procedures [2]-[4]. The gradient is taken with respect to the transformed parameters, the mixture weights $\tilde c_{j,i}^m$ and the mean values $\tilde \mu_{j,i,d}^m$, where the transformations are

$$c_{j,i}^m = \frac{\exp\left( \tilde c_{j,i}^m \right)}{\sum_{i'=1}^{I} \exp\left( \tilde c_{j,i'}^m \right)} \qquad (34)$$

$$\tilde \mu_{j,i,d}^m = \frac{\mu_{j,i,d}^m}{\sigma_{j,i,d}^m} \qquad (35)$$

Hence, the partial derivatives needed in (30) can be obtained by inserting (33) into (30)

$$\frac{\partial b_j^m(x_t)}{\partial \tilde c_{j,i}^m} = c_{j,i}^m \left[ \mathcal{N}\!\left( x_t; \mu_{j,i}^m, \Sigma_{j,i}^m \right) - b_j^m(x_t) \right] \qquad (36)$$

$$\frac{\partial b_j^m(x_t)}{\partial \tilde \mu_{j,i,d}^m} = c_{j,i}^m\, \mathcal{N}\!\left( x_t; \mu_{j,i}^m, \Sigma_{j,i}^m \right) \frac{x_{t,d} - \mu_{j,i,d}^m}{\sigma_{j,i,d}^m} \qquad (37)$$

$$\frac{\partial b_j^m(x_t)}{\partial \sigma_{j,i,d}^m} = c_{j,i}^m\, \mathcal{N}\!\left( x_t; \mu_{j,i}^m, \Sigma_{j,i}^m \right) \left[ \frac{\left( x_{t,d} - \mu_{j,i,d}^m \right)^2}{\left( \sigma_{j,i,d}^m \right)^3} - \frac{1}{\sigma_{j,i,d}^m} \right] \qquad (38)$$
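A gradient of this kind is easy to check numerically. The sketch below implements the standard softmax-weight derivative of a Gaussian mixture, $\partial b(x)/\partial \tilde c_i = c_i [\mathcal{N}_i(x) - b(x)]$, for a one-dimensional two-component mixture and compares it against a central finite difference; all numeric values are illustrative.

```python
import math

def gauss(x, mu, sigma):
    # 1-D Gaussian density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mixture(x, ctilde, mu, sigma):
    c = softmax(ctilde)
    return sum(ci * gauss(x, mi, si) for ci, mi, si in zip(c, mu, sigma))

def grad_ctilde(x, ctilde, mu, sigma):
    # analytic derivative: d b(x) / d ctilde_i = c_i * (N_i(x) - b(x))
    c = softmax(ctilde)
    b = mixture(x, ctilde, mu, sigma)
    return [ci * (gauss(x, mi, si) - b) for ci, mi, si in zip(c, mu, sigma)]

x, ctilde, mu, sigma = 0.3, [0.2, -0.4], [0.0, 1.0], [1.0, 0.5]
g = grad_ctilde(x, ctilde, mu, sigma)
eps = 1e-6
for i in range(len(ctilde)):
    up = list(ctilde); up[i] += eps
    dn = list(ctilde); dn[i] -= eps
    fd = (mixture(x, up, mu, sigma) - mixture(x, dn, mu, sigma)) / (2.0 * eps)
    assert abs(fd - g[i]) < 1e-6   # analytic and numeric gradients agree
```

The same pattern extends to the mean and variance derivatives, with the chain rule applied through the corresponding parameter transformations.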
where

$$\mathcal{N}\!\left( x_t; \mu_{j,i}^m, \Sigma_{j,i}^m \right) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}\, \sigma_{j,i,d}^m} \exp\!\left( -\frac{\left( x_{t,d} - \mu_{j,i,d}^m \right)^2}{2 \left( \sigma_{j,i,d}^m \right)^2} \right) \qquad (39)$$

is the diagonal covariance Gaussian density of dimension $D$. The form of the alternate hypothesis densities was motivated in Section II-D2. Following the example of (17), where the alternate hypothesis distribution was expressed in terms of both a unit specific imposter distribution and a more general background distribution, we derive update equations for the linear weights that appear in

$$b_j^a(x_t) = w_B\, b_j^B(x_t) + w_I\, b_j^I(x_t) \qquad (40)$$

Without loss of generality, both of the weights in (40), $w_B$ and $w_I$, could be state dependent parameters. To compute the gradient of the observation probabilities with respect to the weights, they must be transformed to satisfy the stochastic constraint $w_B + w_I = 1$. The transformations are

$$w_B = \frac{\exp\left( \tilde w_B \right)}{\exp\left( \tilde w_B \right) + \exp\left( \tilde w_I \right)} \qquad (41)$$

$$w_I = \frac{\exp\left( \tilde w_I \right)}{\exp\left( \tilde w_B \right) + \exp\left( \tilde w_I \right)} \qquad (42)$$

The gradient of $b_j^a(x_t)$ can then be obtained with respect to the transformed weights as

$$\frac{\partial b_j^a(x_t)}{\partial \tilde w_B} = w_B \left[ b_j^B(x_t) - b_j^a(x_t) \right] \qquad (43)$$

$$\frac{\partial b_j^a(x_t)}{\partial \tilde w_I} = w_I \left[ b_j^I(x_t) - b_j^a(x_t) \right] \qquad (44)$$

C. Training Procedure

The training procedure involves successive updates to the target and alternative hypothesis models, $\lambda_c$ and $\lambda_a$, as given in (28), based on the estimation of the gradient of the expected cost. This subsection describes how this gradient update procedure is initialized by obtaining initial ML estimates of the model parameters. It also describes how the decoder is invoked at each iteration of the procedure to provide segmentation information for the computation of the underlying cost functions. Initial ML estimates of the target hypothesis HMM models, $\lambda_c$, are obtained using the segmental k-means training algorithm. The procedure that is followed here for obtaining the initial estimates of the alternative hypothesis model, $\lambda_a$, is described as follows. First, ML training of the background model, $\lambda_B$, is performed using all of the utterances in the training set. Second, hypothesized segments along with their segmentation boundaries are obtained using the modified Viterbi algorithm given in (16), with $\lambda_B$ serving as the initial alternative hypothesis HMM model. Finally, initial ML estimates of the imposter models, $\lambda_I$, are trained from the decoded sub-word units that were labeled as being insertions and substitutions in the training data.
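As a toy illustration of the procedure, the sketch below runs gradient descent on the average sigmoid cost of (27)-(29), with a scalar trainable score per unit standing in for the HMM parameters; the unit names, initial scores, and constants are illustrative choices, not values from the paper.

```python
import math

# Illustrative constants: sigmoid slope, decision threshold, learning rate.
GAMMA, TAU, MU = 4.0, 0.0, 0.5

def cost(score, eps):
    # Sigmoid cost: eps = +1 for a correct hit, -1 for a false alarm.
    # Cost approaches 1 when a target scores below TAU or an imposter
    # scores above TAU, so the average cost soft-counts type I + type II errors.
    return 1.0 / (1.0 + math.exp(eps * GAMMA * (score - TAU)))

# (unit, eps) training tokens; "uh" is decoded here only as a false alarm.
tokens = [("yes", +1), ("yes", +1), ("no", +1), ("uh", -1), ("uh", -1)]
D = {"yes": 0.1, "no": -0.1, "uh": 0.1}     # initial segment scores near TAU

def avg_cost():
    return sum(cost(D[u], e) for u, e in tokens) / len(tokens)

before = avg_cost()
for _ in range(100):
    grad = {u: 0.0 for u in D}
    for u, e in tokens:
        l = cost(D[u], e)
        grad[u] += -e * GAMMA * l * (1.0 - l) / len(tokens)   # d cost / d score
    for u in D:
        D[u] -= MU * grad[u]                                  # gradient descent (28)
after = avg_cost()
```

After training, scores of correct hits have been pushed above the threshold and scores of false alarms below it, so the soft error count decreases.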
The complete likelihood ratio based training procedure can be outlined as follows.
1) Train initial maximum likelihood HMMs, $\lambda_c$, $\lambda_B$, and $\lambda_I$, for each unit.
2) For each iteration over the training database:
   • For each utterance:
     - Obtain the hypothesized sub-word unit string and segmentation using the LR decoder.
     - Align the decoded unit string with the correct string.
     - Label each decoded sub-word unit as correct or false alarm, to obtain the indicator function in (31).
     - Update the gradient of the expected cost in (29).
   • Update the model parameters as in (28).
Of course, the motivation for using the LR decoder in the training procedure is so that the same optimization criterion is used in both parameter reestimation and decoding. Anecdotal experiments have suggested that better performance is achieved using the LR decoder in training than using an ML decoder, provided that a reasonable model initialization strategy is used. It is very difficult, however, to make stronger statements beyond this empirical evidence as to the "optimality" of using the LR decoder in training.

IV. SPEECH CORPORA: MOVIE LOCATOR TASK

The decoding and training procedures that were described in Sections II and III were evaluated on a limited domain continuous speech recognition task. The task was a trial of an ASR based service which accepted spoken queries from customers over the telephone concerning movies playing at local theaters [21]. This task, referred to as the "movie locator", was interesting for evaluating utterance verification techniques because it contained both a large number of ill-formed utterances and many out-of-vocabulary words. This section briefly describes the movie locator task and the evaluation speech corpus that was derived from it. The following carrier phrases describe the general classes of queries that are accepted. The italicized words represent semantic class labels, each of which can take anywhere from several dozen to several hundred values that can change over time.
The system was also configured to accept queries that contain additional qualifiers, such as time restrictions, e.g., "this afternoon", and spontaneous speech, e.g., "um I would like to know". The example carrier phrases are as follows.
1) What's playing at the theater_name?
2) What's playing near city?
3) Where is movie_title playing near city?
4) When is movie_title playing at the theater_name?
5) When is movie_title playing near city?
6) What movie_category are playing at the theater_name?
7) What movie_category are playing near city?
8) Where is theater_name located?
9) Where is movie_title playing at a chain_name near city?
An example of a complex query accepted by the service is "Yes, I would like to find out what movie is being shown at, um, the Lake Theater in Oak Park, Illinois."
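For illustration only, a few of the carrier phrases above can be written as a small pattern grammar; the class value lists below are tiny hypothetical stand-ins for the semantic classes, which in the actual service were compiled into a finite state grammar over a 570 word lexicon.

```python
import re

# Tiny illustrative value lists for three of the semantic classes.
theaters = ["the lake theater"]
cities = ["oak park"]
titles = ["casablanca"]

def klass(values):
    # a semantic class expressed as an alternation over its current values
    return "(" + "|".join(re.escape(v) for v in values) + ")"

patterns = [
    re.compile(r"what's playing at " + klass(theaters) + r"\?"),
    re.compile(r"where is " + klass(titles) + " playing near " + klass(cities) + r"\?"),
]

def in_grammar(query):
    return any(p.fullmatch(query) for p in patterns)
```

A query such as "where is casablanca playing near oak park?" parses; out-of-grammar queries are exactly the inputs that the utterance verification stage must reject.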
LLEIDA AND ROSE: UTTERANCE VERIFICATION IN CONTINUOUS SPEECH RECOGNITION
In a trial of the system over the public switched telephone network, the service was configured to accept approximately 105 theater names, 135 city names, and between 75 and 100 current movie titles. A corpus of 4777 spontaneous spoken utterances from the trial was used in our evaluation. A total of 3025 sentences were used for training acoustic models and 752 utterances were used for testing. The sub-word models used in the recognizer consisted of 43 context independent units. Recognition was performed using a finite state grammar built from the specification of the service, with a lexicon of 570 different words. The total number of words in the test set was 4864, of which 134 were out-of-vocabulary. There were 275 sentences in the test set that could not be parsed by the finite state grammar. Of these 275 sentences, 85 contained out-of-vocabulary words and the rest were not explained by the grammar. Recognition performance of 94.0% word accuracy was obtained on the "in-grammar" utterances. While the results described in Section V are presented primarily in terms of word verification performance, it is clear from the data that the ability to reject invalid queries is also important.

The feature set used for recognition included 12 mel-cepstrum, 12 delta mel-cepstrum, 12 delta-delta mel-cepstrum, energy, delta energy, and delta-delta energy coefficients. Cepstral mean normalization was applied to the cepstral coefficients to compensate for linear channel distortions.

V. EXPERIMENTAL RESULTS

An experimental study was performed to evaluate the effectiveness of both the likelihood ratio decoding procedure described in Section II and the likelihood ratio based training procedure described in Section III. This section describes the results obtained from this study, and is composed of three parts. First, several methods for computing word level confidence measures from local likelihood ratio scores are compared.
A reasonable definition of this confidence measure is important for both the one-pass and two-pass methods used here for utterance verification. Second, the utterance verification performance of the one-pass and two-pass utterance verification paradigms is compared. We have not attempted to make direct comparisons between systems that were implemented using ML based decoding and LR based decoding without also considering model training scenarios. Comparisons are made, however, between the UV performance for the one-pass procedure and that of the two-pass procedure when HMM models are trained using both ML and LR based training. Finally, the effect of the likelihood ratio training procedure on utterance verification performance is investigated.

The experimental evaluation of all decoding procedures, training algorithms, and utterance verification techniques is performed using the speech corpus derived from the movie locator task described in Section IV. For all the experiments, the speech recognizer is configured using forty-three context independent HMM models. Each model contains three states with continuous observation densities and a maximum of 32 Gaussian mixture components per state. Tri-phone based HMM units were also evaluated on this task, but provided only a 16% reduction in word error rate. This was thought to be because of the relatively small number of utterances available for training.
Fig. 5. Word detection operating curves using a geometric mean combination (solid line) and an arithmetic mean combination (dashed line).
These recognition models correspond to the "target" hypothesis models discussed in the context of utterance verification in Section II-C. It is important to note that all of the systems implemented according to the two-pass UV scenario described in Fig. 1 use this exact set of recognition HMM models for the first pass CSR decoder. LR trained target and alternative hypothesis HMM models are only used in the second UV stage of the two-pass procedure.

The alternate hypothesis models are configured according to the definition given in (17). A single "background" HMM alternate hypothesis model, $\lambda_B$, containing three states with 32 mixtures per state was used. The background model parameters were trained using the ML segmental k-means algorithm from the entire training corpus. A separate "imposter" alternate hypothesis HMM model, $\lambda_I$, was trained for each sub-word unit. These models contained three states with eight mixtures per state and were initially trained using the ML segmental k-means algorithm from insertions and substitutions decoded from the training corpus. Subsequent training of all model parameters using the likelihood ratio training procedure is discussed below.

A. Comparison of UV Measures

Word level utterance verification performance is compared for different definitions of the confidence measures given in (20)-(23). All of the results described in this subsection are based on the two-pass utterance verification paradigm, where the hypothesized word strings are passed from the speech recognizer to the utterance verification system. Performance is described both in terms of receiver operating characteristic (ROC) curves and curves displaying the type I + type II error plotted against the decision threshold setting. The motivation for these performance criteria was given in Section II-A.

Fig. 5 shows the ROC curves obtained when the word level confidence score is formed using an arithmetic average and a geometric average of the unweighted sub-word confidence scores. Since the geometric mean tends to emphasize the smaller valued sub-word level scores, the UV performance is better than that obtained using the arithmetic mean.
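The two combination rules compared in Fig. 5, together with the sigmoid compression of (19), can be sketched as follows; the sub-word likelihood ratio scores and the sigmoid slope and shift are illustrative values.

```python
import math

def sigmoid(lr, alpha=1.0, beta=0.0):
    # (19)-style compression of a sub-word likelihood ratio score into [0, 1]
    return 1.0 / (1.0 + math.exp(-alpha * (lr - beta)))

def arith_mean(scores):
    return sum(scores) / len(scores)

def geom_mean(scores):
    p = 1.0
    for s in scores:
        p *= s
    return p ** (1.0 / len(scores))

# one strong local match amid otherwise poor sub-word scores
unit_scores = [0.2, 0.25, 6.0, 0.3]
weighted = [sigmoid(s) for s in unit_scores]

# the geometric mean discounts the single extreme score far more
# heavily than the arithmetic mean does
raw_arith, raw_geom = arith_mean(unit_scores), geom_mean(unit_scores)
```

With the sigmoid applied first, the two means give much closer values, mirroring the behavior seen in Fig. 6.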
Fig. 6. Word detection based receiver operating curves and type I + type II error curves comparing performance of confidence measures using the arithmetic mean (dashed line) and the geometric mean (solid line) of the sigmoid weighted sub-word level confidence measure.
Fig. 6 shows both the ROC curves and the type I + type II error curves when a sigmoidal weighting function is applied to the sub-word level confidence scores, and the word level confidence scores are obtained from the sub-word level scores using the arithmetic mean or the geometric mean. In both cases, the sigmoid weighting in (19) is parameterized using a single setting of the slope and shift. It is clear from Fig. 6 that when the sigmoidal weighting is applied, the ROC curves are similar for both the arithmetic average and the geometric average. This suggests that the sigmoid performs the same function as the geometric averaging in limiting the effects of extreme values in the likelihood ratio scores. However, it appears from the error plot in Fig. 6 that the confidence measure based on the geometric mean is less sensitive to the setting of the confidence threshold. In the remaining simulations described in this section, the geometric mean of the sigmoid weighted sub-word level confidence measures will be used.

B. Comparison of One-Pass Decoder with Well-Known UV Procedures

This section compares the one-pass utterance verification and decoding procedure to other more well known procedures. The one-pass UV is compared to an often used two-pass UV procedure where word level confidence measures are computed using a dedicated network of phonemes to obtain the alternate hypothesis likelihood. Fig. 7 provides a comparison of the two-pass utterance verification performance for several different definitions of the alternate hypothesis model. As a point of reference, the first example corresponds to the case where no alternate hypothesis model is used; instead, the simple duration normalized likelihood score is used:
$$S(w) = \frac{1}{t_e - t_s + 1} \log P\left( X_{t_s}^{t_e} \mid \lambda_c \right) \qquad (46)$$

where $X_{t_s}^{t_e}$ corresponds to the sequence of observation vectors that were decoded for the word $w$ beginning at frame $t_s$ and ending at frame $t_e$. This is displayed as the dash-dot curve in Fig. 7. The second example corresponds to a "phone-net" likelihood ratio score, which can be defined as

$$LR_{pn}(w) = \frac{1}{t_e - t_s + 1} \log \frac{P\left( X_{t_s}^{t_e} \mid \lambda_c \right)}{P\left( X_{t_s}^{t_e} \mid \lambda_{pn} \right)} \qquad (47)$$
Fig. 7. Comparison among three UV procedures: one-pass UV (solid line), phone-net (dashed line), and duration normalized likelihood (dot-dash line).
where $\lambda_{pn}$ corresponds to an unconstrained network of the 43 context independent phones that are in the system. This is displayed as the dashed curve in the figure. Finally, the performance of the one-pass utterance verification using target/alternative models is given as the solid line in Fig. 7. The target and alternate hypothesis models used by the one-pass UV are updated by applying two iterations of the likelihood ratio training procedure described in Section III-C. The likelihood ratio decoding algorithm in (16) is used for obtaining the optimum word string, and the confidence measure is computed directly from the scores obtained from the decoder.

This section has two goals. The first is to investigate the effects of the likelihood ratio training procedure on UV performance. The second goal is to investigate the relative performance of the one-pass UV procedure with respect to the two-pass procedure when identical definitions of the target and alternative hypothesis models are used for both procedures. While the one-pass UV and decoding procedure has clear advantages over two-pass procedures in terms of simplicity of control structures and system delay, it still remains to show how the performance of a one-pass UV system compares to that of a similarly configured two-pass system. In preliminary
TABLE I
UTTERANCE VERIFICATION PERFORMANCE: type I + type II MINIMUM ERROR RATES FOR THE ONE-PASS (OP) AND THE TWO-PASS (TP) UTTERANCE VERIFICATION PROCEDURES. b## DENOTES THE NUMBER OF MIXTURES FOR THE BACKGROUND MODEL AND i## THE NUMBER OF MIXTURES FOR THE IMPOSTER MODEL
comparisons, it was determined that the relative performance of the two procedures was heavily dependent on the criterion used for training the alternate hypothesis HMM models. This observation is made more precise by the experimental comparison whose results are summarized in Table I.

It is clear from Fig. 7 that using the absolute likelihoods gives clearly inferior performance to likelihood ratio based confidence measures. This was already well known. The phone-net based utterance verification procedure performs reasonably well but suffers from the computational expense of running an unconstrained network of phones. It is interesting to note that in terms of utterance verification performance, recognition performance, and computational efficiency, the combination of likelihood ratio training and target/alternative HMM models with the one-pass UV decoding strategy out-performs a fairly commonly used procedure for UV.

C. Investigation of LR Training and UV Strategies

The entries in Table I correspond to the minimum type I + type II error when computed over all possible settings of the decision threshold. An example of this type of plot is given in Fig. 6, and the use of a single minimum value in Table I is dictated by the need to summarize overall performance so that cross-system performance comparisons can be made. The minimum type I + type II error rates are displayed for the one-pass (OP) and the two-pass (TP) strategies. The first row in Table I corresponds to use of the initial maximum likelihood trained HMMs, and the second and third rows correspond to models trained using one and two iterations of the likelihood ratio training procedure, respectively. The three major columns are each labeled according to the number of component densities composing the background alternate hypothesis model, $\lambda_B$, and the imposter alternate hypothesis model, $\lambda_I$, in (17). For example, the column labeled b32.i4 corresponds to a background alternate hypothesis model with 32 component Gaussians and an imposter model with four component Gaussians. It is also important to note that all of the results for the two-pass procedure are based on a system which uses the original "recognition HMM models", as indicated in Fig. 1 and described in Section IV, for the ML based CSR decoder used in the first pass.

There are three points that can be made from the results displayed in Table I. The first point is that likelihood ratio training is very important for obtaining good performance from the one-pass likelihood ratio decoder. This is evident from the significant performance improvements listed under "OP" in the table in going from maximum likelihood training to likelihood ratio training for all parameterizations of the alternate hypothesis model. It is clear that the different parameterizations for the alternative hypothesis models yield a wide range of performance after the initial ML training. However, in all three cases shown, the LR training of the UV models results in UV performance that is not only significantly improved but also remarkably consistent after only two iterations of LR training are applied.

Fig. 8. Likelihood ratio training: ROC curves for initial models (dash-dot line), one iteration (dashed line), and two iterations (solid line). The *-points are the minimum type I + type II error.

The importance of the LR training in the one-pass decoder is also demonstrated by the ROC curves in Fig. 8. Separate curves are plotted to describe the UV performance for the one-pass decoder using zero, one, and two iterations of the LR training procedure. It is clear from Fig. 8 that there is a significant increase in performance over a wide range of operating points.

The second point that can be made is the dependence of UV performance on the parameterization of the background alternative hypothesis model, $\lambda_B$. The purpose of the background model, as mentioned in Section II-D2, is to provide a broad representation of the feature space, reducing the dynamic range of the LR score. For the initial ML UV models, reducing the number of mixture densities from 32 to 16 results in a severe degradation in UV performance. Anecdotal observations suggest that this is a result of the increased dynamic range of the LR caused by the reduced parameterization of $\lambda_B$. The LR training directly compensates for the effects of these artifacts, as is shown in Table I.

The third point that can be made from Table I is that the one-pass UV procedure consistently out-performs the two-pass procedure when likelihood ratio training is used. A consistent level of performance is maintained over the range of model parameterizations that are displayed in Table I. The ROC curves in Fig. 9 also provide a comparison of UV performance for the one-pass and two-pass UV verification procedures.
The curves are plotted for a single model parameterization where 32 mixtures per state are used for the target hypothesis models and eight mixtures per state are used for the alternate hypothesis models. For a single model parameterization, an improvement of approximately 7% in word detection performance is obtained for false alarm rates greater than 5%.
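The summary statistic reported in Table I, the minimum of the type I + type II error over all settings of the decision threshold, can be computed from scored examples as in the sketch below; the confidence values are illustrative.

```python
def min_total_error(target_scores, imposter_scores):
    # sweep candidate thresholds; a word is accepted when score >= threshold
    candidates = sorted(set(target_scores) | set(imposter_scores))
    best = float("inf")
    for tau in candidates:
        type1 = sum(s < tau for s in target_scores) / len(target_scores)      # false rejections
        type2 = sum(s >= tau for s in imposter_scores) / len(imposter_scores) # false acceptances
        best = min(best, type1 + type2)
    return best

targets = [0.9, 0.8, 0.75, 0.6, 0.4]    # confidence scores of correct hits
imposters = [0.5, 0.3, 0.2, 0.1]        # confidence scores of false alarms
best = min_total_error(targets, imposters)
```

Reporting this single minimum makes the cross-system comparisons in Table I possible without plotting a full error curve for every configuration.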
TABLE II SPEECH RECOGNITION PERFORMANCE GIVEN IN TERMS OF WORD ACCURACY WITHOUT USING UTTERANCE VERIFICATION AND UTTERANCE VERIFICATION PERFORMANCE GIVEN AS THE SUM OF type I AND type II ERROR
Fig. 9. One-pass versus two-pass UV comparison with the b32.i8 configuration and two iterations of the likelihood ratio training.

Fig. 10. In-grammar sentences. Initial models: dot-dash line; one iteration: dashed line; two iterations: solid line.
VI. DISCUSSION

There are many questions that still remain concerning the effect of the UV related training and decoding. We briefly address two of these issues. First, while the LR training procedures were compared in Section V in terms of utterance verification performance, it was not clear whether or not the LR training procedures actually improved speech recognition performance. Second, the performance in Section V was reported for a single test set which included both in-grammar and out-of-grammar utterances. It was not clear how the various techniques would perform when the in-grammar and out-of-grammar utterances were considered separately.

In Table II, both speech recognition and UV performance are given for the test set that includes both in-vocabulary and out-of-vocabulary utterances. The first column of the table displays the word accuracy when no utterance verification is used, and the second column of the table displays the sum of the type I and type II error which is used to describe UV performance. The first row of Table II gives the performance with no likelihood ratio training and the second row displays performance after two iterations of likelihood ratio training. First, it is interesting to compare the word accuracy of 84.7% obtained for the combined in-vocabulary and out-of-vocabulary utterances with the 94.0% word accuracy reported in Section IV for the in-vocabulary utterances alone. Second, it is interesting to note from Table II that the word accuracy significantly improves after two iterations of LR training, although not to the same degree as the UV performance. In Figs. 10 and 11, families of ROC curves are used to describe UV performance as measured over in-grammar and out-of-grammar utterances, respectively. In each of the figures,
Fig. 11. Out-of-grammar sentences. Initial models: dot-dash line; one iteration: dashed line; two iterations: solid line.
separate curves are used for systems configured using zero, one, and two iterations of LR training. It is clear from Figs. 10 and 11 that performance degrades dramatically for out-of-grammar utterances. We believe that the problem with out-of-grammar utterances is particularly severe here because of the relatively low perplexity of the language model as defined for this task. It is also clear that the effect of the LR training procedure is much more pronounced for the in-grammar utterances.

VII. SUMMARY AND CONCLUSIONS

We have presented novel decoding and training procedures for HMM utterance verification and recognition based on
the optimization of a likelihood ratio based criterion. The decoding algorithm allows word hypothesis detection and verification to be performed simultaneously in a "one-pass" procedure by defining the optimum state sequence to be that which maximizes a likelihood ratio, as opposed to a maximum likelihood criterion. Associated with this decoding algorithm, a training procedure has been presented which optimizes the same likelihood ratio criterion that is used in the decoder. As a result, both training and recognition are performed under the same optimization criterion as is used for hypothesis testing. While a very general search algorithm has been presented, a more efficient algorithm taking the form of a modified Viterbi decoder is derived as a special case of this general procedure. Finally, a word confidence measure based on the same optimization function used in training has been proposed.

Experimental studies were performed using utterances collected from a trial of a limited domain ASR based service which accepted continuous utterances from customers as queries over the public switched telephone network. It was observed on this task that the likelihood ratio training procedure could improve utterance verification performance by an amount ranging from 11% to 19% depending on model parameterization. It was also found that, when combined with likelihood ratio training, the one-pass decoding procedure improved UV performance over the more traditional two-pass approach by as much as 11%. More recently, likelihood ratio training and decoding have also been successfully applied to other tasks, including speaker dependent voice label recognition [15]. Further research should involve the investigation of decoding and training paradigms for utterance verification that incorporate additional, non-acoustic sources of knowledge.

REFERENCES

[1] J. M. Boite, H. Bourlard, B. D'hoore, and M. Haesen, "A new approach to keyword spotting," in Proc. Eur. Conf.
Speech Communications, Sept. 1993. [2] J. S. Bridle, “Alpha-neets: A recurrent neural network architecture with a hidden Markov model interpretation,” Speech Commun., vol. 9, no. 1, pp. 83–92, 1990. [3] J. K. Chan and F. K. Soong, “An N-best candidates-based discriminative training for speech recognition applications,” IEEE Trans. Speech Audio Processing, vol. 2, pp. 206–216, Jan. 1994. [4] W. Chou, B. H. Juang, and C. H. Lee, “Segmental GPD training of HMM based speech recognizer,” in Proc. Int. Conf. Acoust., Speech, Signal Processing, Apr. 1992, pp. 473–476. [5] S. Cox and R. C. Rose, “Confidence measures for the Switchboard database,” in Proc. Int. Conf. Acoust., Speech, Signal Processing, May 1996. [6] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1990. [7] B. H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Trans. Signal Processing, vol. 40, pp. 3043–3054, Dec. 1992. [8] E. Lleida and R. C. Rose, “Efficient decoding and training procedurs for utterance verification in continuous speech recognition,” in Proc. Int. Conf. Acoust., Speech, Signal Processing, May 1996. [9] , “Likelihood ratio decoding and confidence measures for continuous speech recognition,” in Proc. Int. Conf. Spoken Language Processing, Oct. 1996. [10] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, pp. 257–286, 1989. [11] M. G. Rahim, C. H. Lee, and B. H. Juang, “Robust utterance verification for connected digits recognition,” in Proc. Int. Conf. Acoust., Speech, Signal Processing, May 1995, pp. 285–288. [12] M. Rahim, C. Lee, B. Juang, and W. Chou, “Discriminative utterance verification using minimum string verification error training,” in Proc. Int. Conf. Acoust., Speech, Signal Processing, May 1996.
[13] R. C. Rose, "Discriminant word spotting techniques for rejecting non-vocabulary utterances in unconstrained speech," in Proc. Int. Conf. Acoust., Speech, Signal Processing, Mar. 1992.
[14] R. C. Rose, B. H. Juang, and C. H. Lee, "A training procedure for verifying string hypotheses in continuous speech recognition," in Proc. Int. Conf. Acoust., Speech, Signal Processing, Apr. 1995, pp. 281–284.
[15] R. C. Rose, E. Lleida, G. W. Erhart, and R. V. Grubbe, "A user-configurable system for voice label recognition," in Proc. Int. Conf. Spoken Language Processing, Oct. 1996.
[16] R. A. Sukkar and J. G. Wilpon, "A two pass classifier for utterance rejection in keyword spotting," in Proc. Int. Conf. Acoust., Speech, Signal Processing, Apr. 1993.
[17] M. Rahim, C. Lee, B. Juang, and W. Chou, "Utterance verification of keyword strings using word-based minimum verification error training," in Proc. Int. Conf. Acoust., Speech, Signal Processing, May 1996, pp. 518–521.
[18] C. Torre and A. Acero, "Discriminative training of garbage model for non-vocabulary utterance rejection," in Proc. Int. Conf. Spoken Language Processing, June 1994.
[19] J. G. Wilpon, L. R. Rabiner, C. H. Lee, and E. R. Goldman, "Automatic recognition of keywords in unconstrained speech using hidden Markov models," IEEE Trans. Acoust., Speech, Signal Processing, vol. 38, pp. 1870–1878, Nov. 1990.
[20] S. R. Young and W. H. Ward, "Recognition confidence measures for spontaneous spoken dialog," in Proc. Eur. Conf. Speech Communications, Sept. 1993, pp. 1177–1179.
[21] J. J. Wisowaty, "Continuous speech interface for a movie locator service," in Proc. Human Factors Ergon. Soc., 1995.
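The conclusion summarizes a modified Viterbi decoder in which the best path maximizes a likelihood ratio rather than a plain likelihood. As a rough illustration of that idea only (not the paper's actual algorithm), the sketch below runs a standard Viterbi recursion in which the per-frame emission score is replaced by a frame-level log-likelihood ratio between a target model and an alternate (imposter) model; all array names and inputs are illustrative assumptions.

```python
import numpy as np

def lr_viterbi(log_b_target, log_b_alt, log_A, log_pi):
    """Viterbi search where the local score is a log-likelihood ratio
    between target and alternate emission models, so the recovered path
    maximizes an LR criterion instead of the plain likelihood.

    log_b_target, log_b_alt : (T, N) frame log-likelihoods per state
    log_A                   : (N, N) transition log-probabilities
    log_pi                  : (N,)   initial-state log-probabilities
    Returns (best LR score, LR-optimal state sequence)."""
    T, N = log_b_target.shape
    # Frame-level log-likelihood ratio used as the local score.
    llr = log_b_target - log_b_alt
    delta = log_pi + llr[0]            # best partial LR score per state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_A          # cand[i, j]: prev i -> next j
        psi[t] = np.argmax(cand, axis=0)       # best predecessor per state
        delta = cand[psi[t], np.arange(N)] + llr[t]
    # Backtrack the LR-optimal state sequence.
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return float(np.max(delta)), path[::-1]
```

The only change from a conventional Viterbi decoder is the local score `llr`; the recursion and backtracking are otherwise standard, which is consistent with the paper's observation that the LR decoder can be obtained as a modified Viterbi search.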
Eduardo Lleida (M'99) was born in Spain in 1961. He received the M.Sc. degree in telecommunication engineering and the Ph.D. degree in signal processing from the Universitat Politècnica de Catalunya (UPC), Spain, in 1985 and 1990, respectively. From 1989 to 1990, he was an Assistant Professor, and from 1991 to 1993, an Associate Professor, with the Department of Signal Theory and Communications, UPC. From February 1995 to January 1996, he was a Consultant in Speech Recognition with AT&T Bell Laboratories, Murray Hill, NJ. Currently, he is an Associate Professor of signal theory and communications with the Department of Electronic Engineering and Communications, Centro Politécnico Superior, Universidad de Zaragoza, Spain. His current research interests are in signal processing, particularly as applied to speech enhancement, speech recognition, and keyword spotting.
Richard C. Rose (S'84–M'88) received the B.S. and M.S. degrees in electrical engineering from the University of Illinois, Urbana, in 1979 and 1981, respectively, and the Ph.D. degree from the Georgia Institute of Technology, Atlanta, in 1988, completing his dissertation work in speech coding and analysis. From 1980 to 1984, he was with Bell Laboratories, Murray Hill, NJ, where he worked on signal processing and digital switching systems. From 1988 to 1992, he was a member of the Speech Systems and Technology Group at Lincoln Laboratory, Massachusetts Institute of Technology, Cambridge, working on speech recognition and speaker recognition. He is presently a Principal Member of Technical Staff, Speech and Image Processing Services Laboratory, AT&T Laboratories—Research, Florham Park, NJ. Dr. Rose served as a member of the IEEE Signal Processing Technical Committee on Digital Signal Processing from 1990 to 1995, and has served as an Adjunct Faculty Member of the Georgia Institute of Technology. He was elected as an At Large Member of the Board of Governors for the Signal Processing Society for the period from 1995 to 1998, and served as membership coordinator during that time. He is currently serving as an associate editor for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, and is also currently a member of the IEEE Signal Processing Technical Committee on Speech. He is a member of Tau Beta Pi, Eta Kappa Nu, and Phi Kappa Phi.