Topics on Minimum Classification Error Rate Based Discriminant Function Approach to Speech Recognition

Wu Chou
Bell Labs., Lucent Technologies, 600 Mountain Ave., Murray Hill, NJ 07974, U.S.A.
[email protected]

ABSTRACT

In this paper, we study the discriminant function based minimum recognition error rate approach to pattern recognition. This approach departs from the conventional paradigm, which links a classification/recognition task to the problem of distribution estimation. Instead, it takes a discriminant function based statistical pattern recognition approach, and the goodness of this approach with respect to classification error rate minimization is established through a special loss function. It remains meaningful even when the model correctness assumption is known to be invalid. The use of discriminant functions has a significant impact on classifier design, since in many realistic applications, such as speech recognition, the true distribution form of the source is rarely known precisely, and without the model correctness assumption the classical optimality theory of the distribution estimation approach cannot be applied directly. We discuss issues in this new classifier design paradigm and present various extensions of this approach for applications in speech processing.

1. INTRODUCTION

The advent of powerful computing devices and the success of hidden Markov modeling have triggered a renewed pursuit of more powerful statistical methods to further reduce the recognition error rate and to build speech recognition systems that are more robust across various conditions. The use of discriminant function methods in speech recognition is a promising approach. Although the statistical formulation of minimum classification error (MCE) based discriminative methods has its roots in classical Bayes decision theory, it departs from the conventional paradigm which links a classification/recognition task to the problem of distribution estimation. Instead, it takes a discriminant function based statistical pattern classification approach, and for a given family of discriminant functions, optimal classifier design involves finding a set of parameters which minimizes the empirical recognition error rate. One classical example of using discriminant functions for classifier design in the statistical literature is the two-class classification problem using linear discriminant functions. In particular, a window based method was described in [22] for the two-class classification problem using linear discriminant functions that minimize the probability of classification error. In this paper, the focus is on general MCE based discriminative methods. The discriminant functions are nonlinear and often related to the structure of the statistical framework used in speech recognition, such as hidden Markov models.

The main reason for taking a discriminant function based approach to classifier design, as will be further elaborated, is that we lack complete knowledge of the form of the data distribution and that training data are always inadequate, particularly in dealing with speech problems. The performance of a recognizer is normally defined by its expected recognition error rate, and an optimal recognizer should be the one that achieves the least expected rate of recognition error. The difference between the distribution estimation based approach and the discriminant function based MCE approach lies in the way the recognition error is expressed and in the computational steps that lead to the minimization of such error functions. A key to the development of the MCE method is a new error function which incorporates the recognition operation and performance in a functional form, from which the performance of the classifier can be directly evaluated and optimized.

Classifier design without assuming knowledge of the class posterior probabilities has been studied in many areas. In particular, Tsypkin [72] and Amari [1, 2] pioneered this approach for self-learning and self-organizing nets. They formulated the problem of self-learning as a classification problem which consists of optimally partitioning the observation space into regions X_k for which the expected risk R is minimized. In addition, a mathematical minimization procedure, the generalized probabilistic descent (GPD) algorithm, or stochastic approximation, was proposed as a means for classifier design under this framework. Since then, various loss functions have been used in designing classifiers, including the popular mean-square error based loss functions. However, many tractable loss functions do not have a direct relation to recognition error rate minimization, and therefore, albeit based on discriminant functions, they are not directly related to the recognition error rate, which should be the most sensible choice for classifier design.

Over the past decades, the MCE based approach has been developed to overcome the fundamental limitations of the traditional approach and to directly link the classifier design problem to classification error rate minimization. In order to alleviate the dependency on the class posterior distributions, a discriminant function based MCE approach was proposed by Juang et al. [32] as an alternative route to optimal classifier design. Although this approach applies to the pattern recognition problem in general, it has found various applications in speech recognition. It was first applied to dynamic time warping based recognition systems [12, 36]. Application to hidden Markov model based continuous speech recognition systems was formulated as a segmental and string model based MCE approach [14, 15], and successful applications of this approach were reported in [16, 21, 25, 54, 55]. This approach was further extended to form a combined string model, in which training of the model components of a speech recognition system can be achieved under a unified MCE framework [17, 29]. It was applied to discriminative model combination [9, 53] and to applications in speaker identification and verification [48, 26, 41]. The basic idea of the MCE approach was further developed for applications in the general utterance verification problem [71, 51]. A general framework combining detection and verification in speech recognition and understanding was also proposed, in which the discriminant function based pattern recognition approach is applied in both the detection and verification processes [35, 40].


2. DISCRIMINANT FUNCTION BASED CLASSIFIER DESIGN

One of the important issues in classifier design is the selection of the loss function, upon which the optimal classifier can be characterized as the one which achieves the minimum loss. To illustrate the concept, consider an M class classification problem where the classifier C(x) defines a mapping from the sample space x ∈ X to the discrete categorical set C_i ∈ Y. The loss function is a function from X × Y → R, where R is the set of real numbers. Each class pair (j, i) can be associated with a cost or loss e_{ji} which signifies the cost of classifying (or recognizing) a class i observation as a class j event. The average loss associated with making the decision C(x) = C_i is given by


R(C_i | x) = \sum_{j=1}^{M} e_{ji} P(C_j | x),   (1)

where P(C_j | x) is the class posterior probability that the true state of x is in C_j. This leads to a performance measure for the classifier, i.e. the expected loss, defined as

L = \int R(C(x) | x) \, dP(x),   (2)

where C(x) represents the classifier's decision (assuming one of the M "values", C_1, C_2, ..., C_M), based on a random observation x drawn from a probability distribution P(x). Obviously, if the classifier is so designed that for every x

R(C(x) | x) = \min_i R(C_i | x),   (3)

the expected loss in equation (2) will be minimized. If the loss function is chosen to be the zero-one loss, e_{ji} = 1 - δ_{ij} (i.e. no cost for a correct decision and unit cost for an error), the expected loss L is the error probability of the classifier and the conditional loss becomes

R(C_i | x) = \sum_{j \neq i} P(C_j | x) = 1 - P(C_i | x).   (4)

The optimal classifier from Bayes theory implements the "maximum a posteriori" (MAP) decision:

C(x) = C_i   iff   P(C_i | x) = \max_j P(C_j | x).   (5)

When the true posterior probabilities are known, the MAP decision implements an optimal classifier which minimizes the expected loss based on (4). The minimum error rate achieved by the MAP decision rule is called the "Bayes risk". However, if the true posterior probabilities are not known, or the decision rule is not based on the class posterior probability, then we cannot use this result directly to claim the optimality of the classifier. For most real world problems, the posterior probabilities in Bayes decision theory are not known a priori. Our choice of the distribution used in the classifier is often limited by mathematical tractability and is very likely to be inconsistent with the true distribution. This means the true MAP decision can rarely be implemented, and the minimum Bayes risk generally remains an unachievable lower bound.
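As an illustration only, the following is a minimal sketch of the MAP rule of Eq. (5) and the conditional error of Eq. (4), assuming the class posteriors P(C_i | x) are given as a vector; the function name is hypothetical.

```python
import numpy as np

def map_decision(posteriors):
    """MAP rule of Eq. (5): pick the class with the largest posterior P(C_i | x)."""
    posteriors = np.asarray(posteriors, dtype=float)
    i = int(np.argmax(posteriors))
    conditional_error = 1.0 - posteriors[i]   # R(C_i | x) under the zero-one loss, Eq. (4)
    return i, conditional_error

# Example: three classes with known posteriors for a single observation x
print(map_decision([0.2, 0.7, 0.1]))   # class 1, conditional error about 0.3
```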

Discriminant functions, on the other hand, are functions which characterize the decision rule of the classifier. They may or may not be probability or likelihood based functions, and they can come from different parametric families, including families which have no relation to the parametric form of the class posterior distribution required in classical Bayes decision theory. One well studied family is the linear discriminant function, which has received considerable attention and theoretical development. A minimum error rate classifier design using linear discriminant functions was studied in [22]. For an M class classification problem using discriminant functions, a set of discriminant functions {g_i(x) | i = 1, ..., M} is used and the classifier C(x) is defined such that

C(x) = C_I   iff   I = \arg\max_i g_i(x).   (6)

When the loss function R(C(x) | x) is specified, the problem of optimal classifier design using discriminant functions becomes a minimization problem: finding the best set of discriminant functions {g_i(x) | i = 1, ..., M} from a class of discriminant functions which minimizes the expected loss L as defined in Eq. (2).

3. MCE CLASSIFIER DESIGN USING DISCRIMINANT FUNCTIONS

As pointed out in the previous section, without knowledge of the form of the true class posterior probabilities required in classical Bayes decision theory, classifier design by distribution estimation often does not lead to optimal performance. This motivates the search for alternative criteria in classifier design. In particular, the criteria of MMI (maximum mutual information) and MDI (minimum discriminative information) are used in many applications [3, 57]. Although these methods demonstrate significant performance advantages over the traditional ML approach, they are not based on a direct minimization of a loss function which links to the classification error rate.

Minimum classification error (MCE) rate based classifier design is based on a direct minimization of a loss function which relates to the recognition error rate. Do-Tu et al. [22] studied the MCE solution for the two-class non-parametric classification problem using linear discriminant functions. A general approach for multi-class and non-linear discriminant functions was proposed by Juang et al. [32]. In the MCE approach, the classifier design and parameter estimation aim to correctly discriminate the observations for the best recognition/classification results rather than to fit the distributions to the data. This is often achieved through a three step process.

1) The misclassification measure in the MCE based approach is defined as

d_i(X) = -g_i(X; Λ) + \log \left[ \frac{1}{M-1} \sum_{j, j \neq i} \exp(η \, g_j(X; Λ)) \right]^{1/η},   (7)

where η is a positive number [14]. For an i-th class speech utterance X, d_i(X) > 0 implies a misclassification and d_i(X) ≤ 0 means a correct decision. When η approaches ∞, the term in the bracket is the L_η norm on the discrete integer set {j | j ≠ i, j = 1, ..., M}, which converges to the L_∞ norm and becomes max_{j, j≠i} g_j(X; Λ). By varying the values of η and M, one can take all the competing classes into consideration, according to their individual significance, when searching for the classifier parameter Λ.
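For concreteness, a minimal sketch of the misclassification measure in Eq. (7), assuming the class discriminant scores g_j(X; Λ) (e.g. log-likelihoods) have already been computed; the function name and example values are illustrative, not part of the original formulation.

```python
import numpy as np

def misclassification_measure(scores, i, eta=2.0):
    """MCE misclassification measure d_i(X) of Eq. (7).

    scores : 1-D array of discriminant scores g_j(X; Lambda), j = 1..M
    i      : index of the labeled (correct) class
    eta    : positive smoothing exponent; a large eta approaches the max over competitors
    """
    scores = np.asarray(scores, dtype=float)
    g_i = scores[i]
    competitors = np.delete(scores, i)                 # g_j for j != i
    # (1/eta) * log of the average of exp(eta * g_j) over the M-1 competitors
    anti = (1.0 / eta) * np.log(np.mean(np.exp(eta * competitors)))
    return -g_i + anti

# d_i > 0 indicates a misclassification, d_i <= 0 a correct decision
print(misclassification_measure([2.1, 0.3, -0.5], i=0))   # correct class scores highest -> negative
```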

2) The loss function is used for recognition error rate minimization. The misclassification measure of (7) is embedded in a smooth zero-one function, for which any member of the sigmoid function family is an obvious candidate. A general form of the loss function can then be defined as:

ℓ_i(X; Λ) = ℓ(d_i(X)),   (8)

where ℓ is a sigmoid function, one example of which is

ℓ(d) = \frac{1}{1 + \exp(-γ d + θ)},   (9)

with θ normally set to 0 and γ set to a value greater than or equal to one. Clearly, when d_i(X) is negative and far from zero, which implies a correct classification, virtually no loss is incurred. When d_i(X) is positive, it leads to a penalty which becomes, essentially, a classification/recognition error count.

3) The classifier parameter estimation is based on the minimization of the expected loss. For any unknown object X, the classifier performance is measured by

ℓ(X; Λ) = \sum_{i=1}^{M} ℓ_i(X; Λ) \, 1(X \in C_i),   (10)

where 1(·) is the indicator function. The expected loss, which is related to the recognition error rate, is given by

L(Λ) = E_X[ℓ(X; Λ)].   (11)
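A sketch of how Eqs. (8)-(11) might be evaluated empirically on a training set, reusing the misclassification_measure sketch above; score_fn and the data layout are assumptions made for illustration.

```python
import numpy as np

def sigmoid_loss(d, gamma=1.0, theta=0.0):
    """Smoothed error count of Eqs. (8)-(9): l(d) = 1 / (1 + exp(-gamma*d + theta))."""
    return 1.0 / (1.0 + np.exp(-gamma * d + theta))

def empirical_loss(score_fn, data, gamma=1.0, eta=2.0):
    """Sample average approximating the expected loss of Eqs. (10)-(11).

    score_fn : callable mapping an utterance X to its vector of scores g_j(X; Lambda)
    data     : iterable of (X, label) training pairs
    """
    total, n = 0.0, 0
    for X, label in data:
        d = misclassification_measure(score_fn(X), label, eta=eta)  # Eq. (7), defined above
        total += sigmoid_loss(d, gamma=gamma)                        # Eqs. (8)-(9)
        n += 1
    return total / n   # approximates E_X[l(X; Lambda)]
```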

This three-step process in the MCE approach emulates the classification operation as well as the recognition error rate based performance evaluation in a smooth functional form, suitable for classifier parameter optimization. It is important to point out that if the correct form of the posterior probabilities P_Λ(C_i | x) is used as the discriminant function, the minimum loss solution in the MCE approach is an approximation to the Bayes minimum risk [20].

3.1. Optimization Methods

Various minimization algorithms can be used to minimize the expected loss. The generalized probabilistic descent (GPD) algorithm is a powerful algorithm that can be used to accomplish this task [2]. In the GPD based minimization algorithm, the target function L(Λ) is minimized according to an iterative procedure:

Λ_{t+1} = Λ_t - ε_t U_t \nabla ℓ(X_t; Λ) |_{Λ=Λ_t},   (12)

where U_t is a positive definite matrix [19], ε_t is a sequence of positive step sizes, ∇ℓ(X_t; Λ)|_{Λ=Λ_t} is the gradient of the loss function at Λ = Λ_t, and X_t is the t-th training sample used in the sequential training process. The convergence properties of the GPD algorithm were studied in the literature (e.g. [19, 11]) under various names, such as stochastic approximation. The MCE approach is very specific about the form and structure of the discriminant function and loss function of the classifier, but it is relatively unrestricted with respect to the optimization methods used to minimize the expected loss. Many innovations are possible for better optimization results. In particular, methods of linear programming [58], gradient projection [28] and growth transformation [27, 34, 57] have also been used for minimization of the expected loss in MCE classifier design.
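A minimal sketch of one GPD iteration of Eq. (12), assuming U_t is the identity matrix and the gradient is supplied by the caller; the step-size schedule and the names in the commented usage loop are illustrative only.

```python
import numpy as np

def gpd_step(params, grad, t, eps0=0.1):
    """One GPD iteration of Eq. (12) with U_t = I and a decaying positive step size eps_t."""
    eps_t = eps0 / (1.0 + t)          # positive, decreasing step-size sequence
    return params - eps_t * grad

# Sequential (sample-by-sample) training over utterances X_t (illustrative loop):
# for t, (X, label) in enumerate(training_data):
#     grad = gradient_of_loss(params, X, label)    # hypothetical gradient routine
#     params = gpd_step(params, grad, t)
```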

3.2. HMM as a Discriminant Function

There are several ways of using hidden Markov models (HMMs) as discriminant functions. In particular, the discriminant function can be based on the joint observation-state probability defined as follows:

g_i(X, q; Λ) = P^{(i)}(X, q; Λ) = π^{(i)}_{q_0} \prod_{t=1}^{T} a^{(i)}_{q_{t-1} q_t} b^{(i)}_{q_t}(x_t).   (13)

The discriminant function defined in (13) is a natural choice for speech recognition systems using Viterbi decoding, where a^{(i)}_{q_{t-1} q_t} is evaluated along the optimal state sequence. The MCE training algorithm based on (13) is often called segmental GPD, and the MCE estimation of HMM parameters is derived for the GPD based optimization method (see [33] for details). In order to maintain the probabilistic structure of HMMs, care must be taken, and parameters are often derived through certain transformations [14, 33].
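The joint observation-state score of Eq. (13) is usually computed in the log domain; the sketch below assumes the frame/state output log-densities have already been evaluated, and the array layout is an assumption for illustration.

```python
import numpy as np

def hmm_path_log_likelihood(log_pi, log_A, log_b, path):
    """Log of the joint observation-state score in Eq. (13).

    log_pi : (S,)   initial-state log probabilities log pi_{q0}
    log_A  : (S, S) transition log probabilities log a_{q', q}
    log_b  : (T, S) frame/state output log densities log b_q(x_t), t = 1..T
    path   : state sequence (q_0, q_1, ..., q_T), length T + 1
    """
    score = log_pi[path[0]]
    for t in range(1, len(path)):
        score += log_A[path[t - 1], path[t]] + log_b[t - 1, path[t]]
    return score
```

Used with the Viterbi path of model i, this log score plays the role of g_i(X, q; Λ) and can be plugged directly into the misclassification measure of Eq. (7).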

3.3. Relation of MCE and MMI

Maximum mutual information (MMI) [3] is another powerful approach used in speech recognition. The MMI approach is based on the mutual information I(W_c; X) between the acoustic observation X and its correct lexical symbol W_c. For the N class classification problem, the mutual information has the following form:

I(W_c; X) = \log \frac{p(W_c, X)}{p(W_c) \, p(X)} = \log \frac{p(X | W_c)}{\sum_{k=1}^{N} p(W_k) \, p(X | W_k)},   (14)

where W_k runs over all possible N class symbols, and log p(X | W_c) and log p(X | W_k) are the log-likelihood scores of X on the correct lexical symbol and on the k-th lexical symbol, respectively. From Eq. (14),

\log P(W_c | X) = \log \frac{p(W_c, X)}{p(X)} = I(W_c; X) + \log p(W_c),   (15)

which relates I(W_c; X) to the posterior probability p(W_c | X). In MMI training, the criterion for classifier design and parameter estimation is to maximize the average mutual information I(W_c; X) on the training set. The relation between MMI and MCE is a very interesting topic and was studied in [63, 64]. It was found that, under certain conditions, direct comparisons can be made between the two approaches. In particular, when η = 1, the loss function used in the MMI approach relates to the misclassification measure in the MCE approach as follows [63]:

I(W_c; X) = \log \frac{1}{1 + e^{d_c(X) + \log(N-1)}} + \log N,   (16)

where d_c(X) is the misclassification measure in MCE for the correct class W_c. From the form of the loss functions used in MCE and MMI, it is easy to see that the MMI approach is based on modeling the posterior distribution, which relates to the optimal decision boundary of the classifier through Bayes theory. The MCE approach, on the other hand, is based on an explicit modeling of the decision boundary, which has a direct relation to the recognition error rate. This direct relation to the recognition error rate in the MCE approach has several advantages. It is meaningful in the sense of minimizing the empirical recognition error rate of the classifier, and this property does not depend on the parametric form of the discriminant function nor on its relation to the form of the true class posterior distribution. If the true class posterior distributions are used as discriminant functions, the asymptotic behavior of the classifier will approximate the minimum Bayes risk.
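The relation in Eq. (16) can be checked numerically; the toy sketch below assumes uniform class priors, η = 1, and discriminant scores g_k = log p(W_k) + log p(X | W_k); all values are illustrative.

```python
import numpy as np

N = 4
log_lik = np.array([-10.0, -12.5, -11.0, -13.2])   # log p(X | W_k), illustrative values
log_prior = np.full(N, -np.log(N))                  # uniform log p(W_k)
g = log_lik + log_prior                              # discriminant scores g_k
c = 0                                                # index of the correct class W_c

# Mutual information computed directly: I(W_c; X) = log p(X|W_c) - log sum_k p(W_k) p(X|W_k)
I_direct = log_lik[c] - np.log(np.sum(np.exp(g)))

# Right-hand side of Eq. (16) via the misclassification measure d_c with eta = 1
d_c = -g[c] + np.log(np.mean(np.exp(np.delete(g, c))))
I_via_mce = np.log(1.0 / (1.0 + np.exp(d_c + np.log(N - 1)))) + np.log(N)

print(I_direct, I_via_mce)    # the two values agree up to floating point error
```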

The significance of the MCE approach in speech recognition is twofold. First, a classifier design based on direct minimization of the recognition error rate is a meaningful alternative to the distribution estimation based approach. Second, the probability distributions (PDs) used in parametric modeling of speech are very limited compared to the true PDs of the signal source, and a decision rule based on the discriminant function approach is a reasonable alternative to the "plug-in" MAP rule, which relies on the model correctness assumption. More discussion and studies of these approaches are given in [20, 63].

4. EMBEDDED STRING MODEL BASED MCE APPROACH

In continuous speech recognition using sub-word model units, X is a concatenated string of observations belonging to different classes. For example, a sentence is a sequence of words, each of which is to be modeled by a distribution. The decoding process in continuous speech recognition compares (implicitly) all possible (word or sub-word) string models, and the word string whose string model has the highest likelihood score is chosen as the decoded string. The likelihood score of a word string is typically a combination of scores from various models, including scores from the acoustic model, language model, duration model, etc. The main reason for adopting this type of string model is simply that the basic speech recognition model units, which are used to form string models, can be estimated from a finite amount of available training data. On the other hand, the use of long term language models and context dependent acoustic models has extended the modeling dependencies beyond the level of individual words to phrase groups or even to the whole utterance level. In [15], an MCE approach to classifier design based on string models was proposed and studied. The string model for a given word string S in an HMM based speech recognition system using continuous observation densities is given by

\bar{S}_Q = \arg\max_{S_Q} \log f(X, Θ_{S_Q} | S_Q),   (17)

where S_Q is a string model for word string S, Θ_{S_Q} is the optimal state sequence in the string model S_Q, and log f(X, Θ_{S_Q} | S_Q) is the log-likelihood score along the optimal state sequence Θ_{S_Q}. The discriminant functions, for k = 1, ..., N, are

g(X; S_k; Λ) = \log f(X, Θ_{S_k}, S_k | Λ),   (18)

where S_k is the k-th best string, Λ is the HMM set used in the N-best decoding, Θ_{S_k} is the optimal path (state sequence) of the k-th string given the model set Λ, and log f(X, Θ_{S_k}, S_k | Λ) is the related log-likelihood score on the optimal path of the k-th string. For the correct string S_{lex}, the discriminant function is given by

g(X; S_{lex}; Λ) = \log f(X, Θ_{S_{lex}}, S_{lex} | Λ),   (19)

where S_{lex} is the correct string, Θ_{S_{lex}} is the optimal alignment path, and log f(X, Θ_{S_{lex}}, S_{lex} | Λ) is the corresponding log-likelihood score. These discriminant functions are

embedded in the MCE based loss function. The embedded string model based MCE approach is well suited for acoustic modeling using detailed context dependent models. It can describe various context dependencies such as triphone, quinphone, etc. In the embedded string model based MCE approach, the long term dependencies are embedded in the basic speech recognition model units even when their original context dependency definitions are not. It is observed in experiments that many mono-phone based context independent model units obtained from the MCE approach exhibit the speech recognition performance of context dependent model units [25]. The embedded string model based MCE approach has found applications in various recognition tasks, and significant error rate reductions were observed [15, 16, 25, 55, 33, 21].
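As a rough illustration of how the string-level discriminant functions of Eqs. (18) and (19) enter the MCE loss, the sketch below forms a string-level misclassification measure from an N-best list of decoder scores; the function name, the value of η, and the example scores are assumptions, not the exact procedure of [15].

```python
import numpy as np

def string_misclassification(correct_score, nbest_scores, eta=2.0):
    """String-level analogue of Eq. (7).

    correct_score : log f(X, Theta_lex, S_lex | Lambda) for the correct transcription, Eq. (19)
    nbest_scores  : log f(X, Theta_k, S_k | Lambda) for the N competing strings, Eq. (18)
    """
    nbest_scores = np.asarray(nbest_scores, dtype=float)
    m = np.max(nbest_scores)
    # numerically stable form of (1/eta) * log( mean( exp(eta * g_j) ) )
    anti = m + (1.0 / eta) * np.log(np.mean(np.exp(eta * (nbest_scores - m))))
    return -correct_score + anti   # > 0 suggests the N-best competitors outscore the truth

# Example with a correct-string score and a 3-best competitor list
print(string_misclassification(-1050.0, [-1052.0, -1055.5, -1060.0]))
```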

5. COMBINED STRING MODEL BASED MCE APPROACH

In speech recognition, the final decision is based on the combination of scores from various knowledge sources represented by different models. Assuming independence of each model, the final score in the logarithm domain becomes a sum of log-likelihood scores from the individual models. In particular, in addition to the acoustic model, if a language model is used and its score is weighted by a weighting factor λ_L, the final likelihood score of a candidate string is

\log f(X, W) = \log f(X | W) + λ_L \log P(W).   (20)

If the model correctness assumption is valid, the log-likelihood score should strictly follow Eq. (20) with the score weighting factor λ_L = 1. However, in speech recognition experiments and applications, it is found that a value of λ_L with λ_L ≠ 1 gives much better recognition performance [42], an indication that the true distribution of the signal source departs from the assumption made by the model. The combined string model is an MCE approach based on a discriminant function defined at the level of the global combined string model. Model combination can proceed in two directions. One is horizontal: scores from multiple models and different knowledge sources are combined to form the final score, where each individual model may be estimated separately by different estimation methods, including using different training data and constraints [9]. The other important direction is to estimate the parameters of each individual model as an integrated component of the final combined string model, in which the discriminant function is constructed at the combined string model level. The estimation of the parameters of each individual model is achieved by tracing down the model combination tree to each of its leaf nodes following a chain-rule-like relationship [17, 9, 38, 73].

5.1. Discriminative Model Combination (DMC)

In speech recognition, multiple signal sources from multiple signal bands are also used [53].

Many of these models or knowledge sources may not be based on probabilities, and a discriminant function based approach is a suitable choice for model combination. Let {Λ_1, ..., Λ_M} be the individual model components in the model combination. We use the notation G(X | Λ_1, ..., Λ_M) to denote the combined string model given random input X, where G is the function selected for model combination. If G is linear,

G(X | Λ_1, ..., Λ_M) = \sum_{k=1}^{M} λ_k \, g(X | Λ_k),   (21)

where λ_k is the combination weight and g(X | Λ_k) is the score from the k-th model. Discriminative model combination based on the MCE approach embeds the combined string model based discriminant function G(X | Λ_1, ..., Λ_M) in the error rate based loss function and estimates the model combination factors λ_1, ..., λ_M as parameters of the combined string model. Many optimization methods can be applied to estimate the λ_1, ..., λ_M which minimize the expected loss; the popular GPD algorithm has a very simple form in this case [9]. DMC has been applied in many applications under the names of combined string model [17], discriminative model combination [9] and unified stochastic engine [29]. The MCE based discriminant function approach provides a goodness criterion for estimating and adjusting the "tuning parameters" in speech recognition, especially when either the model correctness assumption is not valid or a unified framework is needed to combine knowledge sources that are different in origin or nature.
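A minimal sketch of the linear combination of Eq. (21) together with a GPD-style update of the combination weights λ_k on a single utterance, using a 1-best competitor as an approximation of the misclassification measure; this is an illustration under stated assumptions, not the exact procedure of [9] or [17].

```python
import numpy as np

def combined_score(model_scores, lambdas):
    """Linear model combination of Eq. (21): G(X) = sum_k lambda_k * g(X | Lambda_k)."""
    return float(np.dot(lambdas, model_scores))

def sigmoid(d, gamma=1.0):
    return 1.0 / (1.0 + np.exp(-gamma * d))

def update_weights(lambdas, scores_correct, scores_competitor, eps=0.01, gamma=1.0):
    """One GPD-style step on the combination weights for a single training utterance.

    scores_correct / scores_competitor : per-model scores g(X | Lambda_k) for the correct
    string and its strongest competitor (a 1-best stand-in for the misclassification measure).
    """
    lambdas = np.asarray(lambdas, dtype=float)
    diff = np.asarray(scores_competitor, dtype=float) - np.asarray(scores_correct, dtype=float)
    d = float(np.dot(lambdas, diff))                 # competitor minus correct combined score
    l = sigmoid(d, gamma)
    grad = gamma * l * (1.0 - l) * diff              # gradient of the sigmoid loss w.r.t. lambda_k
    return lambdas - eps * grad
```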

5.2. Discriminative Feature Extraction

The goal of discriminative feature extraction is to perform the speech recognition feature extraction from the standpoint of minimizing the recognition error rate for classification by machines. In place of the Bark scale from hearing, a new frequency scaling [62] can be derived. Feature extraction can be made discriminative based on the combined string model MCE paradigm. Since the speech recognition feature vectors are part of the combined string model, a goodness criterion for this approach can be derived from the relation between minimization of the expected loss and the recognition error rate in the MCE formulation. One application of discriminative feature extraction is in the design of cepstral lifters [10]. The justification of this type of lifter for cepstral feature vectors is given in [31]. In discriminative feature extraction, lifter design can be done according to the combined string model formulation. Instead of relying on human hearing capabilities for choosing the right lifter, the lifter parameters can be estimated using the MCE criterion to minimize the recognition error rate. Speech experiments using a discriminative feature extraction based lifter were performed in several tasks [10, 38].

6. VERIFICATION AND IDENTIFICATION

Speaker verification and identification based on voice is an important area in speech research and has been studied for several decades. In making a decision regarding the true identity of the speaker, two types of errors may occur: false rejection (type I error) and false acceptance (type II error). The problem can be conveniently formulated as a statistical hypothesis testing problem, testing the null hypothesis H_0 (i.e. X from source S_0) against the alternative hypothesis H_1 (i.e. X from source S_1). There are plenty of studies available in the statistical literature regarding the design of optimal tests if P(X | H_0) and P(X | H_1) are known and fall into some specific distributions, such as the exponential family [43]. In practice, the test procedure is often based on a test statistic T(X) such that H_0 is rejected if T(X) ≥ k. According to the Neyman-Pearson Lemma, T(X) can be based on the probability ratio test T(X) = P(X | H_1)/P(X | H_0) or the likelihood ratio test T(X) = f(X | H_1)/f(X | H_0), and k is selected such that the level of the type I error P(T(X) ≥ k | H_0) = α. However, in practical verification problems, we have no exact knowledge of the distributions of the null and alternative hypotheses. Discriminant function based methods can be applied in designing the verifier.

6.1. Speaker Verification and Identification

In speaker verification and identification, the decision is based on the score of the test statistic T(X). For the speaker identification problem, speaker I is identified as the true speaker if

I = \arg\max_q T(X; q).   (22)

It can be seen that speaker identification can be formulated as an M class classification problem using the associated test statistics. A suitable misclassification measure can be constructed from the test statistics [48]. For speaker verification, we usually accept the claimed speaker identity I if the test statistic for the I-th speaker satisfies

T(X; I) > τ_I,   (23)

where τ_I is the decision threshold for speaker I.
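A small sketch of the identification rule of Eq. (22) and the thresholded acceptance rule of Eq. (23), assuming a log-likelihood-ratio style test statistic computed from a claimed-speaker score and a cohort (or background) score; the threshold value and score convention are illustrative.

```python
import numpy as np

def identify_speaker(test_stats):
    """Eq. (22): pick the speaker q whose test statistic T(X, q) is largest."""
    return int(np.argmax(test_stats))

def verify_speaker(score_claimed, score_cohort, threshold=0.0):
    """Eq. (23) with a log-likelihood-ratio style statistic:
    accept the claimed identity I if T(X, I) = g_I(X) - G_I(X) exceeds the threshold."""
    t = score_claimed - score_cohort
    return t > threshold, t

# Example: claimed-speaker log score versus cohort log score
print(verify_speaker(-210.3, -215.8))      # accepted, T = 5.5
```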

For speaker verification, the verification error can also be characterized by a mis-verification measure based on the test statistic T(X; I), similar to the misclassification measure in recognition. The MCE based discriminant function approach can be adapted to minimize the verification error [48, 41]. In addition to addressing the identification and verification problem as a special classification problem, methods of introducing the structure of the statistical test into the discriminant function based approach have also been attempted. Minimum verification error (MVE) training is such an approach [51, 69]. The main difference from the MCE approach is the use of two separate loss functions to model the two types of errors in hypothesis testing. We describe the details of the MVE approach below to exemplify the discriminant function approach to speaker verification and identification.

The mis-verification measure, as opposed to the misclassification measure, for the q-th speaker is defined as follows:

d_q(X; Λ) = \begin{cases} -g_q(X; Λ) + G_q(X; Λ), & X \text{ from speaker } q, \\ -G_q(X; Λ) + g_q(X; Λ), & \text{otherwise}, \end{cases}   (24)

where g_q(X; Λ) is the log-likelihood score from the claimed q-th speaker model and G_q(X; Λ) is the score from the cohort speaker group C_q̄. The mis-verification measure is embedded in a smooth sigmoid based loss function ℓ(d(X; Λ)) as in the MCE formulation. Two separate loss functions are used to describe the type I and type II errors. The average loss for each type of error approximates the empirical type I and type II error rates on the training samples:

L_1(X; Λ) = \frac{1}{N_1} \sum_{i=1}^{N_1} ℓ(d(X_i; Λ)) \, 1(X_i \in \text{Claimed Speaker}),   (25)

and

L_2(X; Λ) = \frac{1}{N_2} \sum_{i=1}^{N_2} ℓ(d(X_i; Λ)) \, 1(X_i \notin \text{Claimed Speaker}),   (26)

where 1(·) is the indicator function.
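A minimal sketch of the mis-verification measure of Eq. (24) and the empirical type I and type II losses of Eqs. (25)-(26), assuming claimed-speaker and cohort log-likelihood scores are available for each trial; all names are illustrative.

```python
import numpy as np

def misverification_measure(g_claimed, g_cohort, is_true_speaker):
    """Eq. (24): the sign of the score difference depends on whether X is really the claimed speaker."""
    return (-g_claimed + g_cohort) if is_true_speaker else (-g_cohort + g_claimed)

def mve_losses(samples, gamma=1.0):
    """Empirical type I and type II losses of Eqs. (25)-(26).

    samples : iterable of (g_claimed, g_cohort, is_true_speaker) triples
    """
    l1, l2, n1, n2 = 0.0, 0.0, 0, 0
    for g_c, g_b, true_spk in samples:
        loss = 1.0 / (1.0 + np.exp(-gamma * misverification_measure(g_c, g_b, true_spk)))
        if true_spk:
            l1, n1 = l1 + loss, n1 + 1     # contributes to false rejections of the true speaker
        else:
            l2, n2 = l2 + loss, n2 + 1     # contributes to false acceptances of impostors
    return l1 / max(n1, 1), l2 / max(n2, 1)
```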

The overall expected loss of MVE training is given by

L(X; Λ) = α_1 L_1(X; Λ) + α_2 L_2(X; Λ),   (27)

where α_1 and α_2 are design parameters which control the influence of the type I and type II errors in the overall loss function. The model parameter estimation in MVE training minimizes the expected loss of Eq. (27), which relates to the minimization of the empirical type I and type II error rates. Various speaker verification experiments have been conducted, and the discriminant function based approach demonstrates significant performance advantages over the distribution estimation based approach [48, 26, 41].

7. SUMMARY

In this paper, the discriminant function based MCE approach was introduced as an alternative to the distribution estimation based approach from Bayes decision theory. In the MCE approach, the classifier design is to find a set of parameters which minimizes the empirical recognition error rate. This is achieved through a special loss function for which minimizing the expected loss relates to the reduction of the recognition error rate. The discriminant function based MCE approach applies to cases where the traditional distribution estimation based approach does not apply, especially when the family of discriminant functions used in the classifier is not based on probability distributions. The goodness of this approach is justified without the model correctness assumption, and it applies to cases where the model correctness assumption is known to be invalid. Although significant progress has been made during the last ten years, many issues in the discriminant function based MCE approach are still open, and we are just beginning to look at new potentials of this approach in speech recognition.

8. REFERENCES 1. S.-I. Amari, “A Theory of Adaptive Pattern Classifiers”, IEEE Trans. on Electronic Computers, Vol. 16, No. 3, pp. 299–307, 1967. 2. S.-I. Amari “Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements” IEEE Transactions on Computers, Vol. C-12, No. 11, 1197–1206, November 1972. 3. L. R. Bahl, P. F. Brown, P. V. deSouza and R. L. Mercer, “Maximum mutual information estimation of HMM parameters for speech recognition”, Proceedings of ICASSP-86, pp49–52, 1986. 4. L. R. Bahl, P. F. Brown, P. V. deSouza and R. L. Mercer, “Estimating Hidden Markov Model Parameters so as to Maximize Speech Recognition Accuracy”, IEEE Trans. Speech and Audio Processing, Vol. 1, No. 1, pp. 77–83, 1993. 5. L. R. Bahl, F. Jelinek and R. L. Mercer, “A maximum likelihood approach to continuous speech recognition” IEEE Transactions on Pattern and Machine Intelligence, Vol. PAMI-5, pp. 79–190, Mar, 1983. 6. L. E. Baum and J. A. Eagon “An inequality with applications to statistical predication for functions of Markov process and to model of ecology”, Bull. Amer. Math Soc.,, Vol. 73, pp. 360–363, 1967. 7. L. E. Baum and G. Sell “Growth transformations for functions on manifolds” Pacific J. Math., Vol. 27, No 2, pp. 211–227, 1968. 8. Bickel and Doksum “Mathematical Statistics” PrenticeHall, Inc., 1977. 9. P. Beyerlein “Discriminative Model Combination” Proc. 1997 Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 238–245, Dec. 1997. 10. A. Biem, S. Katagiri and B.-H. Juang “Pattern recognition using discriminative feature extraction”, IEEE Trans. Signal Processing, vol. 45, pp. 500–504, Feb. 1997. 11. J.R. Blum, “Multidimensional Stochastic Approximation Methods”, Ann. Math. Stat. Vol 25, pp. 737–744, 1954. 12. P.-C. Chang, and B.-H. Juang “Discriminative Template Training for Dynamic Programming Speech Recognition” Proc. ICASSP92, Vol. 1, pp. 493 – 496, 1992 13. P.-C. Chang, and B.-H. Juang “Discriminative Training for Dynamic Programming based Speech Recognizers” IEEE Trans. Speech and Audio Processing, SAP-1(2):135-143, April 1993. 14. W. Chou, C.-H. Lee and B. H. Juang, “Segmental GPD training of an hidden Markov model based speech recognizer,” IEEE Proc. ICASSP-92, pp. 473–476, Apr. 1992. 15. W. Chou, C.-H. Lee and B.-H. Juang, “Minimum error rate training based on N-best string models,” IEEE Proc. ICASSP-93, II-652–655, April 1993. 16. W. Chou, C-H. Lee and B-H. Juang, “Minimum error rate training of inter-word context dependent acoustic model units in speech recognition”, Proc. ICSLP-94 pp. 439–442, Yokohama, 1994. 17. W. Chou, C.-H. Lee and B.-H. Juang “Speech Recognition based on Combined String Models”, Proc. DARPA ANN Tech. Program CSR Mtg., pp 65–69 Sept. 1992. 18. W. Chou, C.-H. Lee, B.-H. Juang and F. K. Soong, “A Minimum Error Rate Pattern Recognition Approach to Speech Recognition”, International Journal of Pattern Recognition and Artificial Intelligence Vol. 8 No. 1, pp. 5–31, 1994. 19. W. Chou, B.-H. Juang, “Adaptive Discriminative Learning in Pattern Recognition”, Technical Report of AT&T Bell Laboratories, 1992

20. W. Chou, “Minimum Classification Error Rate Based Discriminant Function Approach to Speech Recognition”, Proc. IEEE to appear. 21. S. M. Chu and Y. Zhao “Robust Speech Recognition Using Discriminative Stream Weighting and Parameter Interpolation” IEEE Proc. ICSLP’98, pp. 690–694, Nov. 1998. 22. Hai Do-Tu and Michael Installe “Learning Algorithms for Nonparametric Solution to Minimum Error Classification Problem,” IEEE Transactions on Computers, Vol. C-27, No. 7, 648–659, July 1978. 23. Y. Ephraim and L. Rabiner “On the Relation Between Modeling Approaches for Speech Recognition” IEEE Transactions on Information Theory, vol. 36, No. 2, pp. 372–380, March 1990. 24. Y. Ephraim and A. Dembo and L. Rabiner “A minimum Discrimination Information Approach for Hidden Markov Modeling” IEEE Transactions on Information Theory, vol. 35, No. 5, pp. 1001–1013, March 1989. 25. M. B. Gandhi and J. Jacob “Natural Number Recognition Using MCE Trained Inter-Word Context Dependent Acoustic Models” IEEE Proc. ICASSP’98, pp. 457–461, 1998. 26. C. M. del Alamo et al. “Discriminative Training of GMM for Speaker Identification” IEEE Proc. ICASSP’96, pp. 89– 93, 1996. 27. P.S. Gopalakrishnan, et al., “An Inequality for Rational Functions with Applications to Some Statistical Estimation Problems", IEEE Trans. on Information Theory, vol. 37, no. 1. pp. 107-113, 1991. 28. Q. Huo and C. Chan “The Gradient Projection Method for the Training of Hidden Markov Models”, Speech Communication, Vol. 13, pp. 307–313, 1993. 29. X. Huang, M. Belin, F. Alleva and M. Hwang “Unified Stochastic Engine (USE) for Speech Recognition” IEEE Proc. ICASSP-93, pp. 636–639, Apr. 1992. 30. F. Jelinek Statistical Methods for Speech Recognition, The MIT Press, Cambridge, 1997. 31. B.-H. Juang, L. R. Rabiner and J. G. Wilpon, “On the use of bandpass littering in speech recognition,” IEEE Trans. Acoust. Speech Signal Processing, ASSP-35, No. 7, 947– 954, 1987. 32. B.-H. Juang and S. Katagiri “ Discriminative Learning for Minimum Error Training”, IEEE Trans. Acoust., Speech & Sig. Proc., 40(12): 3043-3054, Dec, 1992. 33. B.-H. Juang, W. Chou and C.-H. Lee, “Minimum classification error rate methods for speech recognition”, IEEE Trans. on Speech and Audio Processing, 5(3), May 1997. 34. D. Kavensky, “Generalization of Baum Algorithm to Functions on Non-linear Manifolds”, Proc. ICASSP’95 Vol 1. pp. 473-476, Detroit, 1995. 35. T. Kawahara, C.-H. Lee and B.-H. Juang, “Key-phrase detection and verification for flexible speech understanding”, IEEE Transactions on Audio and Speech Processing 1998. 36. T. Komori and S. Katagiri “Application of a Generalized Probabilistic Descent Method of Dynamic Time Warping Based Speech Recognition” IEEE Proc. ICASSP-92, pp. 497–500, 1992. 37. S. Katagiri, C.-H. Lee, B.-H. Juang and T. Komori “New Discriminative Training Algorithms Based on a Generalized Probabilistic Descent Method’, Proc. IEEE-SP Workshop on Neural Networks for Signal Processing, Princeton, 1991. 38. S. Katagiri, B.-H. Juang and A. Biem “Discriminative feature extraction”, Artificial Neural Networks for Speech and Vision, R. Mammone, Ed. London, U.K. Chapman and Hall, 1994.

39. S. Katagiri, B.-H. Juang and C.-H. Lee, “Pattern recognition using a family of design algorithms based upon the generalized probability descent method”, IEEE Proceedings, 86(11): pp. 2345-2373, 1998. 40. M.-W. Koo, C.-H. Lee and B.-H. Juang, “Speech recognition and utterance verification based on a generalized confidence score”, IEEE Transactions on Speech and Audio Processing, 1999. 41. F. Kormanzskiy and B.-H. Juang “Discriminative Adaptation for Speaker Verification IEEE Proc. ICSLP’96, pp. 1744–1747, 1996. 42. K.-F. Lee “The Development of the SPHINX System”, Kluwer, 1989. 43. E. L. Lehmann, Testing Statistical Hypotheses, Wiley, New York, 1959. 44. E. Lleida and R. Rose, “Efficient decoding and training procedures for utterance verification in continuous speech recognition”, Proc. ICASSP’96, May 1996. 45. C.-H. Lee, B.-H. Juang, W. Chou and J.J. Molina-Perez, “A study on task-independent subword selection and modeling for speech recognition”, Proc. ICSLP96, pages 1816-1819, Philadelphia, 1996. 46. C.-H. Lee, F. K. Soong and Paliwal Eds., “Automatic Speech and Speaker Recognition”, Norwell, MA, Kluwer, 1996. 47. C.-H. Lee, “A tutorial on speaker and speech verification”, Proc. NORSIG’98, pp. 9-16, June., 1998. 48. C.-S. Liu, C.-H. Lee, W. Chou, B.-H. Juang and A. Rosenberg, “A Study on Minimum Error Discriminative Training For Speaker Recognition”, J. Acoust. Soc. Am. Vol.97, No. 1, pp. 637–648, 1995. 49. M. G. Rahim and C.-H. Lee “Simultaneous ANN feature and HMM recognizer design using string-based minimum classification training of HMMs”, Proc. ICSLP’96, pp. 1824–1827, 1996. 50. M. G. Rahim, C.-H. Lee, B.-H. Juang and W. Chou “Discriminative utterance verification using minimum string verification error (MSVE) training”, Proc. ICASSP’96, pp. 3485–3588, 1996. 51. M. G. Rahim and C.-H. Lee “String based minimum verification error (SB-MVE) training for flexible speech recognition”, Computer, Speech and Language, 1997. 52. M. G. Rahim, C.-H. Lee and B.-H. Juang “Discriminative utterance verification for connected digit recognition”, IEEE Transaction on Speech and Audio Processing, 5(3), 1997. 53. P. McMahon, N. Harte, S. Vaseghi and P. McCourt “Discriminative Spectral-Temporal Multi-Resolution Features for Speech Recognition” IEEE Proc. ICASSP’99, pp. 1649– 1653, 1999. 54. E. McDemot and S. Katagiri “Prototype-based MCE/GPD training for various speech units”, Comput. Speech Language, vol 8, pp. 351–368, 1994.

55. E. McDermott and S. Katagiri, "String-level MCE for continuous phoneme recognition", Proc. EuroSpeech'97, Vol. 1, pp. 123–126, 1997.
56. A. Nadas, D. Nahamoo and M. A. Picheny, "On a Model-Robust Training Method for Speech Recognition", IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 36, No. 9, pp. 1432–1436, 1988.
57. Y. Normandin, et al., "High Performance Connected Digit Recognition Using Maximum Mutual Information Estimation", IEEE Trans. on Speech and Audio Processing, Vol. 2, No. 2, pp. 229–311.
58. K. A. Papineni, "Discriminative Training Via Linear Programming", Proc. ICASSP'99.
59. K. K. Paliwal, M. Bacchiani and Y. Sagisaka, "Minimum classification error training algorithm for feature extraction and pattern classifier in speech recognition", Proc. EuroSpeech'95, pp. 541–554.
60. C. Rathinavelu and L. Deng, "Use of Generalized Dynamic Feature Parameters for Speech Recognition: Maximum Likelihood and Minimum Classification Error Approaches", IEEE Proc. ICASSP'95, pp. 373–376, 1995.
61. C. Rathinavelu and L. Deng, "The Trend HMM with Discriminative Training for Phonetic Classification", IEEE Proc. ICSLP'96.
62. C. Rathinavelu and L. Deng, "HMM based Speech Recognition Using State-Dependent, Discriminatively Derived Transforms on Mel-Warped DFT Features", IEEE Proc. ICASSP'96, pp. 9–13, 1996.
63. W. Reichl and G. Ruske, "Discriminative Training for Continuous Speech Recognition", Proc. EuroSpeech'95, Vol. 1, pp. 537–540, Madrid, Sept. 1995.
64. R. Schluter and W. Macherey, "Comparison of Discriminative Training Criteria", IEEE Proc. ICASSP'98, pp. 493–497, 1998.
65. R. Schluter, W. Macherey, S. Kanthak, H. Ney and L. Welling, "Comparison of Optimization Methods for Discriminative Training Criteria", IEEE Proc. EuroSpeech'97, pp. 15–18, Sept. 1997.
66. R. Schluter, B. Mueller, F. Wessel and H. Ney, "Interdependence of Language Models and Discriminative Training", Proc. ASRU'99, pp. 85–89, Dec. 1999.
67. A. R. Setlur, R. A. Sukkar and J. Jacob, "Correcting Recognition Errors Via Discriminative Utterance Verification", IEEE Proc. ICSLP'96.
68. R. Sukkar and C.-H. Lee, "Vocabulary independent discriminative utterance verification for non-keyword rejection in subword based speech recognition", IEEE Transactions on Speech and Audio Processing, Vol. 4, pp. 420–429, Nov. 1996.
69. R. Sukkar, "Subword Based Minimum Verification Error (SB-MVE) Training for Task Independent Utterance Verification", IEEE Proc. ICASSP'98, pp. 229–233, 1998.
70. R. Sukkar, M. Rahim and C.-H. Lee, "Utterance verification of keyword strings using word based minimum verification error (WB-MVE) training", Proc. ICASSP'96, pp. 516–519, May 1996.
71. R. Sukkar, A. R. Setlur, C.-H. Lee and J. Jacob, "Verifying and correcting string hypotheses using discriminative utterance verification", Speech Communication, 22:333–342, 1997.
72. Ya. Z. Tsypkin, "Self-Learning – What is It?", IEEE Transactions on Automatic Control, Vol. AC-13, No. 6, pp. 608–612, December 1968.
73. V. Warnke, S. Harbeck, E. Noth, H. Niemann and M. Levit, "Discriminative Estimation of Interpolation Parameters for Language Model Classifiers", IEEE Proc. ICASSP'99, pp. 1223–1227, Mar. 1999.
74. V. Valtchev, J. J. Odell, P. C. Woodland and S. J. Young, "Lattice-Based Discriminative Training for Large Vocabulary Speech Recognition", IEEE Proc. ICASSP'96, pp. 605–608, May 1996.
