IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 11, NO. 1, JANUARY 2003
Linear Regression Based Bayesian Predictive Classification for Speech Recognition

Jen-Tzung Chien, Member, IEEE
Abstract—The uncertainty in parameter estimation due to adverse environments deteriorates the classification performance of speech recognition. It becomes crucial to incorporate the parameter uncertainty into the decision so that the classification robustness can be assured. In this paper, we propose a novel linear regression based Bayesian predictive classification (LRBPC) for robust speech recognition. This framework is constructed under the paradigm of linear regression adaptation of speech hidden Markov models (HMMs). Because the regression mapping between HMMs and adaptation data is ill posed, we properly characterize the uncertainty of the regression parameters using a joint Gaussian distribution. A closed-form predictive distribution can be derived to set up the LRBPC decision for speech recognition. Such a decision is robust compared to the plug-in maximum a posteriori (MAP) decision adopted in maximum likelihood linear regression (MLLR) and MAP linear regression (MAPLR). Since the specified distribution belongs to the conjugate prior family, evolutionary hyperparameters are established. With the statistically rich hyperparameters, the LRBPC achieves decision robustness. In the experiments, we find that the LRBPC decision, in the cases of both general linear regression and single variable linear regression, attains significantly better recognition performance than MLLR and MAPLR adaptation.

Index Terms—Bayesian predictive classification, conjugate prior distribution, joint Gaussian distribution, linear regression model, speech recognition.
I. INTRODUCTION
ROBUSTNESS is a crucial issue for speech recognition in real-world applications because the mismatch between training and testing data always exists and degrades the recognition performance considerably. The mismatch generally comes from the variabilities of inter- and intra-speaker characteristics, transducers/channels, ambient noises, etc. One may collect environment-specific adaptation data to adjust the speaker-independent (SI) hidden Markov models (HMMs) to fit the acoustics of the test speaker/transducer/noise [13], so that the desirable recognition performance can be obtained. In the literature, maximum likelihood linear regression (MLLR) adaptation [15] has been recognized as an effective approach to environmental adaptation. MLLR is a transformation-based adaptation where the overall HMMs are transformed via cluster-dependent linear regression functions estimated by the maximum likelihood (ML) theory. However, since the collected adaptation data may be sparse and suffer from noise contamination and error-prone decoding, the transformation from training to testing environments is subject to nonuniqueness and noncontinuity conditions; the model transformation becomes an ill-posed reconstruction problem. This problem causes difficulties in the estimation of least-error regression parameters, especially in the case of unsupervised adaptation. To deal with the uncertainty/perturbation in parameter estimation, one useful scheme is to allow the parameter randomness to be put into the decision rule. The minimax classification [19], [21] and Bayesian predictive classification (BPC) [1], [5], [12], [16], [18] are feasible means of attaining robust decision rules. The minimax classification was presented to minimize the worst-case probability of error. Using BPC, the decision criterion is established by considering the parameter randomness via a probabilistic distribution. In this study, we exploit a linear regression based Bayesian predictive classification (LRBPC) algorithm, which unifies the effectiveness of linear regression adaptation and the robustness of the BPC decision, to reinforce the performance of speech recognition.

Manuscript received March 9, 2001; revised September 30, 2002. This work was supported in part by the National Science Council, Taiwan, R.O.C., under Contract NSC90-2213-E006-048. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jerome R. Bellegarda. The author is with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan 70101, R.O.C. (e-mail: [email protected]). Digital Object Identifier 10.1109/TSA.2002.805640

Assuming we are given a set of continuous-density HMMs $\Lambda$, the state observation probability density function (pdf) of the time sample $\mathbf{o}_t$ is modeled by a mixture of multivariate Gaussian distributions

$$p(\mathbf{o}_t \mid s_t = i, \Lambda) = \sum_{k=1}^{K} \omega_{ik}\, N(\mathbf{o}_t;\, \boldsymbol{\mu}_{ik},\, \boldsymbol{\Sigma}_{ik}) \tag{1}$$

where $\omega_{ik}$ is the mixture gain with $\sum_{k=1}^{K}\omega_{ik} = 1$, $\boldsymbol{\mu}_{ik}$ is the $d$-dimensional mean vector, $\boldsymbol{\Sigma}_{ik}$ is the covariance matrix, and $N(\cdot)$ is the Gaussian distribution denoted by
$$N(\mathbf{o}_t;\, \boldsymbol{\mu}_{ik},\, \boldsymbol{\Sigma}_{ik}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_{ik}|^{1/2}} \exp\Big[-\frac{1}{2}(\mathbf{o}_t - \boldsymbol{\mu}_{ik})^{T}\boldsymbol{\Sigma}_{ik}^{-1}(\mathbf{o}_t - \boldsymbol{\mu}_{ik})\Big]. \tag{2}$$

Using the linear regression adaptation [15], we attempt to adapt the HMM mean vector $\boldsymbol{\mu}_{ik}$ of state $i$ and mixture component $k$ by applying a cluster-dependent linear regression matrix $\mathbf{W}_c$ to the extended mean vector $\boldsymbol{\xi}_{ik} = [1\;\; \boldsymbol{\mu}_{ik}^{T}]^{T}$. The adapted HMM parameters are acoustically close to the test environments. With the adapted mean vector, the observation likelihood of $\mathbf{o}_t$ turns out to be

$$p(\mathbf{o}_t \mid s_t = i, \mathbf{W}_c, \Lambda) = \sum_{k=1}^{K} \omega_{ik}\, N(\mathbf{o}_t;\, \mathbf{W}_c\boldsymbol{\xi}_{ik},\, \boldsymbol{\Sigma}_{ik}). \tag{3}$$

Here, the HMM pdf is attributed to the cluster membership $c$, which could be defined either by phonological rules or by examining the acoustical closeness of individual HMM pdfs.
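As an illustration (not part of the original paper), the following Python sketch evaluates the adapted observation likelihood of (3) for one HMM state; the dimensions, parameter values, and function name are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def adapted_state_likelihood(o_t, weights, means, covs, W_c):
    """Observation likelihood of (3): a mixture of Gaussians whose means
    are adapted as W_c @ [1, mu]^T, with covariances left unchanged."""
    lik = 0.0
    for w, mu, cov in zip(weights, means, covs):
        xi = np.concatenate(([1.0], mu))   # extended mean vector xi_ik
        mu_hat = W_c @ xi                  # MLLR-adapted mean W_c xi_ik
        lik += w * multivariate_normal.pdf(o_t, mean=mu_hat, cov=cov)
    return lik

# toy example: d = 2, K = 2 mixtures, regression matrix = [0 | I] (no change)
d = 2
weights = np.array([0.6, 0.4])
means = [np.zeros(d), np.ones(d)]
covs = [np.eye(d), np.eye(d)]
W_c = np.hstack([np.zeros((d, 1)), np.eye(d)])
print(adapted_state_likelihood(np.array([0.5, 0.5]), weights, means, covs, W_c))
```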
Given some adaptation data $O = \{\mathbf{o}_1, \ldots, \mathbf{o}_T\}$, the MLLR estimates a set of regression parameters $\mathbf{W} = \{\mathbf{W}_c\}$ by maximizing the accumulated likelihood of $O$

$$\mathbf{W}_{\mathrm{ML}} = \arg\max_{\mathbf{W}}\, p(O \mid \mathbf{W}, \Lambda). \tag{4}$$
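For intuition only (this is not the paper's EM procedure), if the state/mixture alignment is fixed and the HMM covariances are taken as identity, the ML objective of (4) reduces to a weighted least-squares problem in the regression matrix. A minimal sketch under these simplifying assumptions:

```python
import numpy as np

def mllr_least_squares(obs, xis, gammas):
    """Simplified ML estimate of W_c under a fixed alignment and identity
    covariances: W_c = (sum_t g_t o_t xi_t^T) (sum_t g_t xi_t xi_t^T)^{-1}.
    obs: (T, d) frames; xis: (T, d+1) extended means aligned to the frames;
    gammas: (T,) occupation weights."""
    G = np.einsum('t,ti,tj->ij', gammas, obs, xis)   # d x (d+1)
    K = np.einsum('t,ti,tj->ij', gammas, xis, xis)   # (d+1) x (d+1)
    return G @ np.linalg.inv(K)

# toy usage: recover a known transform from lightly noisy data
rng = np.random.default_rng(0)
T, d = 100, 3
xis = np.hstack([np.ones((T, 1)), rng.normal(size=(T, d))])
W_true = rng.normal(size=(d, d + 1))
obs = xis @ W_true.T + 0.01 * rng.normal(size=(T, d))
print(np.allclose(mllr_least_squares(obs, xis, np.ones(T)), W_true, atol=0.1))
```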
Recently, the maximum a posteriori linear regression (MAPLR) was presented to improve MLLR because the ML theory often led to a biased estimate when using sparse data [3], [4], [7]. The MAP regression parameter is estimated as follows
$$\mathbf{W}_{\mathrm{MAP}} = \arg\max_{\mathbf{W}}\, p(O \mid \mathbf{W}, \Lambda)\, p(\mathbf{W} \mid \varphi) \tag{5}$$

where $p(\mathbf{W} \mid \varphi)$ is the prior pdf with hyperparameter $\varphi$ describing the parameter statistics. Since the estimation in (4) and (5) involves an incomplete data problem, the expectation–maximization (EM) algorithm [9] was employed to accomplish MLLR and MAPLR. Conventionally, the transformation-based adaptation using MLLR and MAPLR plugged the estimated regression parameters $\hat{\mathbf{W}}$ into the MAP decision rule to search the most likely word sequence $\mathcal{W}$ associated with the observation sequence $O$
$$\hat{\mathcal{W}} = \arg\max_{\mathcal{W}}\, p(O \mid \hat{\mathbf{W}}, \Lambda, \mathcal{W})\, P(\mathcal{W}). \tag{6}$$

Here, the parameters $\hat{\mathbf{W}}$, $\Lambda$, and $\mathcal{W}$ are assumed to be independent. The prior probability $P(\mathcal{W})$ of the word sequence corresponds to the language model. Using the plug-in MAP decision, the regression estimate is deterministic and acts as the true value to achieve the optimal Bayes decision. However, the inevitable estimation error due to the ill-posed transformation from the trained models to the testing environments makes the recognition performance unacceptable. One safe approach to compensating for the estimation error is to average over the uncertainty of the regression parameters and construct a new decision rule. Namely, we replace the likelihood function in the plug-in MAP decision with a predictive distribution [1], [12]
$$\tilde{p}(O \mid \mathcal{W}) = \int p(O \mid \mathbf{W}, \Lambda, \mathcal{W})\, p(\mathbf{W} \mid \varphi)\, d\mathbf{W} \tag{7}$$

taking an integral/expectation over the random regression parameter $\mathbf{W}$ with prior density $p(\mathbf{W} \mid \varphi)$. The LRBPC decision is accordingly explored for robust speech recognition. Moreover, we properly select the prior density from the conjugate prior family such that evolutionary hyperparameters are established to capture the latest information on parameter uncertainty.

II. LINEAR REGRESSION BASED BAYESIAN PREDICTIVE CLASSIFICATION

Nadas [22] first introduced the concept of the BPC decision into the application of speech recognition, where all training data had to be stored to determine the parameter randomness during recognition. Huo et al. [12] developed a BPC where the predictive pdf was computed by the Laplace method for integrals, without storage of training data. For HMM based speech recognizers, the
predictive pdf was approximated using a Viterbi search algorithm where the most likely HMM state and mixture component sequences were decoded to carry out the BPC decision [17]. A constrained uniform distribution was adopted to characterize the randomness of the HMM mean vector; the Gibbs [23] and Gaussian [17] distributions were also employed. Further, a Bayesian predictive adaptation was presented to tackle the uncertainty of the transformation parameter of the HMM mean vector, where a simple bias transformation was considered [24]. Recently, a transformation-based BPC geared with online prior evolution (TBPC-OPE) was proposed for robust speech recognition and online environmental learning [5]. The transformation uncertainty of both the HMM mean vector and covariance matrix was described using a multivariate normal-Wishart density, and a sequential learning algorithm of the prior statistics was derived to trace the changing environments. Our proposed LRBPC is also a realization of TBPC. Different from previous works, LRBPC focuses on developing a new BPC framework where the popular and effective linear regression model is adopted for model transformation. The uncertainty of the regression parameters is merged into the decision rule so as to improve on the MLLR using the plug-in MAP classifier.

A. LRBPC for HMM Based Speech Recognition

In HMM based speech recognition, the LRBPC should be implemented by resolving the missing data problem [9]. Namely, the predictive pdf in (7) is approximated through finding the best state sequence $\mathbf{s}$ and mixture component sequence $\mathbf{k}$

$$\tilde{p}(O \mid \mathcal{W}) \approx \max_{\mathbf{s},\mathbf{k}} \int p(O, \mathbf{s}, \mathbf{k} \mid \mathbf{W}, \Lambda, \mathcal{W})\, p(\mathbf{W} \mid \varphi)\, d\mathbf{W}. \tag{8}$$

A frame-synchronous Viterbi Bayesian search algorithm was presented to find the optimal sequences as well as the predictive pdf [16], [18]. This algorithm could not exactly mimic the approximation in (8) because the partial predictive likelihood at each time moment was not exact during the search [18]. Alternatively, a direct approximation of LRBPC is to formulate the predictive pdf of a whole utterance by individually computing the predictive pdf of each frame $\mathbf{o}_t$ associated with the HMM pdf of state $i$ and mixture component $k$ [5], [18]

$$\tilde{p}(\mathbf{o}_t \mid s_t = i, k_t = k) = \int N(\mathbf{o}_t;\, \mathbf{W}_c\boldsymbol{\xi}_{ik},\, \boldsymbol{\Sigma}_{ik})\, p(\mathbf{W}_c \mid \varphi)\, d\mathbf{W}_c. \tag{9}$$

This frame-based predictive pdf is referred to as a new observation pdf and put into the standard Viterbi decoder to search the optimal sequences and obtain the recognized word string $\hat{\mathcal{W}}$. The predictive pdf of $O$ corresponding to $\mathcal{W}$ can be written as
$$\tilde{p}(O \mid \mathcal{W}) \approx \max_{\mathbf{s},\mathbf{k}}\, \pi_{s_1}\, \omega_{s_1 k_1}\, \tilde{p}(\mathbf{o}_1 \mid s_1, k_1) \prod_{t=2}^{T} a_{s_{t-1}s_t}\, \omega_{s_t k_t}\, \tilde{p}(\mathbf{o}_t \mid s_t, k_t) \tag{10}$$

where $\pi_{s_1}$ is the initial state probability, $a_{s_{t-1}s_t}$ is the state transition probability, and $\omega_{s_t k_t}$ is the mixture coefficient of the HMMs. The LRBPC decision is accordingly accomplished for speech recognition. It becomes crucial to derive the frame-based predictive pdf taking the randomness of the regression matrix $\mathbf{W}_c$ into account.
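Since (10) only replaces the observation score inside an otherwise standard Viterbi recursion, a decoder sketch is straightforward. The following fragment (a hypothetical interface, log domain for numerical stability) assumes the frame-based predictive log-pdfs have already been computed per state, with mixtures folded in:

```python
import numpy as np

def viterbi_predictive(log_pred, log_pi, log_A):
    """Standard Viterbi decoding in the spirit of (10), with the usual
    observation pdf replaced by precomputed predictive log-pdfs.
    log_pred: (T, N) log p~(o_t | state); log_pi: (N,) initial log-probs;
    log_A: (N, N) transition log-probs (from -> to)."""
    T, N = log_pred.shape
    delta = log_pi + log_pred[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (N, N) partial-path scores
        psi[t] = np.argmax(scores, axis=0)       # best predecessor per state
        delta = scores[psi[t], np.arange(N)] + log_pred[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):                # backtrack the best path
        path.append(psi[t, path[-1]])
    return path[::-1]

# toy 2-state example
lp = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]))
print(viterbi_predictive(lp, np.log([0.5, 0.5]), np.log([[0.7, 0.3], [0.3, 0.7]])))
```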
Notably, if the regression matrix $\mathbf{W}_c$ were nearly deterministic with value $\hat{\mathbf{W}}_c$, the shape of the prior density would be very sharp at the center $\hat{\mathbf{W}}_c$. The frame-based predictive pdf then turns out to be the original observation pdf given $\hat{\mathbf{W}}_c$, i.e., $\tilde{p}(\mathbf{o}_t \mid s_t = i, k_t = k) \approx N(\mathbf{o}_t;\, \hat{\mathbf{W}}_c\boldsymbol{\xi}_{ik},\, \boldsymbol{\Sigma}_{ik})$, and LRBPC is therefore reduced to MLLR. Before deriving the frame-based predictive pdf, we first address two cases of the linear regression model and the specification of their prior densities.
B. Linear Regression Model and Prior Distribution Specification

In linear regression adaptation, the extended HMM mean vector $\boldsymbol{\xi}_{ik}$ is adapted by multiplying a regression matrix $\mathbf{W}_c$. The HMM covariance matrix $\boldsymbol{\Sigma}_{ik}$ is assumed to be unchanged. This model can also be viewed as an affine transformation, $\hat{\boldsymbol{\mu}}_{ik} = \mathbf{A}_c\boldsymbol{\mu}_{ik} + \mathbf{b}_c$, where the HMM mean is transformed by multiplying a scaling matrix $\mathbf{A}_c$ and adding a bias vector $\mathbf{b}_c$ [4], [10]. In general, the specification of a prior density for a matrix parameter is not as straightforward as that for a vector parameter. A simple alternative is to rewrite the matrix into an extended long vector. Let $\mathbf{w}_c$ denote the collection of all row vectors of the regression matrix $\mathbf{W}_c$. The uncertainty of the regression vector $\mathbf{w}_c$ can be modeled using a joint Gaussian distribution with mean vector $\boldsymbol{\mu}_{\mathbf{w}_c}$ and covariance matrix $\boldsymbol{\Sigma}_{\mathbf{w}_c}$, i.e., $p(\mathbf{w}_c) = N(\mathbf{w}_c;\, \boldsymbol{\mu}_{\mathbf{w}_c}, \boldsymbol{\Sigma}_{\mathbf{w}_c})$. We select this distribution from the conjugate prior family due to its mathematical tractability for LRBPC.

Basically, the general linear regression model can also be simplified to the case of single variable linear regression, where the adaptation between the various components of the HMM mean vector is independent [4], [15]. In this case, the scaling matrix is diagonal, $\mathbf{A}_c = \mathrm{diag}(a_{c1}, \ldots, a_{cd})$. The $l$th HMM mean component is adapted as $\hat{\mu}_{ikl} = a_{cl}\mu_{ikl} + b_{cl}$. We may generate an extended regression vector $\mathbf{w}_{cl} = [a_{cl}\;\; b_{cl}]^{T}$, whose parameter uncertainty can also be described by a joint Gaussian distribution. Correspondingly, the derivation of the frame-based predictive pdf for single variable linear regression is analogous to that for general linear regression. In fact, in the case of single variable linear regression and a diagonal HMM covariance matrix $\boldsymbol{\Sigma}_{ik} = \mathrm{diag}(\sigma^{2}_{ik1}, \ldots, \sigma^{2}_{ikd})$, the multivariate frame-based predictive pdf in (9) can be fulfilled by individually computing the univariate frame-based predictive pdf for each observation component $o_{tl}$

$$\tilde{p}(o_{tl} \mid s_t = i, k_t = k) = \iint N(o_{tl};\, a_{cl}\mu_{ikl} + b_{cl},\, \sigma^{2}_{ikl})\, p(a_{cl}, b_{cl})\, da_{cl}\, db_{cl}. \tag{11}$$

Here, (11) involves a double integral. The regression vector $\mathbf{w}_{cl}$ is reduced to $[a_{cl}\;\; b_{cl}]^{T}$. In [4], the regression parameters $a_{cl}$ and $b_{cl}$ were shown to be dependent in essence, so separate distribution modeling of the regression parameters is unreasonable. Hence, it is still a good choice to adopt a joint Gaussian distribution to characterize the uncertainty of the regression vector $\mathbf{w}_{cl}$. Its prior density is specified by

$$p(a_{cl}, b_{cl}) = N\!\left(\begin{bmatrix} a_{cl}\\ b_{cl}\end{bmatrix};\; \begin{bmatrix} \mu_{a_{cl}}\\ \mu_{b_{cl}}\end{bmatrix},\; \begin{bmatrix}\sigma^{2}_{a_{cl}} & \rho_{cl}\\ \rho_{cl} & \sigma^{2}_{b_{cl}}\end{bmatrix}\right) \tag{12}$$

where the marginal pdfs of $a_{cl}$ and $b_{cl}$ are the Gaussian distributions $N(a_{cl};\, \mu_{a_{cl}}, \sigma^{2}_{a_{cl}})$ and $N(b_{cl};\, \mu_{b_{cl}}, \sigma^{2}_{b_{cl}})$, and the cross covariance is $\rho_{cl} = E[(a_{cl}-\mu_{a_{cl}})(b_{cl}-\mu_{b_{cl}})]$. This joint Gaussian distribution can also be expressed by $p(\mathbf{w}_{cl}) = N(\mathbf{w}_{cl};\, \boldsymbol{\mu}_{cl}, \boldsymbol{\Sigma}_{cl})$ with $\boldsymbol{\mu}_{cl} = [\mu_{a_{cl}}\;\; \mu_{b_{cl}}]^{T}$ and $\boldsymbol{\Sigma}_{cl}$ the $2\times 2$ covariance matrix in (12).
III. FRAME-BASED PREDICTIVE DISTRIBUTION USING EVOLUTIONARY HYPERPARAMETERS

A. Predictive PDF for General Linear Regression

For the case of general linear regression, the frame-based predictive pdf involves resolving the integral

$$\tilde{p}(\mathbf{o}_t \mid s_t = i, k_t = k) = \int N(\mathbf{o}_t;\, \mathbf{W}_c\boldsymbol{\xi}_{ik},\, \boldsymbol{\Sigma}_{ik})\, N(\mathbf{w}_c;\, \boldsymbol{\mu}_{\mathbf{w}_c},\, \boldsymbol{\Sigma}_{\mathbf{w}_c})\, d\mathbf{w}_c. \tag{13}$$

The adaptation of the HMM mean vector is rewritten as $\mathbf{W}_c\boldsymbol{\xi}_{ik} = \tilde{\boldsymbol{\Xi}}_{ik}\mathbf{w}_c$ with the extended mean matrix

$$\tilde{\boldsymbol{\Xi}}_{ik} = \mathbf{I}_d \otimes \boldsymbol{\xi}_{ik}^{T} = \begin{bmatrix} \boldsymbol{\xi}_{ik}^{T} & & \\ & \ddots & \\ & & \boldsymbol{\xi}_{ik}^{T} \end{bmatrix} \tag{14}$$

which is a $d \times d(d+1)$ block-diagonal matrix. As shown in Appendix A, the multivariate frame-based predictive pdf can be derived as the closed-form equation

$$\tilde{p}(\mathbf{o}_t \mid s_t = i, k_t = k) = N(\mathbf{o}_t;\, \tilde{\boldsymbol{\Xi}}_{ik}\boldsymbol{\mu}_{\mathbf{w}_c},\, \tilde{\boldsymbol{\Sigma}}_{ik}) \tag{15}$$

where

$$\mathbf{A}_{ik} = \boldsymbol{\Sigma}_{\mathbf{w}_c}^{-1} + \tilde{\boldsymbol{\Xi}}_{ik}^{T}\boldsymbol{\Sigma}_{ik}^{-1}\tilde{\boldsymbol{\Xi}}_{ik} \tag{16}$$

$$\bar{\mathbf{w}}_{c} = \mathbf{A}_{ik}^{-1}\big(\boldsymbol{\Sigma}_{\mathbf{w}_c}^{-1}\boldsymbol{\mu}_{\mathbf{w}_c} + \tilde{\boldsymbol{\Xi}}_{ik}^{T}\boldsymbol{\Sigma}_{ik}^{-1}\mathbf{o}_t\big) \tag{17}$$

$$\tilde{\boldsymbol{\Sigma}}_{ik} = \boldsymbol{\Sigma}_{ik} + \tilde{\boldsymbol{\Xi}}_{ik}\boldsymbol{\Sigma}_{\mathbf{w}_c}\tilde{\boldsymbol{\Xi}}_{ik}^{T}. \tag{18}$$
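A small numeric sketch of (14) and (15) follows, assuming the Kronecker (block-diagonal) form of the extended mean matrix and a row-stacked regression vector; the names and toy values are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predictive_pdf_general(o_t, mu, cov, mu_w, Sigma_w):
    """Frame-based predictive pdf of (15) for general linear regression:
    N(o_t; Xi mu_w, cov + Xi Sigma_w Xi^T), with Xi = I_d kron xi^T as in (14)."""
    d = mu.shape[0]
    xi = np.concatenate(([1.0], mu))          # extended mean vector
    Xi = np.kron(np.eye(d), xi[None, :])      # d x d(d+1) block-diagonal matrix
    mean = Xi @ mu_w                          # predictive mean
    Sigma = cov + Xi @ Sigma_w @ Xi.T         # prior-inflated covariance
    return multivariate_normal.pdf(o_t, mean=mean, cov=Sigma)

# toy usage: prior centered at the identity transform with small uncertainty
d = 2
mu = np.array([1.0, -1.0]); cov = np.eye(d)
W0 = np.hstack([np.zeros((d, 1)), np.eye(d)])  # identity regression matrix
mu_w = W0.reshape(-1)                          # stacked rows of W0
Sigma_w = 0.01 * np.eye(d * (d + 1))
print(predictive_pdf_general(np.array([1.0, -1.0]), mu, cov, mu_w, Sigma_w))
```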
B. Predictive PDF for Single Variable Linear Regression

When the single variable linear regression is considered, we aim at resolving a double integral problem with respect to the regression parameters $a_{cl}$ and $b_{cl}$. Under the specification of the joint Gaussian distribution in (12), we fortunately derive the univariate frame-based predictive pdf as a Gaussian distribution of $o_{tl}$ with the new mean and the new variance given by

$$\tilde{\mu}_{ikl} = \mu_{a_{cl}}\,\mu_{ikl} + \mu_{b_{cl}} \tag{19}$$

$$\tilde{\sigma}^{2}_{ikl} = \sigma^{2}_{ikl} + \sigma^{2}_{a_{cl}}\mu^{2}_{ikl} + 2\rho_{cl}\,\mu_{ikl} + \sigma^{2}_{b_{cl}}. \tag{20}$$

We can see that the new mean $\tilde{\mu}_{ikl}$ is an affine function of the HMM mean $\mu_{ikl}$ with the hyperparameters $\mu_{a_{cl}}$ and $\mu_{b_{cl}}$ as the transformation factors. The detailed formulation can be found in Appendix A. Because of such a simple and attractive formula, the implementation of single variable linear regression is much easier than that of general linear regression. Also, the computation cost of LRBPC using single variable linear regression is much cheaper than that using general linear regression.

C. Evolutionary Hyperparameters

In fact, the use of a conjugate prior density for modeling the uncertainty of the regression parameters is not only feasible for achieving a closed-form frame-based predictive pdf but also appropriate for building evolutionary hyperparameters [5], [17]. The LRBPC decision is accordingly equipped with the desirable hyperparameters to deal with the ill-posed reconstruction problem. The evolutionary hyperparameters can be estimated as follows. When we are given some adaptation/test data $O$, the Bayesian learning of the regression parameters or their hyperparameters can be achieved through finding the posterior pdf

$$p(\mathbf{W} \mid O, \Lambda) \approx \frac{p(O, \mathbf{s}^{*}, \mathbf{k}^{*} \mid \mathbf{W}, \Lambda)\, p(\mathbf{W} \mid \varphi)}{\int p(O, \mathbf{s}^{*}, \mathbf{k}^{*} \mid \mathbf{W}, \Lambda)\, p(\mathbf{W} \mid \varphi)\, d\mathbf{W}}. \tag{21}$$

The optimal sequences $(\mathbf{s}^{*}, \mathbf{k}^{*})$ are decoded to obtain the approximated posterior pdf. In this study, the regression vectors $\mathbf{w}_c$ of general linear regression and $\mathbf{w}_{cl}$ of single variable linear regression are jointly Gaussian. The pooled posterior pdfs can also be formulated as joint Gaussian distributions with the updated hyperparameters

$$\boldsymbol{\mu}_{\mathbf{w}_c}' = \boldsymbol{\Sigma}_{\mathbf{w}_c}'\Big(\boldsymbol{\Sigma}_{\mathbf{w}_c}^{-1}\boldsymbol{\mu}_{\mathbf{w}_c} + \sum_{t}\sum_{(i,k)\in c}\gamma_{ik}(t)\,\tilde{\boldsymbol{\Xi}}_{ik}^{T}\boldsymbol{\Sigma}_{ik}^{-1}\mathbf{o}_t\Big) \tag{22}$$

$$\boldsymbol{\Sigma}_{\mathbf{w}_c}' = \Big(\boldsymbol{\Sigma}_{\mathbf{w}_c}^{-1} + \sum_{t}\sum_{(i,k)\in c}\gamma_{ik}(t)\,\tilde{\boldsymbol{\Xi}}_{ik}^{T}\boldsymbol{\Sigma}_{ik}^{-1}\tilde{\boldsymbol{\Xi}}_{ik}\Big)^{-1} \tag{23}$$

$$\boldsymbol{\mu}_{cl}' = \boldsymbol{\Sigma}_{cl}'\Big(\boldsymbol{\Sigma}_{cl}^{-1}\boldsymbol{\mu}_{cl} + \sum_{t}\sum_{(i,k)\in c}\frac{\gamma_{ik}(t)}{\sigma^{2}_{ikl}}\begin{bmatrix}\mu_{ikl}\\ 1\end{bmatrix} o_{tl}\Big) \tag{24}$$

$$\boldsymbol{\Sigma}_{cl}' = \Big(\boldsymbol{\Sigma}_{cl}^{-1} + \sum_{t}\sum_{(i,k)\in c}\frac{\gamma_{ik}(t)}{\sigma^{2}_{ikl}}\begin{bmatrix}\mu^{2}_{ikl} & \mu_{ikl}\\ \mu_{ikl} & 1\end{bmatrix}\Big)^{-1} \tag{25}$$

where $\gamma_{ik}(t)$ denotes the occupation of state $i$ and mixture component $k$ at frame $t$ given the decoded sequences $(\mathbf{s}^{*}, \mathbf{k}^{*})$. Such a reproducible prior/posterior pair provides a reasonable mechanism for evolving the hyperparameters.
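The updates (24)–(25) are the usual conjugate Gaussian posterior for a two-parameter linear model. A sketch under the stated assumptions (a fixed Viterbi alignment supplying per-frame HMM means and variances for one cluster and dimension; all names hypothetical):

```python
import numpy as np

def update_hyperparameters(mu0, Sigma0, obs, hmm_means, hmm_vars):
    """Posterior hyperparameters of the joint Gaussian over w = [a, b]^T in
    the spirit of (24)-(25): the precision accumulates [mu,1][mu,1]^T / var
    and the mean solves the usual normal-likelihood conjugate combination.
    obs, hmm_means, hmm_vars: (T,) aligned frame data for one cluster/dim."""
    P = np.linalg.inv(Sigma0)                 # prior precision
    v = P @ mu0
    for o, m, s2 in zip(obs, hmm_means, hmm_vars):
        z = np.array([m, 1.0])                # regressor [mu_ikl, 1]
        P = P + np.outer(z, z) / s2
        v = v + z * o / s2
    Sigma_new = np.linalg.inv(P)
    return Sigma_new @ v, Sigma_new

# toy usage: data generated with a = 1.2, b = 0.3
rng = np.random.default_rng(1)
m = rng.normal(size=200)
o = 1.2 * m + 0.3 + 0.1 * rng.normal(size=200)
mu_new, Sigma_new = update_hyperparameters(
    np.array([1.0, 0.0]), np.eye(2), o, m, np.full(200, 0.01))
print(mu_new)   # posterior mean close to [1.2, 0.3]
```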
The hyperparameters can be continuously updated according to (22)–(25) as soon as the latest environment-specific data are enrolled. In Appendix B, the derivation of the evolutionary hyperparameters for the case of single variable linear regression is formulated. Notably, the modes of the posterior pdfs, i.e., (22) and (24), are exactly equivalent to the MAP regression parameters of MAPLR [3], [4], [7].

IV. EXPERIMENTS

Two speaker adaptation tasks are conducted to examine the performance of the proposed LRBPC algorithm. The first adaptation task is performed on a connected Chinese digit recognition system [5]. The second adaptation task is to recognize 408 isolated Mandarin syllables [6].

A. Experimental Setup

For the first adaptation task, we collected a training database consisting of 1000 close-talking utterances from 50 males and 50 females. It was used to estimate the SI HMMs as well as the initial hyperparameters used for MAPLR and LRBPC. The test database CARNAV98 contained the utterances of five males and five females recorded in two cars: a TOYOTA COROLLA 1.8 (medium class) and a YULON SENTRA 1.6 (low-grade class). These utterances were collected using a high-quality MD Walkman of type MZ-R55 via a hands-free far-talking SONY ECM-717 microphone. Three materials for the standby, downtown, and freeway conditions, with averaged car speeds of 0 km/h, 50 km/h, and 90 km/h, respectively, were recorded. During recording, we kept the engine on, the air-conditioner on, the music off, and the windows rolled up. The numbers of test utterances were 50, 150, and 250, and the corresponding numbers of digits were 324, 964, and 1593 for the standby, downtown, and freeway driving conditions, respectively. The word error rate (WER) was averaged over the ten test speakers. The test speakers were excluded from the training speakers. Each speaker, at each driving condition, provided an extra three utterances for either HMM adaptation or hyperparameter estimation. The supervision of the adaptation utterances was provided. A digit string contained three to eleven random digits. Each Chinese digit was modeled using a gender-dependent continuous-density HMM. Each HMM state had at
most eight mixture components. During the recognizer search, we limited the maximum number of digits in an utterance [2]. Our baseline system (plug-in MAP decision using SI HMMs) reported WERs of 20.9%, 49.3%, and 53.6% for the standby, downtown, and freeway driving conditions, respectively.

For the second adaptation task, we prepared a training database containing 5045 phonetically balanced Mandarin words spoken by 51 males and 50 females. Each Mandarin word had two to four Mandarin syllables. This database covered all acoustics of the 408 Mandarin syllables. The trained SI HMMs were used to recognize three repetitions of the 408 isolated Mandarin syllables uttered by two males and two females. Before recognition, the four speakers enrolled the system by individually providing one repetition of the syllables. Without adaptation, the baseline system had an averaged top-five recognition rate of 74.5%.

All utterances were sampled at 8 kHz with 16-bit resolution. The feature vector consisted of 12 LPC-derived cepstral coefficients, 12 delta cepstral coefficients, one delta log energy, and one delta delta log energy, as in [14]. For a comparative study, we carry out the plug-in MAP decision with MLLR [15] and MAPLR [3], [7] and the BPC decision with TBPC [5] and LRBPC. Two cases of linear regression are realized for LRBPC. The cepstral mean subtraction (CMS) [11] is also included for evaluation.

B. Initial Hyperparameters and Implementation Issues

In this study, the SI training database is applied to empirically estimate a set of initial hyperparameters for the use of MAPLR and LRBPC. The estimated hyperparameters should be general enough to represent the statistics/uncertainty of the linear regression parameters under various environments. Let $\{O^{(1)}, \ldots, O^{(S)}\}$ denote the training data sets from $S$ speakers. We first employ these data and the SI HMMs in MLLR to calculate the ML regression parameters. In the case of general linear regression, the regression parameters $\{\mathbf{w}_c^{(1)}, \ldots, \mathbf{w}_c^{(S)}\}$ corresponding to the individual training speakers are obtained. The initial hyperparameters $\boldsymbol{\mu}_{\mathbf{w}_c}$ and $\boldsymbol{\Sigma}_{\mathbf{w}_c}$ are estimated by respectively taking the ensemble mean and the ensemble covariance over these parameters, i.e., $\boldsymbol{\mu}_{\mathbf{w}_c} = (1/S)\sum_{s=1}^{S}\mathbf{w}_c^{(s)}$ and $\boldsymbol{\Sigma}_{\mathbf{w}_c} = (1/S)\sum_{s=1}^{S}(\mathbf{w}_c^{(s)} - \boldsymbol{\mu}_{\mathbf{w}_c})(\mathbf{w}_c^{(s)} - \boldsymbol{\mu}_{\mathbf{w}_c})^{T}$. Similarly, the hyperparameters of single variable linear regression for cluster $c$ and dimension $l$, $\boldsymbol{\mu}_{cl}$ and $\boldsymbol{\Sigma}_{cl}$, can be extracted by respectively taking the ensemble mean and covariance over the ML regression parameters $\{(a_{cl}^{(s)}, b_{cl}^{(s)})\}_{s=1}^{S}$, which are obtained via MLLR using the utterances of the training speakers.

In our experiments, the HMM covariance matrices $\boldsymbol{\Sigma}_{ik}$ were diagonal. Only one EM iteration was performed in MLLR and MAPLR. To avoid data sparseness, we specified a single regression/transformation cluster in the first adaptation task. The number of regression clusters was properly increased for the larger amount of adaptation data in the second adaptation task. The hyperparameters $\boldsymbol{\Sigma}_{\mathbf{w}_c}$ and $\boldsymbol{\Sigma}_{cl}$ were simplified to diagonal matrices. The initial hyperparameters of TBPC were the same as those in [5]. In TBPC, only the uncertainty of the HMM mean vector was considered. Although the formulation in (22)–(25) could be sequentially adopted to refresh the hyperparameters for the LRBPC decision, herein we just used the adaptation utterances to estimate the hyperparameters in one epoch. This process was done before the LRBPC decision for the utterances of a specific test speaker and driving condition. Consistently, in MLLR and MAPLR, we applied the same adaptation utterances for batch and supervised speaker adaptation. The plug-in MAP decision was employed in speech recognition using the adapted HMMs. The case of single variable linear regression was considered in MLLR and MAPLR.
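The empirical prior construction above is plain moment matching over the per-speaker ML regression parameters. A brief sketch, assuming the per-speaker MLLR estimates are already available as stacked row vectors (hypothetical helper name):

```python
import numpy as np

def initial_hyperparameters(w_ml_per_speaker):
    """Ensemble mean and covariance over S per-speaker ML regression
    vectors, each being the stacked rows of that speaker's MLLR matrix."""
    W = np.asarray(w_ml_per_speaker)               # (S, d(d+1))
    mu_w = W.mean(axis=0)                          # ensemble mean
    Sigma_w = np.cov(W, rowvar=False, bias=True)   # ensemble covariance
    return mu_w, Sigma_w

# toy usage with S = 100 hypothetical speakers and d = 2
rng = np.random.default_rng(2)
mu_w, Sigma_w = initial_hyperparameters(rng.normal(size=(100, 6)))
print(mu_w.shape, Sigma_w.shape)                   # (6,) (6, 6)
```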
C. General Linear Regression Versus Single Variable Linear Regression

First of all, we would like to evaluate the effect of the linear regression model on the LRBPC decision. Here, the frame-based predictive pdfs derived in (15) and (33) are calculated to fulfill the LRBPC decision in the cases of general linear regression and single variable linear regression, respectively. Commonly, the general linear regression model is feasible for handling the transformation across the various feature dimensions, while single variable linear regression views the transformation of each dimension independently. As plotted in Fig. 1, the WERs of the baseline system and the two LRBPC realizations are determined; the first adaptation task is evaluated. We find that the two LRBPC realizations outperform the baseline results a great deal under the different driving conditions. However, the case of general linear regression performs slightly worse than that of single variable linear regression. The reason is partially due to the amount of adaptation data. Namely, we could not estimate the hyperparameters $\boldsymbol{\mu}_{\mathbf{w}_c}$ and $\boldsymbol{\Sigma}_{\mathbf{w}_c}$ as reliably as the hyperparameters $\boldsymbol{\mu}_{cl}$ and $\boldsymbol{\Sigma}_{cl}$ when the adaptation data are as few as three utterances. To investigate the recognition performance versus the amount of adaptation data, we also carry out the two cases of LRBPC in the second adaptation task. Adaptation data of 100, 200, 300, and 408 Mandarin syllables are examined. As shown in Fig. 2, the LRBPC with general linear regression is better than that with single variable linear regression. One reason is that the parameters of general linear regression are reliably calculated given sufficient adaptation data. Hereafter, we only list the recognition rates of LRBPC in the case of single variable linear regression.

D. Plug-in MAP Decision With MLLR Versus BPC Decision With LRBPC

Also, it is important to realize how robust the BPC decision is compared with the plug-in MAP decision. In Fig. 3, we compare the WERs of MLLR and LRBPC in the first adaptation task. No matter what noise conditions are specified, the WERs of LRBPC are significantly lower than those of MLLR. For example, in the standby noise condition, the LRBPC achieves a WER of 9.8%, which is much lower than the 12.9% obtained using MLLR. Such a promising result reflects the superiority of the BPC decision over the plug-in MAP decision. Correspondingly, it reveals that the regression mapping between the adaptation data and the HMMs does involve substantial uncertainty, which the BPC decision can deal with. In the case of single variable linear regression, the total number of parameters per cluster and dimension using LRBPC (two for the mean hyperparameters and two for the variance hyperparameters) is doubled so as to model the parameter randomness; using MLLR, the number of parameters is two (the regression parameters). In the second adaptation task, we still see the improvement of LRBPC over MLLR for various amounts of adaptation data. In the case of 100 adaptation syllables, the recognition rate is improved from 86.7%
Fig. 1. Comparison of word error rates of baseline system, LRBPC with single variable linear regression and LRBPC with general linear regression under various driving conditions. The first adaptation task is evaluated.
Fig. 2. Comparison of recognition rates of MLLR, LRBPC with single variable linear regression and LRBPC with general linear regression for various numbers of adaptation data. The second adaptation task is evaluated.
using MLLR to 90.5% using LRBPC. The recognition performance of both MLLR and LRBPC improves continuously as the amount of adaptation data increases.

E. Comparison of Recognition Performance and Speed for Various Methods

To conduct a comprehensive comparison, we further carry out CMS, MAPLR, and TBPC and show their WERs in Fig. 4. The first adaptation task is evaluated. The performance of CMS is better than that of the baseline but worse than that of the linear regression methods. The linear regression model is illustrated to be effective for robust speech recognition. Also, MAPLR attains better
performance than MLLR because the initial hyperparameters estimated from the training data provide helpful prior knowledge for linear regression adaptation. Comparing MAPLR and LRBPC, the LRBPC consistently obtains lower WERs. This again shows the advantage of the BPC decision over the plug-in MAP decision. Notably, MAPLR needs the hyperparameters to provide prior knowledge as well as the regression parameters to be plugged into the MAP decoder. Its total number of parameters is six (two for the mean hyperparameters, two for the variance hyperparameters, and two for the regression parameters), which is the largest among the different methods. Regarding the issue of the transformation function, we find that LRBPC is superior to TBPC. This is
Fig. 3. Comparison of word error rates of baseline system, MLLR and LRBPC under various driving conditions. The first adaptation task is evaluated.
Fig. 4. Comparison of word error rates of baseline system, CMS, MLLR, MAPLR, TBPC and LRBPC under various driving conditions. The first adaptation task is evaluated.
because LRBPC adopts the affine transformation, which is more flexible for covering the environmental mismatch than the bias transformation used in the TBPC decision. For the case of the standby condition, the WERs of the baseline, CMS, MLLR, MAPLR, TBPC, and LRBPC are listed in Table I. The recognition speeds of the various methods are also compared in this table. Here, the averaged recognition speeds are measured in seconds per utterance by simulating the algorithms on a PENTIUM III 450 PC. We neglect the computation costs for estimating the regression parameters in MLLR and MAPLR and the hyperparameters in TBPC and LRBPC. It can be seen that the BPC decision spends
extra computation overhead to calculate the predictive pdf compared to the standard observation pdf calculated in the plug-in MAP decision of (6). Since the numerical complexity of the frame-based predictive pdf in LRBPC is higher than that in TBPC, the recognition efficiency of LRBPC is inferior to that of TBPC. We also find that the extra cost for implementing LRBPC in the case of general linear regression is considerable. In the standby condition, the averaged signal-to-noise ratio over all test utterances is around 8 dB. The LRBPC can achieve the best WER of 9.8%, which is substantially better than the 20.9% of the baseline system. This recognition result is desirable for the connected Chinese digit task.
TABLE I
COMPARISON OF WORD ERROR RATES (%) AND RECOGNITION SPEEDS (SECONDS PER UTTERANCE) OF BASELINE SYSTEM, CMS, MLLR, MAPLR, TBPC, LRBPC WITH SINGLE VARIABLE LINEAR REGRESSION AND LRBPC WITH GENERAL LINEAR REGRESSION FOR THE CASE OF STANDBY CONDITION. THE FIRST ADAPTATION TASK IS EVALUATED
V. CONCLUSION

This paper has presented a novel LRBPC decision rule combining the advantages of linear regression adaptation and the BPC decision for robust speech recognition. Linear regression adaptation is effective in mapping trained HMMs to test data, while the BPC decision aims at resolving the inevitable ill-posed environmental mapping. We established a framework of the LRBPC decision in the context of HMMs. The optimal word sequence was recognized according to the predictive distribution of a whole utterance, where a prior density of the regression parameters was incorporated to shape the uncertainty of the regression mapping. This predictive distribution could be approximated through a Viterbi decoder where the observation pdf was replaced by a frame-based predictive pdf. By properly defining the regression uncertainty using a joint Gaussian density, we precisely derived a new Gaussian density to serve as the predictive pdf. Because the joint Gaussian density belongs to the conjugate prior family, an evolutionary mechanism of the hyperparameters was available. In this study, the cases of general linear regression and single variable linear regression were considered to develop the LRBPC decision. The new decision should be superior to the MLLR adaptation, where the estimated regression parameter is viewed as a point estimate and plugged into the conventional MAP decision. After a series of speech recognition experiments on Chinese connected digits and isolated Mandarin syllables, we found that the LRBPC decision improved the recognition performance significantly. Given sufficient adaptation data, the case of general linear regression performed better than that of single variable linear regression. The LRBPC decision achieved better performance than MLLR and MAPLR due to the robustness of the BPC decision rule. Also, the LRBPC decision outperformed the TBPC decision because of the superiority of the affine transformation in LRBPC over the bias transformation in TBPC. The LRBPC decision spent extra cost to compute the sophisticated predictive pdf compared to conventional methods computing the standard observation pdf. In the future, we will apply the LRBPC decision to large vocabulary continuous speech recognition (LVCSR) systems. For practical applications, the hyperparameters for the LRBPC decision will be learned in an unsupervised manner.
APPENDIX A
DERIVATION OF PREDICTIVE PDFS FOR TWO CASES OF LINEAR REGRESSION
In the case of general linear regression, we intend to work out the integral in (13). The sum of the two quadratic terms in the exponent can be rewritten, by completing the square in $\mathbf{w}_c$, as

$$(\mathbf{o}_t - \tilde{\boldsymbol{\Xi}}_{ik}\mathbf{w}_c)^{T}\boldsymbol{\Sigma}_{ik}^{-1}(\mathbf{o}_t - \tilde{\boldsymbol{\Xi}}_{ik}\mathbf{w}_c) + (\mathbf{w}_c - \boldsymbol{\mu}_{\mathbf{w}_c})^{T}\boldsymbol{\Sigma}_{\mathbf{w}_c}^{-1}(\mathbf{w}_c - \boldsymbol{\mu}_{\mathbf{w}_c}) = (\mathbf{w}_c - \bar{\mathbf{w}}_c)^{T}\mathbf{A}_{ik}(\mathbf{w}_c - \bar{\mathbf{w}}_c) + (\mathbf{o}_t - \tilde{\boldsymbol{\Xi}}_{ik}\boldsymbol{\mu}_{\mathbf{w}_c})^{T}\tilde{\boldsymbol{\Sigma}}_{ik}^{-1}(\mathbf{o}_t - \tilde{\boldsymbol{\Xi}}_{ik}\boldsymbol{\mu}_{\mathbf{w}_c}) \tag{26}$$

and the following determinant identity can be verified

$$|\tilde{\boldsymbol{\Sigma}}_{ik}| = |\boldsymbol{\Sigma}_{ik}|\,|\boldsymbol{\Sigma}_{\mathbf{w}_c}|\,|\mathbf{A}_{ik}| \tag{27}$$

with the variables $\mathbf{A}_{ik}$, $\bar{\mathbf{w}}_c$, and $\tilde{\boldsymbol{\Sigma}}_{ik}$ defined in (16)–(18). We may then derive the frame-based predictive pdf by

$$\tilde{p}(\mathbf{o}_t \mid s_t = i, k_t = k) = N(\mathbf{o}_t;\, \tilde{\boldsymbol{\Xi}}_{ik}\boldsymbol{\mu}_{\mathbf{w}_c},\, \tilde{\boldsymbol{\Sigma}}_{ik}) \int N(\mathbf{w}_c;\, \bar{\mathbf{w}}_c,\, \mathbf{A}_{ik}^{-1})\, d\mathbf{w}_c. \tag{28}$$

In (28), the term located in the integral is a standard Gaussian function of $\mathbf{w}_c$, which integrates to unity. The frame-based predictive pdf is finally derived as shown in (15). Notably, the covariance matrices $\boldsymbol{\Sigma}_{ik}$ and $\boldsymbol{\Sigma}_{\mathbf{w}_c}$ could be full matrices in the derived predictive pdf.
On the other hand, the single variable linear regression involves the calculation of the double integral in (11). First of all, we find the conditional pdf in the form of

$$p(a_{cl} \mid b_{cl}) = \frac{p(a_{cl}, b_{cl})}{p(b_{cl})}. \tag{29}$$

It is a Gaussian function of $a_{cl}$ with conditional mean

$$E[a_{cl} \mid b_{cl}] = \mu_{a_{cl}} + \frac{\rho_{cl}}{\sigma^{2}_{b_{cl}}}\,(b_{cl} - \mu_{b_{cl}})$$

and conditional variance

$$\mathrm{var}[a_{cl} \mid b_{cl}] = \sigma^{2}_{a_{cl}} - \frac{\rho^{2}_{cl}}{\sigma^{2}_{b_{cl}}}.$$

This result can be referred to [20]. Then, the inner integral in (11) yields

$$\int N(o_{tl};\, a_{cl}\mu_{ikl} + b_{cl},\, \sigma^{2}_{ikl})\, p(a_{cl} \mid b_{cl})\, da_{cl} = N\big(o_{tl};\, E[a_{cl} \mid b_{cl}]\,\mu_{ikl} + b_{cl},\; \sigma^{2}_{ikl} + \mathrm{var}[a_{cl} \mid b_{cl}]\,\mu^{2}_{ikl}\big). \tag{30}$$

By using the algebra

$$\frac{(x-u)^{2}}{A} + \frac{(x-v)^{2}}{B} = \frac{A+B}{AB}\Big(x - \frac{uB + vA}{A+B}\Big)^{2} + \frac{(u-v)^{2}}{A+B} \tag{31}$$

for arbitrary real numbers $x$, $u$, $v$ and positive $A$, $B$, we may rearrange the product of (30), which is affine in $b_{cl}$, and the marginal pdf $p(b_{cl})$ into a Gaussian function of $b_{cl}$ times a residual Gaussian function of $o_{tl}$

$$N\big(o_{tl};\, E[a_{cl} \mid b_{cl}]\,\mu_{ikl} + b_{cl},\; \sigma^{2}_{ikl} + \mathrm{var}[a_{cl} \mid b_{cl}]\,\mu^{2}_{ikl}\big)\, p(b_{cl}). \tag{32}$$

Taking the outer integral of (32) over $b_{cl}$, the Gaussian function of $b_{cl}$ is generated and integrated to unity, and we finally derive the univariate frame-based predictive pdf

$$\tilde{p}(o_{tl} \mid s_t = i, k_t = k) = N(o_{tl};\, \tilde{\mu}_{ikl},\, \tilde{\sigma}^{2}_{ikl}). \tag{33}$$

Herein, the outer integral is resolved using the same trick as the inner integral. Notably, it is interesting that the derived predictive pdf comes out to be a Gaussian distribution of $o_{tl}$ with the mean and the variance shown in (19) and (20), respectively.

APPENDIX B
DERIVATION OF EVOLUTIONARY HYPERPARAMETERS FOR SINGLE VARIABLE LINEAR REGRESSION

In the case of single variable linear regression, the logarithm of the posterior pdf $p(a_{cl}, b_{cl} \mid O)$ for cluster $c$ and dimension $l$ is expressed by

$$\log p(a_{cl}, b_{cl} \mid O) = -\frac{1}{2}\sum_{t}\sum_{(i,k)\in c}\frac{\gamma_{ik}(t)}{\sigma^{2}_{ikl}}\big(o_{tl} - a_{cl}\mu_{ikl} - b_{cl}\big)^{2} - \frac{1}{2}(\mathbf{w}_{cl} - \boldsymbol{\mu}_{cl})^{T}\boldsymbol{\Sigma}_{cl}^{-1}(\mathbf{w}_{cl} - \boldsymbol{\mu}_{cl}) + \mathrm{const} \tag{34}$$

with $\gamma_{ik}(t)$ and $\mathbf{w}_{cl} = [a_{cl}\;\; b_{cl}]^{T}$ defined in Section III-C. Also, we can verify the identity

$$\sum_{t}\sum_{(i,k)\in c}\frac{\gamma_{ik}(t)}{\sigma^{2}_{ikl}}\big(o_{tl} - a_{cl}\mu_{ikl} - b_{cl}\big)^{2} = (\mathbf{w}_{cl} - \hat{\mathbf{w}}_{cl})^{T}\mathbf{S}_{cl}(\mathbf{w}_{cl} - \hat{\mathbf{w}}_{cl}) + \mathrm{const} \tag{35}$$

where

$$\mathbf{S}_{cl} = \sum_{t}\sum_{(i,k)\in c}\frac{\gamma_{ik}(t)}{\sigma^{2}_{ikl}}\begin{bmatrix}\mu^{2}_{ikl} & \mu_{ikl}\\ \mu_{ikl} & 1\end{bmatrix}, \qquad \hat{\mathbf{w}}_{cl} = \mathbf{S}_{cl}^{-1}\sum_{t}\sum_{(i,k)\in c}\frac{\gamma_{ik}(t)}{\sigma^{2}_{ikl}}\begin{bmatrix}\mu_{ikl}\\ 1\end{bmatrix}o_{tl}. \tag{36}$$

By discarding the terms independent of the parameters $\mathbf{w}_{cl}$, the posterior pdf in (34) is formulated by [8] as proportional to

$$p(a_{cl}, b_{cl} \mid O) \propto \exp\Big\{-\frac{1}{2}(\mathbf{w}_{cl} - \boldsymbol{\mu}_{cl}')^{T}(\boldsymbol{\Sigma}_{cl}')^{-1}(\mathbf{w}_{cl} - \boldsymbol{\mu}_{cl}')\Big\}. \tag{37}$$

We can see that the posterior pdf is expressed as a joint Gaussian distribution of $(a_{cl}, b_{cl})$ with the $2\times 1$ mean vector and the $2\times 2$ covariance matrix respectively given by (24) and (25). The existing hyperparameters are evolved into the new hyperparameters to track the latest parameter uncertainty contained in the data $O$. Basically, the derivation of the evolutionary hyperparameters in (22) and (23) for general linear regression is analogous to that for single variable linear regression.
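As a sanity check on the closed form, the mean and variance of (19)–(20) can be verified by Monte Carlo simulation of the generative model $o = a\mu + b + \text{noise}$ with $(a, b)$ drawn from the joint Gaussian prior of (12); the hyperparameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, s2 = 2.0, 0.5                     # HMM mean and variance (one dimension)
mu_a, mu_b = 1.1, -0.2                # prior means of a and b
s2_a, s2_b, rho = 0.04, 0.09, 0.01    # prior variances and cross covariance

# draw (a, b) jointly, then o = a*mu + b + Gaussian observation noise
ab = rng.multivariate_normal([mu_a, mu_b], [[s2_a, rho], [rho, s2_b]], 200000)
o = ab[:, 0] * mu + ab[:, 1] + rng.normal(0.0, np.sqrt(s2), 200000)

print(o.mean(), mu_a * mu + mu_b)                          # matches (19)
print(o.var(), s2 + s2_a * mu**2 + 2 * rho * mu + s2_b)    # matches (20)
```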
REFERENCES

[1] J. O. Berger, Statistical Decision Theory and Bayesian Analysis, 2nd ed. New York: Springer-Verlag, 1985.
[2] C. Che, N. Wang, M. Huang, H. Huang, and F. Seide, "Development of the Philips 1999 Taiwan Mandarin benchmark system," in Proc. Eur. Conf. Speech Communication and Technology (EUROSPEECH), vol. 2, 1999, pp. 803-806.
[3] C. Chesta, O. Siohan, and C.-H. Lee, "Maximum a posteriori linear regression for hidden Markov model adaptation," in Proc. Eur. Conf. Speech Communication and Technology (EUROSPEECH), vol. 1, 1999, pp. 211-214.
[4] J.-T. Chien and H.-C. Wang, "Telephone speech recognition based on Bayesian adaptation of hidden Markov models," Speech Commun., vol. 22, no. 4, pp. 369-384, 1997.
[5] J.-T. Chien and G.-H. Liao, "Transformation-based Bayesian predictive classification using online prior evolution," IEEE Trans. Speech Audio Processing, vol. 9, pp. 399-410, May 2001.
[6] J.-T. Chien, "Online hierarchical transformation of hidden Markov models for speech recognition," IEEE Trans. Speech Audio Processing, vol. 7, pp. 656-667, Nov. 1999.
[7] W. Chou, "Maximum a posteriori linear regression with elliptically symmetric matrix variate priors," in Proc. Eur. Conf. Speech Communication and Technology (EUROSPEECH), vol. 1, 1999, pp. 1-4.
[8] M. H. DeGroot, Optimal Statistical Decisions. New York: McGraw-Hill, 1970.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc., vol. 39, pp. 1-38, 1977.
[10] V. Digalakis, D. Rtischev, and L. G. Neumeyer, "Speaker adaptation using constrained estimation of Gaussian mixtures," IEEE Trans. Speech Audio Processing, vol. 3, pp. 357-366, Sept. 1995.
[11] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 29, pp. 254-272, Apr. 1981.
[12] Q. Huo, H. Jiang, and C.-H. Lee, "A Bayesian predictive classification approach to robust speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), 1997, pp. 1547-1550.
[13] C.-H. Lee and Q. Huo, "On adaptive decision rules and decision parameter adaptation for automatic speech recognition," Proc. IEEE, vol. 88, pp. 1241-1269, Aug. 2000.
[14] C.-H. Lee, L. R. Rabiner, R. Pieraccini, and J. G. Wilpon, "Acoustic modeling for large vocabulary speech recognition," Comput. Speech Lang., vol. 4, pp. 127-165, 1990.
[15] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Comput. Speech Lang., vol. 9, pp. 171-185, 1995.
[16] H. Jiang, K. Hirose, and Q. Huo, "Robust speech recognition based on Viterbi Bayesian predictive classification," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), 1997, pp. 1551-1554.
[17] H. Jiang, K. Hirose, and Q. Huo, "Improving Viterbi Bayesian predictive classification via sequential Bayesian learning in robust speech recognition," Speech Commun., vol. 28, no. 4, pp. 313-326, 1999.
[18] H. Jiang, K. Hirose, and Q. Huo, "Robust speech recognition based on a Bayesian prediction approach," IEEE Trans. Speech Audio Processing, vol. 7, pp. 426-440, July 1999.
[19] H. Jiang, K. Hirose, and Q. Huo, "A minimax search algorithm for robust continuous speech recognition," IEEE Trans. Speech Audio Processing, vol. 8, pp. 688-694, Nov. 2000.
[20] J. M. Mendel, Lessons in Estimation Theory for Signal Processing, Communications, and Control. Englewood Cliffs, NJ: Prentice-Hall, 1995, pp. 166-167.
[21] N. Merhav and C.-H. Lee, "A minimax classification approach with application to robust speech recognition," IEEE Trans. Speech Audio Processing, vol. 1, pp. 90-100, Jan. 1993.
[22] A. Nadas, "Optimal solution of a training problem in speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 326-329, Feb. 1985.
[23] B. M. Shahshahani, "A Markov random field approach to Bayesian speaker adaptation," IEEE Trans. Speech Audio Processing, vol. 5, pp. 183-191, Mar. 1997.
[24] A. C. Surendran and C.-H. Lee, "Predictive adaptation and compensation for robust speech recognition," in Proc. Int. Conf. Spoken Language Processing (ICSLP), vol. 2, 1998, pp. 463-466.
Jen-Tzung Chien (S'97-A'98-M'99) was born in Taipei, Taiwan, R.O.C., on August 31, 1967. He received the Ph.D. degree in electrical engineering from the National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 1997.
He was an Instructor in the Department of Electrical Engineering, NTHU, in 1995. In 1997, he joined the Department of Computer Science and Information Engineering, National Cheng Kung University (NCKU), Tainan, Taiwan, as an Assistant Professor. Since 2001, he has been an Associate Professor at NCKU. He was a Visiting Researcher at the Speech Technology Laboratory, Panasonic Technologies Inc., Santa Barbara, CA, in 1998, and at the Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan, in 2002. His research interests include statistical pattern recognition, speech recognition, speaker adaptation, face detection/recognition/verification, document classification, language modeling, information retrieval, microphone array beamforming, adaptive signal processing, and multimedia/multimodal human-computer interaction. He serves on the board of the International Journal of Speech Technology.
Dr. Chien is a member of the IEEE Signal Processing Society, the International Speech Communication Association, the Chinese Image Processing and Pattern Recognition Society, and the Association for Computational Linguistics and Chinese Language Processing (ACLCLP). He serves on the board of the ACLCLP. He is listed in Who's Who in the World, 2002 edition.