LONG SPAN FEATURES AND MINIMUM PHONEME ERROR HETEROSCEDASTIC LINEAR DISCRIMINANT ANALYSIS

Bing Zhang†, Spyros Matsoukas, Jeff Ma, Richard Schwartz

BBN Technologies, 50 Moulton St., Cambridge, MA 02138
{bzhang, smatsouk}@bbn.com

ABSTRACT

In this paper we explore the effect of long-span features, obtained by concatenating multiple speech frames and projecting the resulting vector onto a subspace using Linear Discriminant Analysis (LDA) techniques. We show that LDA is not always effective in selecting the optimal combination of long-span features, and introduce a discriminative feature analysis method that seeks to minimize phoneme errors on training lattices. This technique, referred to as Minimum Phoneme Error Heteroscedastic Linear Discriminant Analysis (MPE-HLDA), is shown to be more robust than LDA when applied to long-span features, and is easy to incorporate into existing training procedures, such as HLDA-SAT and discriminative training of Hidden Markov Models (HMMs). Results on conversational telephone speech and broadcast news corpora also show that recognition accuracy is improved using features selected by MPE-HLDA.

1. INTRODUCTION

In speech recognition systems, feature analysis is usually employed for better classification accuracy and complexity control. In recent years, extensions to classical Linear Discriminant Analysis (LDA) [1] have been widely adopted. Among them, Heteroscedastic Discriminant Analysis (HDA) [2] seeks to remove the equal class variance constraint assumed by LDA. In addition, the authors of [2] also applied a Maximum Likelihood Linear Transformation (MLLT) on top of LDA and HDA and improved accuracy. Maximum likelihood based Heteroscedastic Linear Discriminant Analysis (HLDA) [3, 4] generalizes LDA in another way, by putting the optimization of the feature projection inside an ML parametric estimation framework, which better fits the HMM modeling used in speech recognition systems.

Despite their differences, the above techniques share some common limitations. First, none of them assumes any prior knowledge of confusable hypotheses, so the features they select are liable to be suboptimal for recognition. Second, their objective functions do not directly relate to the word error rate (WER), which is often the ultimate goal of speech recognition systems. As a result, one often cannot tell from the objective function values alone whether the selected features will do well in testing. For example, we found that HLDA could select totally non-discriminant features while improving its objective function, by collapsing data samples along some dimensions.

† Bing Zhang is a Ph.D. student at the College of Computer and Information Science, Northeastern University, Boston, MA 02115.

Recent research showed that using discriminative criteria such as Maximum Mutual Information (MMI) [5] and Minimum Phoneme Error (MPE) [6] in optimizing Gaussian parameters improves recognition accuracy significantly. This suggests optimizing feature projections in the same way. Inspired by this, we developed a feature analysis approach based on the MPE criterion, MPE-HLDA, aimed at better recognition accuracy. In addition, since this criterion is closely related to WER, MPE-HLDA tends to be more robust than other methods, which makes it potentially better suited to a wider variety of features.

Under the MPE criterion we could perform EM updates of the feature projection and the Gaussian parameters jointly [7]; however, this would tie the projection closely to discriminatively placed Gaussians. Instead, we would like the optimization to concentrate on finding better features, by running the optimization of the feature projection as a standalone process. By doing this we gain more flexibility in using the resulting projections. For instance, we can perform regular ML training with them and still preserve the gains from MPE-HLDA. We can also plug MPE-HLDA into HLDA-SAT training [8] easily, and our experiments showed improvement in doing that.

Before describing the details of the MPE-HLDA method, we first turn to the use of long-span features, and discuss results and implementation issues in Section 2. The MPE-HLDA objective function and its implementation are presented in Sections 3 and 4. In Section 5, the system configuration of our experiments on conversational telephone speech (CTS) and broadcast news (BN) corpora is described. The results are then compared and analyzed in Section 6. Finally, the paper ends with conclusions and suggestions for future research.

2. LONG SPAN FEATURES

Cepstral features (MFCC, PLP, etc.) and their derivatives are popular in speech recognition systems. For example, the features typically used in our system start from PLP features concatenated with their first, second, and third-order derivatives. The base features occupy 15 dimensions (14 PLP + 1 energy) per frame, so with derivatives we end up with a 60-dimensional vector. We project this feature vector down to a lower, 46-dimensional space using LDA. With this approach, each higher-order derivative incorporates information from a wider window of neighboring frames (up to 4 frames on either side), which seems to help the final model's discriminative power.

With larger amounts of training data available, it is interesting to consider adding information from an even wider context using a more direct method. One approach is to eliminate the use of derivatives and instead concatenate N successive frames of base features (centered at the current frame) and then project this block of features to an m-dimensional feature space using LDA, as in the sketch below.
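To make the dimensionalities concrete, the following is a minimal NumPy sketch of this frame-concatenation step. The window size, base feature dimension, output dimension, and the random stand-in LDA matrix are illustrative assumptions only, not the actual system components.

```python
import numpy as np

def concatenate_frames(base_feats, n_frames):
    """Stack n_frames successive base-feature frames, centered on each frame.

    base_feats: (T, d) array of per-frame base features (e.g. d = 14 PLP + energy).
    Returns a (T, n_frames * d) array; edge frames are padded by repetition.
    """
    assert n_frames % 2 == 1, "use an odd span so the window is centered"
    half = n_frames // 2
    # Pad by repeating the first/last frame so every frame has a full context.
    padded = np.pad(base_feats, ((half, half), (0, 0)), mode="edge")
    T, d = base_feats.shape
    return np.stack(
        [padded[t : t + n_frames].reshape(-1) for t in range(T)], axis=0
    )

# Illustrative numbers only: 15-dimensional base features, a 15-frame span,
# projected to 60 dimensions by an LDA matrix estimated elsewhere.
T, d, n_frames, out_dim = 300, 15, 15, 60
base = np.random.randn(T, d)                  # stand-in for PLP + energy frames
lda = np.random.randn(out_dim, n_frames * d)  # stand-in for an estimated LDA projection

long_span = concatenate_frames(base, n_frames)  # (T, 225) concatenated block
projected = long_span @ lda.T                   # (T, 60) features fed to the HMM
```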

This frame-concatenation approach has been shown by IBM and other sites to be as good as using derivatives, but our interest is to extend it to allow a much higher-dimensional input feature space for the LDA, in the hope of extracting better final features for recognition. With this method it is much easier to increase the input dimension, either in the number of frames or in the number of features per frame.

2.1. Dimensionality Issues and Speaker Adaptation

One of the main issues with long span features is the increased dimensionality of the original feature space, as a result of concatenating a large number of frames. Training methods that estimate models in the original space are susceptible to overfitting, unless a sufficient amount of training data is available. One such method is HLDA-SAT [8], a Speaker Adaptive Training (SAT) method that estimates a Constrained Maximum Likelihood Linear Regression (CMLLR) [9] transform for each training speaker in the original feature space, before the global projection is applied. Estimation of the speaker dependent transforms is based on an HMM with a single full covariance Gaussian per tied state. To mitigate the dimensionality issue, we chose to constrain the structure of the transform and estimate a sparse (block-diagonal) matrix instead. To deal with cases of very little data, e.g. when adapting to a speaker (cluster) with only a couple of minutes of speech, we decided to constrain the matrix further and tie all diagonal blocks together. This is equivalent to estimating a low-dimensional CMLLR transform based on the center frames only, and then cloning the transform along the main diagonal to cover the desired frame span (a minimal sketch of this tiling is given below).

2.2. Effect of Training Corpus Size

The first set of experiments with long span features was conducted using speaker independent (SI) models, trained on the RT03 CTS English training corpus (370 hours). The baseline model was a crossword, quinphone State Clustered Tied Mixture (SCTM) Hidden Markov Model (HMM), with 5 states per quinphone, estimated on PLP cepstral derivatives spanning a total of 9 frames. HLDA was used to reduce the final dimensionality of the observation vectors down to 46. The baseline SCTM had approximately 442k Gaussians. Models trained with frame concatenation used LDA to reduce the observation vector dimensionality to 46, and then applied MLLT. HLDA was not used in this case, due to its high storage and computational cost. It is important to note, however, that even in the baseline derivative-feature HMM there is no significant gain from HLDA compared to LDA+MLLT, so we feel comfortable comparing the HLDA derivative baseline with the LDA+MLLT frame-concatenated systems. The results are shown in Table 1, where we can see that the use of long span features provides a small gain (about 0.3% absolute) with a span of 11 frames. Increasing the span to 15 frames offers no additional improvement.

The next set of experiments was conducted on the full RT04 training corpus, which included an additional 1800 hours of Fisher training data. In addition to increasing the time span, we also experimented with increasing the number of LDA output dimensions. All models in this comparison had approximately 900k Gaussians. The results are summarized in Table 2. It is interesting to see that on the 2300-hour corpus there is a significant gain from frame concatenation (about 1% absolute).
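The sketch below illustrates the constraint described in Section 2.1: a low-dimensional CMLLR transform estimated on the center frame is cloned along the main diagonal to cover the concatenated span. It shows the tiling only, under assumed dimensions; the CMLLR estimation itself is as in [9] and is not reproduced here.

```python
import numpy as np

def tile_cmllr(W_small, n_frames):
    """Clone a per-frame CMLLR transform along the block diagonal.

    W_small: (d, d + 1) CMLLR transform [A_s | b_s] estimated on single-frame
             features (illustrative; the estimation itself is not shown).
    Returns a (n_frames*d, n_frames*d + 1) transform that applies the same
    A_s to every frame of the concatenated vector and stacks the biases,
    i.e. a block-diagonal matrix with all diagonal blocks tied together.
    """
    d = W_small.shape[0]
    A_s, b_s = W_small[:, :d], W_small[:, d]
    A_big = np.kron(np.eye(n_frames), A_s)   # same d x d block repeated n_frames times
    b_big = np.tile(b_s, n_frames)
    return np.hstack([A_big, b_big[:, None]])

# Example with assumed sizes: 15-dim base frames, 15-frame concatenation.
d, n_frames = 15, 15
W_small = np.hstack([np.eye(d), np.zeros((d, 1))])  # stand-in for an estimated CMLLR
W_big = tile_cmllr(W_small, n_frames)                # shape (225, 226)
```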

#Frames    Pre-LDA Transform    % WER
9          Derivatives          26.4
9          Concatenation        26.2
11         Concatenation        26.1
15         Concatenation        26.3

Table 1. Unadapted SI decoding results on the 2001 Hub-5 English CTS evaluation test set (Eval01), showing the effect of long span features with models trained on 370 hours of data.

#Frames    Pre-LDA Transform    #out dim.    % WER
9          Derivatives          46           26.4
9          Derivatives          60           27.0
15         Concatenation        46           25.9
15         Concatenation        60           25.3
15         Concatenation        72           25.2
21         Concatenation        72           25.5

Table 2. Unadapted SI decoding results on the RT03 English CTS evaluation test set (Eval03), showing the effect of long span features with models trained on 2300 hours of data.

Half of that gain is due to the increased frame span, and the other half to the increased LDA output dimensionality. Note that increasing the LDA output dimensionality to 60 in the baseline derivative-feature system results in a degradation rather than a gain. This implies that there is no useful (discriminative) information beyond 46 LDA dimensions when the input stream is composed of the cepstra/energy and their derivatives.

When we first observed the above results, we rushed to conclude that long span features require a large amount of training data for the estimation of the LDA transform. To verify this hypothesis, we ran an experiment in which the LDA was estimated from 2300 hours of training data, while the MLLT and Gaussians were estimated from 370 hours. This system was compared to the 370-hour baseline, as shown in Table 3.

LDA training size (hours)    Model training size (hours)    % WER
370                          370                            26.3
2300                         370                            26.4

Table 3. Unadapted SI decoding results on the Hub-5 2001 English CTS evaluation test set (Eval01), showing the effect of the amount of training data used in the estimation of the LDA. In both cases the final SCTM HMM was trained on 370 hours of data.

The results show that there is no benefit from using the extra 1800 hours in the estimation of the LDA transform. After a closer examination of the results in Table 1, we noticed that the MLLT objective function value decreased as we increased the time span in the frame concatenation experiments. Recall that the MLLT objective function is simply the ratio of the HMM likelihood with diagonal covariances to the full covariance HMM likelihood. A decrease in this ratio implies that the diagonal covariance assumption becomes less valid in the resulting feature subspace. The 370-hour system used 440k Gaussians, while the 2300-hour system could afford to estimate 900k Gaussians. Perhaps the increased number of parameters allowed the 2300-hour system to model the correlations adequately and obtain the observed gain in accuracy. It would be interesting to try a more detailed semi-tied covariance model on the 370-hour corpus with frame concatenation, or simply to increase the number of Gaussians, to see whether the accuracy improves.
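For reference, the criterion just described can be written out explicitly. Using the standard statement of the MLLT objective (this is a textbook formula, not one given in this paper), with per-Gaussian frame counts γ_m and full covariances Σ_m in the input space of a square transform B, the log of the diagonal-to-full likelihood ratio is

$$\log F_{MLLT}(B) = -\frac{1}{2} \sum_m \gamma_m \log \frac{\left|\mathrm{diag}\!\left(B \Sigma_m B^T\right)\right|}{\left|B \Sigma_m B^T\right|}$$

This quantity is at most zero, and equals zero only when every $B \Sigma_m B^T$ is already diagonal, so a decreasing ratio directly signals that the diagonal covariance assumption is becoming less valid.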

2.3. Results with SAT and Adaptation

To see the performance of frame concatenated models with adaptation, we trained ML SAT models using the block diagonal constraint described in Section 2.1. These models were then used to decode the training data with a bigram language model, generating phone-marked lattices. Several iterations of MPE training of the Gaussian parameters were performed on the phone-marked, error-marked lattices, starting from the ML-SAT models. The performance of these models was compared to a similarly estimated derivative-feature HMM, as shown in Table 4.

#Frames    Pre-LDA Transform    #out dim.    % WER
9          Derivatives          46           15.4
15         Concatenation        60           14.9

Table 4. Adapted MPE SAT lattice rescoring results on the 2004 English CTS Fisher development test set (Dev04), showing the effect of long span features with models trained on 2300 hours of data.

We can see that the gain from 15-frame concatenation was reduced to 0.5% absolute after adaptation and MPE training. It is worth noting that the results of Table 4 were obtained by rescoring a lattice generated with the baseline derivative model, adapting to the previous-pass baseline 1-best hypothesis. One could argue that this setup can introduce a system combination benefit in the frame concatenated results, due to cross-adaptation. We are planning to investigate this in more detail in the future.

3. MPE OBJECTIVE FUNCTION AND DERIVATIVE

If we define A (of size p × n) as a global feature projection matrix that linearly maps n-dimensional original features to p-dimensional ones, where p < n, then the Gaussian parameters and features in the p-dimensional space can be written as

$$\hat{\mu}_m = A \mu_m \qquad (1)$$

$$\hat{C}_m = \mathrm{diag}\!\left(A \Sigma_m A^T\right) \qquad (2)$$

$$\hat{o}_t = A o_t \qquad (3)$$

where µ_m and Σ_m are the means and covariances in the original space. We will refer to λ = ⟨Ĉ_m, µ̂_m⟩, the model in the reduced feature space, as the MPE-HLDA model.

Given the definitions above, MPE-HLDA aims at minimizing the expected number of phoneme errors in the reduced feature space, or equivalently maximizing the function

$$F_{MPE}(\hat{O}, \lambda) = \sum_{r=1}^{R} \sum_{w_r} P_\lambda(w_r \mid \hat{O}_r)\, \varepsilon(w_r) \qquad (4)$$

where R is the total number of training utterances, Ô_r is the sequence of p-dimensional observation vectors in utterance r, and ε(w_r) is the "raw accuracy" score of word hypothesis w_r as defined in [10]. P_λ(w_r | Ô_r) is the posterior probability of hypothesis w_r in the lattice, computed as follows:

$$P_\lambda(w_r \mid \hat{O}_r) = \frac{P_\lambda(\hat{O}_r \mid w_r)^k \, P(w_r)}{\sum_{w_r'} P_\lambda(\hat{O}_r \mid w_r')^k \, P(w_r')} \qquad (5)$$

where P(w_r) is the language model probability of hypothesis w_r, and k is an exponent applied to the acoustic scores in order to reduce their dynamic range, thereby avoiding the concentration of all posterior mass on the top-1 hypothesis in the lattice.

The derivative of (4) with respect to A can be computed using the chain rule as

$$\frac{\partial F_{MPE}(\hat{O}, \lambda)}{\partial A} = \frac{\partial F_{MPE}(\hat{O}, \lambda)}{\partial \lambda} \cdot \frac{\partial \lambda}{\partial A} \qquad (6)$$

In a lattice-based framework where hypotheses are represented as sequences of arcs, the first factor is

$$\frac{\partial F_{MPE}(\hat{O}, \lambda)}{\partial \lambda} = k \sum_{r=1}^{R} \sum_{q_r} P_\lambda(q_r \mid \hat{O}, \lambda) \left[ \varepsilon'(q_r) - \alpha'(r) \right] \frac{\partial \log P_\lambda(\hat{O}_{q_r} \mid q_r, \lambda)}{\partial \lambda} \qquad (7)$$

where α'(r) is the MPE score of utterance r, and ε'(q_r) is the average accuracy over all hypotheses that contain arc q_r. It has been shown in [6] that both ε'(q_r) and α'(r) can be computed efficiently via forward/backward passes over error-marked lattices. Within arc q_r, we also have

$$\frac{\partial \log P_\lambda(\hat{O}_{q_r} \mid q_r, \lambda)}{\partial \lambda} = \sum_{t=S_{q_r}}^{E_{q_r}} \sum_m \gamma_{q_r,m}(t)\, \frac{\partial \log P_\lambda(\hat{o}_t \mid m, \lambda)}{\partial \lambda} \qquad (8)$$

where S_{q_r} and E_{q_r} are the begin and end times of arc q_r, respectively, and γ_{q_r,m}(t) denotes the posterior probability of Gaussian m in arc q_r. Plugging (7) and (8) into (6), we get

$$\frac{\partial F_{MPE}(\hat{O}, \lambda)}{\partial A} = k \sum_{r=1}^{R} \sum_{q_r} D(q_r, r) \cdot \sum_{t=S_{q_r}}^{E_{q_r}} \sum_m \gamma_{q_r,m}(t)\, \frac{\partial \log P_\lambda(\hat{o}_t \mid m, \lambda)}{\partial A} \qquad (9)$$

where

$$D(q_r, r) = P_\lambda(q_r \mid \hat{O}, \lambda) \left[ \varepsilon'(q_r) - \alpha'(r) \right] \qquad (10)$$

Finally, the derivative of the log Gaussian likelihood with respect to A in (9) is

$$\frac{\partial \log P_\lambda(\hat{o}_t \mid m, \lambda)}{\partial A} = \left\{ \hat{C}_m^{-1}\, \mathrm{diag}\!\left[ (\hat{o}_t - \hat{\mu}_m)(\hat{o}_t - \hat{\mu}_m)^T \right] - I_p \right\} \hat{C}_m^{-1} A \Sigma_m \;-\; \hat{C}_m^{-1} (\hat{o}_t - \hat{\mu}_m)(o_t - \mu_m)^T \qquad (11)$$
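To make the notation of (11) concrete, the sketch below evaluates the per-frame, per-Gaussian derivative with NumPy and checks one entry against a finite difference. The shapes and random inputs are illustrative assumptions; in the real system the statistics come from the lattice forward/backward pass described in Section 4.

```python
import numpy as np

def dlogN_dA(A, o_t, mu_m, Sigma_m):
    """Derivative w.r.t. A of log N(A o_t ; A mu_m, diag(A Sigma_m A^T)), as in eq. (11)."""
    p = A.shape[0]
    e = o_t - mu_m                       # residual in the original space
    e_hat = A @ e                        # residual in the projected space (eqs. 1, 3)
    c = np.diag(A @ Sigma_m @ A.T)       # diagonal of the projected covariance (eq. 2)
    C_inv = np.diag(1.0 / c)
    term1 = (C_inv @ np.diag(e_hat ** 2) - np.eye(p)) @ C_inv @ A @ Sigma_m
    term2 = C_inv @ np.outer(e_hat, e)
    return term1 - term2

# Finite-difference check on small, randomly generated (assumed) dimensions.
rng = np.random.default_rng(0)
p, n = 3, 6
A = rng.standard_normal((p, n))
o_t, mu_m = rng.standard_normal(n), rng.standard_normal(n)
B = rng.standard_normal((n, n))
Sigma_m = B @ B.T + n * np.eye(n)        # a valid full covariance

def log_gauss(A):
    """Log Gaussian likelihood in the projected space, dropping constants in A."""
    e_hat = A @ (o_t - mu_m)
    c = np.diag(A @ Sigma_m @ A.T)
    return -0.5 * (np.sum(np.log(c)) + np.sum(e_hat ** 2 / c))

grad = dlogN_dA(A, o_t, mu_m, Sigma_m)
i, j, eps = 1, 2, 1e-6
A_plus, A_minus = A.copy(), A.copy()
A_plus[i, j] += eps
A_minus[i, j] -= eps
numeric = (log_gauss(A_plus) - log_gauss(A_minus)) / (2 * eps)
print(abs(numeric - grad[i, j]))         # should be tiny, consistent with eq. (11)
```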

4. IMPLEMENTATION

The most important issue in the implementation of MPE-HLDA is computational cost, because Eq. (11) involves the full covariance matrices Σ_m. In a typical system, where the number of Gaussians can be 1×10^5 or more, it is impossible to hold these matrices in memory. On the other hand, frequent I/O access to secondary storage during the forward/backward training is not desirable either. For these reasons, one forward/backward pass over the lattices is not sufficient to compute the derivative. However, from Eq. (9) and (11) we observe that the sufficient statistics of the derivative include the following terms, which are independent of Σ_m:

$$F_m = \sum_{r} \sum_{q_r} D(q_r, r) \sum_{t=S_{q_r}}^{E_{q_r}} \gamma_{q_r,m}(t)\, \mathrm{diag}\!\left[ (\hat{o}_t - \hat{\mu}_m)(\hat{o}_t - \hat{\mu}_m)^T \right] \qquad (12)$$

$$G = \sum_{r} \sum_{q_r} D(q_r, r) \sum_{t=S_{q_r}}^{E_{q_r}} \sum_m \gamma_{q_r,m}(t)\, \hat{C}_m^{-1} (\hat{o}_t - \hat{\mu}_m)(o_t - \mu_m)^T \qquad (13)$$

This suggests that we can accumulate F_m and G over the lattices and then synthesize the derivative from them and Σ_m. The full covariance matrices can either be precomputed and saved on disk, or computed on the fly after we obtain F_m and G. These two approaches lead to a time-efficient implementation and a storage-efficient implementation, respectively.

In the time-efficient implementation, µ_m and Σ_m are precomputed and saved on disk at the beginning of the optimization. They can then be used to synthesize the MPE derivative and update the model λ, assuming that the change of A does not affect the posterior probabilities of the Gaussians given the training examples. Since the synthesis of the MPE derivative from Σ_m, F_m and G is fast, and so is the update of λ, time is spent mainly on the lattice-based accumulation. In the storage-efficient implementation, on the other hand, no full covariance statistics need to be saved. We can run an additional forward/backward pass on the labels to obtain Σ_m on the fly once F_m and G have been computed. In this case, however, updating the model cannot be done using Eq. (1) and (2), which means that we need another forward/backward pass to compute λ with the new estimate of A. In short, it is a tradeoff between time and storage.

We implemented the time-efficient approach in our experiments. As discussed above, we first compute µ_m and Σ_m by running a single forward/backward retraining pass on labeled data, in which a maximum likelihood initial model is used in the forward pass to compute the posteriors of the Gaussians, and those posteriors are then used in the backward pass to estimate µ_m and Σ_m in the original feature space. We ran a number of redundant jobs of this single-pass retraining in parallel, each accumulating statistics for a fraction of all Gaussians, so that the memory usage of each job could be limited. After that, we compute Ĉ_m, µ̂_m and ô_t based on the current estimate of the projection matrix A, and run a forward/backward pass over the lattices to accumulate F_m and G, which are then used together with Σ_m to calculate the derivative. Based on the flexibility considerations discussed in Section 1, we used gradient descent to update the projection matrix, instead of doing an EM update of all model parameters. Though gradient descent is commonly thought to be inefficient, in practice we found that the optimization usually converges in fewer than 20 iterations.

Once we have optimized the MPE-HLDA model, we can use it for a second retraining pass to update µ_m and Σ_m. This step is especially useful if the initial projection and ML model are far from the optimal configuration. The result is an iterative optimization procedure, as follows (a sketch of the inner loop in step 4 is given after the list):

1. Initialize the feature projection matrix A^(0) by LDA or HLDA, and the MPE-HLDA model M̂^(0) by Gaussian splitting.

2. Set i ← 1.

3. Compute covariance statistics Σ_m^(i) in the original feature space:

   (a) Do a maximum likelihood update of the MPE-HLDA model M̂^(i−1) in the feature space defined by A^(i−1);

   (b) Do single pass retraining using M̂^(i−1) to generate µ_m^(i) and Σ_m^(i) in the original feature space.

4. Optimize the feature projection matrix:

   (a) Set j ← 0, A_j^(i) ← A^(i−1).

   (b) Project µ_m^(i), Σ_m^(i) using A_j^(i) to get the model M̂_j^(i) in the reduced subspace.

   (c) Run a forward/backward pass on the lattices using M̂_j^(i) to get the Σ_m^(i)-independent statistics for the MPE derivative.

   (d) Use Σ_m^(i), A_j^(i) and the statistics from 4c to compute the MPE derivative D_j^(i) and the MPE score F_j^(i).

   (e) Update A_j^(i) to A_{j+1}^(i) using D_j^(i) and F_j^(i).

   (f) Set j ← j + 1 and go to 4b, unless convergence or the maximum number of iterations is reached.

5. Optionally, set A^(i) ← A_{j−1}^(i), M̂^(i) ← M̂_{j−1}^(i), i ← i + 1, and go to 3 to repeat.
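The sketch below illustrates the shape of the inner loop in step 4: accumulating the Σ_m-independent statistics F_m and G of (12) and (13) over lattice arcs, synthesizing the derivative with the saved full covariances, and taking a gradient step on A. The lattice and statistics interfaces (`arcs`, `D`, `gamma`, `obs`) are hypothetical stand-ins for the real lattice forward/backward machinery, the per-Gaussian weight accumulator N_m is not spelled out in the paper, and the simple fixed step size is an assumption; the code only shows how the pieces fit together.

```python
import numpy as np

def mpe_hlda_grad_step(A, lattices, mu, Sigma, k=0.05, step=1e-3):
    """One illustrative inner-loop iteration of step 4 (time-efficient variant).

    A        : (p, n) current projection.
    lattices : iterable of utterances; each yields arcs carrying a weight
               arc.D = P(q_r | O, lambda) * (eps'(q_r) - alpha'(r)), frame-indexed
               Gaussian posteriors arc.gamma[t], and observations utt.obs[t].
               (Hypothetical interface, not the actual lattice code.)
    mu, Sigma: dict m -> original-space mean (n,) and full covariance (n, n),
               precomputed by the single-pass retraining.
    """
    p, n = A.shape
    mu_hat = {m: A @ mu[m] for m in mu}                    # eq. (1)
    c_hat = {m: np.diag(A @ Sigma[m] @ A.T) for m in mu}   # eq. (2), diagonal only
    F = {m: np.zeros(p) for m in mu}     # eq. (12), kept as a p-vector
    N = {m: 0.0 for m in mu}             # weighted counts (an added accumulator)
    G = np.zeros((p, n))                 # eq. (13)

    # Accumulation pass over the lattices: everything here avoids the full Sigma_m.
    for utt in lattices:
        for arc in utt.arcs:
            for t in range(arc.start, arc.end):
                o_t = utt.obs[t]
                o_hat = A @ o_t                            # eq. (3)
                for m, g in arc.gamma[t].items():
                    w = arc.D * g
                    e_hat = o_hat - mu_hat[m]
                    F[m] += w * e_hat ** 2
                    N[m] += w
                    G += w * np.outer(e_hat / c_hat[m], o_t - mu[m])

    # Synthesize the derivative from F, N, G and the saved full covariances,
    # following eq. (9) and (11).
    dF_dA = -k * G
    for m in mu:
        C_inv = np.diag(1.0 / c_hat[m])
        dF_dA += k * (np.diag(F[m] / c_hat[m]) - N[m] * np.eye(p)) @ C_inv @ A @ Sigma[m]

    # Simple fixed-step ascent on F_MPE (equivalently, descent on -F_MPE);
    # the paper's optimizer and step-size control are not reproduced here.
    return A + step * dF_dA
```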

5. EXPERIMENTAL SETUP

We ran experiments on both Conversational Telephone Speech (CTS) and Broadcast News (BN) corpora, as part of the DARPA EARS research project [11]. For CTS, we have approximately 2300 hours of training data, of which we used 800 hours for training the initial ML model M̂^(0) and the remainder as held-out training data for lattice generation and discriminative training. For MPE-HLDA, only 370 hours of the held-out data were used, to reduce training time. Similarly for BN, we used 600 hours of data from the Hub4 and TDT corpora for training the initial model M̂^(0), and 330 hours of held-out data for MPE-HLDA estimation.

Experiments were performed both with and without HLDA-SAT. With HLDA-SAT, a CMLLR-based adaptation was performed first on the 15-dimensional (14 cepstral coefficients and normalized energy) Perceptual Linear Predictive (PLP) coefficients, resulting in adapted cepstra with smaller within-class variances, which served as the input to MPE-HLDA. Apart from this difference in the cepstra, HLDA-SAT is totally transparent to MPE-HLDA.

Our first experiments used cepstra and their first, second and third derivatives as input feature vectors for HLDA [11]; however, recent research suggested that there is more useful information in longer contexts (concatenated frames), especially for the CTS corpus. To reduce the computational cost, we adopted a "two-level projection" approach. Given the dimensionality ℓ of the concatenated features and the target dimensionality p, we first estimate a projection matrix L (of size n × ℓ) at the first level and keep it fixed throughout the MPE-HLDA optimization. At the second level, we use MPE-HLDA to update the projection matrix A (of size p × n) using the procedure described in Sections 3 and 4. In usage, A · L forms the global feature projection. ℓ is determined by the number of frames to concatenate; p and n are chosen based on experience, typically p = 60 and n = 130 or 190 in our experiments.

For acoustic modeling, State Clustered Tied Mixture (SCTM) models were used. The SCTM model uses two levels of state tying: at the first level, states are tied based on a decision tree to share Gaussian parameters, while at the second level states share mixture weights for a particular Gaussian cluster (codebook). Each MPE-HLDA model contained only 12 components per codebook cluster, much smaller than in regular systems, and it was used only for estimating the feature projection. The initial model was trained based on the ML criterion with labeled data and an initial feature projection, which was formed by a first-level LDA projection and a second-level HLDA projection. Lattices were generated on held-out training data, and marked with trigram language model scores and arc accuracy scores [6]. The idea of using held-out data and a strong language model is to approximate the testing scenario as closely as possible.

6. RESULTS

6.1. English CTS Results

On the CTS corpus, we ran both unadapted and adapted experiments with frame concatenated PLP cepstra. We measured the performance of MPE-HLDA on the EARS 2003 Evaluation test set, Eval03. In the unadapted experiments, we used non-crossword state clusters for acoustic modeling, LDA for the first-level projection L, and HLDA for the second-level initial projection A0. Table 5 shows the WER from training and decoding with MPE-HLDA models, and the WER with large ML non-crossword models that were trained using the projections before and after optimization. We concatenated 15 frames, hence ℓ = 225, and we chose n = 130 and p = 60 as the input and output dimensionalities. Two main iterations of joint optimization of the matrix and covariance statistics were performed, using the procedure outlined in Section 4.

Iteration       MPE-HLDA model tr. WER    MPE-HLDA model Eval03    Final ML model Eval03
0 (HLDA)        30.5                      30.6                     25.9
1 (MPE-HLDA)    29.6                      29.3                     25.6
2 (MPE-HLDA)    29.0                      28.8                     25.4

Table 5. Unadapted English CTS 15-frame MPE-HLDA compared to the baseline.

First, we can see that the WER was reduced significantly for the 12 Gaussian-per-state (12GPS) MPE-HLDA model. One may argue that the gains might not come from better features, but from tuning the model to fit the MPE criterion. To answer this question, we ran two groups of tests with the optimized model and projection from iteration 1:

1. We took the optimized MPE-HLDA model from the first iteration of Table 5, and updated the Gaussian parameters with regular ML training. If the optimized feature projection did not select better features at all, there should have been apparent degradations after the ML update; however, it turned out that the degradation was small (Table 6).

2. We took both the initial and the optimized (the same one as above) MPE-HLDA models, and updated the Gaussians under the MPE criterion. The results showed a 0.7% gain after training when using the optimized projection (Table 7), which again validated the effect of the MPE-HLDA optimization.

ML Iter.    Eval03 WER
0           29.3
7           29.4

Table 6. ML update of the optimized MPE-HLDA model (Model 1 of Table 5).

Model           MPE Iter.    Eval03 WER
0 (HLDA)        0            30.6
0 (HLDA)        5            25.8
1 (MPE-HLDA)    0            29.3
1 (MPE-HLDA)    5            25.1

Table 7. MPE update of the initial and optimized MPE-HLDA models (Models 0 and 1 of Table 5).

These results suggest that it was not merely tuning that happened; rather, the features were indeed better after the optimization. Another interesting observation from Table 5 is that the gains shrink after retraining of the larger ML models. This is plausible, because the MPE-HLDA models use only a small number of Gaussians to model the feature space, while the retrained models contain many more parameters. We made a few attempts with larger MPE-HLDA models, but unfortunately we did not obtain any further benefit.

As an alternative initial projection A0, a 60 × 130 MLLT matrix¹ was also tried. It turned out to perform as well as HLDA while being cheaper to run, so we adopted it as the initialization method in later experiments.

Given the gains from MPE-HLDA, we then used it in conjunction with HLDA-SAT for estimating a global speaker independent feature projection. For best performance, crossword models were used in MPE-HLDA. After the optimal projection was found, we plugged it into the HLDA-SAT procedure to build a full system and decoded the Eval03+Dev04 test set. Table 8 shows the resulting WERs. The results verified the improved performance of MPE-HLDA.

Projection    Model      Dev04
LDA+MLLT      SAT        17.3
MPE-HLDA      SAT        16.9
LDA+MLLT      SAT-MPE    16.2
MPE-HLDA      SAT-MPE    15.9

Table 8. Adapted English CTS 15-frame MPE-HLDA compared to the baseline.

¹ First a 60 × 60 MLLT transform was estimated in the projected space defined by the first-level LDA; it was then padded with zeros to make a 60 × 130 projection.
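As a small illustration of the footnote above, the initialization described there amounts to embedding a square MLLT transform into a wider rectangular projection by zero-padding. The matrices below are random stand-ins; only the shapes follow the text.

```python
import numpy as np

# Assumed shapes from the footnote: a 60 x 60 MLLT estimated in the space defined
# by the first-level LDA, padded with zeros to form the 60 x 130 initial A0.
mllt = np.random.randn(60, 60)               # stand-in for the estimated MLLT transform
A0 = np.hstack([mllt, np.zeros((60, 70))])   # 60 x 130 initial second-level projection
assert A0.shape == (60, 130)
```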

6.2. Mandarin CTS Results

On the Mandarin corpus we did not use held-out data in the MPE-HLDA optimization, because we only had about 80 hours of training data. However, the results were still promising. Table 9 shows the WERs on the Dev04 test set, comparing the performance of SAT models and MPE models using the MPE-HLDA projection to those using the LDA+MLLT projection.

Projection    Model      H4d04
LDA+MLLT      SAT        35.6
MPE-HLDA      SAT        34.6
LDA+MLLT      SAT-MPE    33.3
MPE-HLDA      SAT-MPE    32.7

Table 9. Adapted Mandarin CTS 9-frame MPE-HLDA compared to the baseline.

6.3. English BN Results

Table 10 shows the WERs on the training data (via lattice annotation) and on the h4d04 test set. The configuration of the MPE-HLDA experiments was the same as in the adapted CTS experiments, except that we also tested with an increased number of concatenated frames.

Frames    Projection    tr. WER    H4d04
9         LDA+MLLT      12.7       12.8
9         MPE-HLDA      12.7       12.7
15        LDA+MLLT      12.7       12.9
15        MPE-HLDA      12.2       12.7
23        LDA+MLLT      13.1       13.2
23        MPE-HLDA      12.1       12.7

Table 10. Results of English BN SAT models.

We see two interesting results from these experiments. First, MPE-HLDA was more robust than LDA+MLLT as the dimensionality became higher: from 9 to 23 frames, LDA+MLLT kept degrading, while MPE-HLDA recovered the loss even from the poor initial point. Second, we chose to use 15 and 23 frames for MPE-HLDA because previous experiments at BBN had shown that features from 9 frames gained little over features from the base cepstra and deltas. On the BN corpus, however, the MPE-HLDA results show that using more than 9 frames also did not provide any further benefit. This makes us believe that the nature of this corpus is different from that of CTS. The exact reason is unclear, but we do know that there is more silence and music between speech segments in BN data, so a longer concatenation window may hurt by including more irrelevant features.

7. CONCLUSIONS

In this paper we have taken a first look at a new feature analysis method, MPE-HLDA. Its application to both unadapted and speaker adapted training shows that it is effective in reducing recognition errors, and that it is more robust than other commonly used analysis methods such as LDA and HLDA.

We believe that MPE-HLDA can be improved further. In future work, we will explore linear or nonlinear features other than frame concatenation, and we are also interested in using multiple projections and more efficient modeling in the original feature space.

8. REFERENCES

[1] R. Duda, P. Hart, and D. Stork, Pattern Classification, Wiley-Interscience, 2nd edition, 2000.

[2] G. Saon, M. Padmanabhan, R. Gopinath, and S. Chen, "Maximum likelihood discriminant feature spaces," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, IEEE, June 2000, vol. 2, pp. II-1129–II-1132.

[3] N. Kumar and A. G. Andreou, "A generalization of linear discriminant analysis in maximum likelihood framework," Tech. Rep. JHU-CLSP Technical Report No. 16, Johns Hopkins University, Aug. 1996.

[4] M. J. F. Gales, "Maximum likelihood multiple subspace projections for hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 2, pp. 37–47, Feb. 2002.

[5] V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young, "MMIE training of large vocabulary recognition systems," Speech Communication, vol. 22, pp. 303–314, June 1997.

[6] D. Povey and P. C. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2002.

[7] X. Liu and M. J. F. Gales, "Discriminative training of multiple subspace projections for large vocabulary speech recognition," Tech. Rep. CU Technical Report No. 489, Cambridge University, England, Aug. 2004.

[8] S. Matsoukas and R. Schwartz, "Improved speaker adaptation using speaker dependent feature projections," in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, Virgin Islands, U.S., Nov. 2003, pp. 273–278.

[9] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Tech. Rep. 291, University of Cambridge, May 1997.

[10] D. Povey, "Improved discriminative training for large vocabulary continuous speech recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2001.

[11] R. Schwartz et al., "Speech recognition in multiple languages and domains: the 2003 BBN/LIMSI EARS system," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Montreal, Quebec, Canada, May 2004, IEEE, vol. III, pp. 753–756.
