Discriminative training for Automatic Speech Recognition using the Minimum Classification Error framework
Erik McDermott
NTT Communication Science Labs, NTT Corporation
[email protected]
Motivation & Overview
- Need for discriminative training
  - Overcome modeling limitations
  - Focus on directly improving recognition performance
- Overview:
  - MCE fundamentals
  - Smoothed error rate → parallel with large margin training
  - MCE vs. Maximum Likelihood results for large-scale speech recognition tasks
MCE training for generic models
[Figure: block diagram of MCE training for a generic recognizer.]
- A training pattern x from category k is fed to the recognition system (parameters: Λ), which produces score(x, Λ, category i) for i = 1, ..., M.
- Decision = best-scoring category; error = 0 if best = k, 1 otherwise.
- Misclassification measure: d_k(x, Λ) = best incorrect score − correct score
- Loss = sigmoid(d_k(x, Λ)); use ∂loss/∂Λ to define ΔΛ, i.e. new Λ = Λ + ΔΛ; see the sketch below.
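As a minimal sketch of this loop, assuming a toy linear scorer and plain gradient descent (neither of which is specified on the slide; all names here are illustrative):

```python
import numpy as np

def mce_step(x, k, W, alpha=1.0, lr=0.01):
    """One MCE update for a toy linear scorer (sketch).

    x : feature vector from category k (the correct class index)
    W : (M, D) weight matrix; score(x, Lambda, category j) = W[j] @ x
    """
    scores = W @ x                                  # score(x, Lambda, category j)
    competitors = np.delete(scores, k)
    d_k = competitors.max() - scores[k]             # best incorrect - correct
    loss = 1.0 / (1.0 + np.exp(-alpha * d_k))       # sigmoid loss

    # Chain rule: d loss / d d_k, then d d_k / d W for the two rows involved.
    dloss_dd = alpha * loss * (1.0 - loss)
    j = np.argmax(competitors)
    j = j if j < k else j + 1                       # undo the np.delete index shift

    W[j] -= lr * dloss_dd * x                       # push best incorrect score down
    W[k] += lr * dloss_dd * x                       # push correct score up
    return loss
```

The same skeleton carries over to HMM-based recognizers: only the scoring function and the parameter gradient change.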
Searching for the Bayes classifier
- Bayes decision rule: decide C_i if P(C_i|x) > P(C_j|x) for all j ≠ i
- In principle, the same optimal error can be attained using discriminant functions: decide C_i if g_i(x, Λ) > g_j(x, Λ) for all j ≠ i
Maximum Likelihood fails to separate!
[Figure: estimated pdfs (joint probability density, up to about 0.25) for two uniformly distributed categories, plotted over the observation space (0-15).]
MCE Misclassification Measure
The misclassification measure compares the correct and best incorrect categories:
$$d_k(x_1^T, \Lambda) = -g_k(x_1^T, \Lambda) + \max_{j \neq k}\, g_j(x_1^T, \Lambda)$$
$d_k(x_1^T, \Lambda) < 0$ → correct classification, and $d_k(x_1^T, \Lambda) \geq 0$ → incorrect classification.
Special case of the continuous definition (Chou, 1992):
$$d_k(x_1^T, \Lambda) = -g_k(x_1^T, \Lambda) + \log \left[ \frac{1}{M-1} \sum_{j \neq k} e^{g_j(x_1^T, \Lambda)\,\psi} \right]^{\frac{1}{\psi}}$$
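A small numerical sketch of the smoothed measure, assuming plain NumPy (the scores and ψ value are made up for illustration):

```python
import numpy as np

def misclassification_measure(scores, k, psi=2.0):
    """Smoothed MCE misclassification measure d_k (Chou-style definition).

    scores : g_j(x_1^T, Lambda) for j = 1..M (1-D array)
    k      : index of the correct category
    psi    : smoothing exponent; as psi -> infinity this approaches
             -g_k + max_{j != k} g_j
    """
    g_k = scores[k]
    g_j = np.delete(scores, k) * psi
    m = g_j.max()                                  # stable log-sum-exp
    lse = m + np.log(np.exp(g_j - m).sum())
    smooth_max = (lse - np.log(len(scores) - 1)) / psi
    return -g_k + smooth_max

# Correct category 0 narrowly beats its best competitor -> d_k < 0.
print(misclassification_measure(np.array([2.0, 1.5, -1.0]), k=0, psi=4.0))
```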
Discriminant function for HMMs
Defined using the best Viterbi path Θ_j:
$$g_j(x_1^T, \Lambda) = \log P(S_j) + \sum_{t=1}^{T} \log a_{\theta^j_{t-1}\theta^j_t} + \sum_{t=1}^{T} \log b_{\theta^j_t}(x_t)$$
Input: sequence of feature vectors, x_1^T = (x_1, ..., x_T)
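A sketch of this score for one already-decoded state path, assuming a log-domain transition matrix log_a and a per-state log emission function log_b (these names, and folding the initial transition into the prior term, are illustrative choices, not from the slide):

```python
import numpy as np

def hmm_path_log_score(x, path, log_prior, log_a, log_b):
    """g_j(x_1^T, Lambda) accumulated along one Viterbi state path (sketch).

    x         : (T, D) sequence of feature vectors x_1 .. x_T
    path      : state indices theta_1 .. theta_T from the best DP alignment
    log_prior : log P(S_j), the prior / language-model score of string j
    log_a     : (S, S) matrix of log transition probabilities
    log_b     : callable, log_b(state, x_t) -> log emission density
    """
    score = log_prior
    for t in range(len(path)):
        if t > 0:
            score += log_a[path[t - 1], path[t]]   # log a_{theta_{t-1} theta_t}
        score += log_b(path[t], x[t])              # log b_{theta_t}(x_t)
    return score
```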
String-level HMM discriminant function
[Figure: string-level discriminant function for a speech model. A shifting time window over the speech frames y_t yields feature vectors x_t = f(y_t, Φ); a DP procedure over the entire string hypothesizes a segmentation (here "m a e d a") and determines the state sequence θ_t; the string-level score accumulates terms such as log b_{a2}(x_8) and log b_{θ_t}(x_t) into g_j(x_1^T, Λ).]
MCE loss function
Reflects classification success/failure:
$$\ell(d_k(x_1^T, \Lambda)) = \frac{1}{1 + e^{-\alpha d_k(x_1^T, \Lambda)}}$$
[Figure: sigmoid MCE loss plotted over d_k from −10 (−g, correct) to +10 (+g, best incorrect), rising from 0 to 1.]
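Since the whole point of the sigmoid is differentiability, it may help to spell out the gradient chain the slide relies on (a standard sigmoid derivative, written in the slide's notation):

$$\frac{\partial \ell}{\partial \Lambda} = \frac{\partial \ell}{\partial d_k}\,\frac{\partial d_k}{\partial \Lambda}, \qquad \frac{\partial \ell}{\partial d_k} = \alpha\,\ell(d_k)\,\bigl(1 - \ell(d_k)\bigr)$$

so tokens far from the decision boundary (|d_k| large) contribute almost nothing, while tokens near the boundary drive the update.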
Overall loss and optimization
A practical definition of the overall loss is the Average Empirical Cost:
$$L(\Lambda) = \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} \ell_k(x_{ik}, \Lambda)$$
E.g., Quickprop (Fahlman, 1988) can be seen as a modified Newton's method:
- use a second-order Taylor expansion to model L(Λ):
$$L(\Lambda + s) \approx M(\Lambda + s) = L(\Lambda) + \nabla L(\Lambda)^{t} s + \frac{1}{2}\, s^{t}\, \nabla^2 L(\Lambda)\, s$$
- calculate the step size that moves to the minimum of the model
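A minimal Quickprop-style step, assuming per-parameter gradients from two successive epochs are available (this uses Fahlman's secant approximation to the curvature; variable names and the growth cap are illustrative):

```python
import numpy as np

def quickprop_step(grad, prev_grad, prev_step, default_lr=0.01, max_growth=1.75):
    """One Quickprop-style parameter update (sketch).

    Approximates the curvature of L(Lambda) along each dimension from the
    change in gradient over the previous step, then jumps toward the minimum
    of the resulting one-dimensional parabola.
    """
    step = np.empty_like(grad)
    for i in range(len(grad)):
        denom = prev_grad[i] - grad[i]
        if prev_step[i] != 0.0 and denom != 0.0:
            s = grad[i] * prev_step[i] / denom       # parabola (secant) step
            limit = abs(prev_step[i]) * max_growth   # cap growth of the step
            step[i] = np.clip(s, -limit, limit)
        else:
            step[i] = -default_lr * grad[i]          # fall back to gradient descent
    return step
```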
Actual classification error
[Figure: empirical cost (actual classification error) plotted as a surface over two model parameters, m1 × 0.01 and m2 × 0.01 (0-100), with cost values from 0 to 1.]
Smoothed MCE loss function
[Figure: the corresponding empirical cost surface over m1 × 0.01 and m2 × 0.01 when the smoothed MCE loss is used.]
MCE & MMI
Unified framework for MCE & MMI (Schlueter, 1998):
$$F(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} f\!\left( \log \frac{p_\Lambda(X_r|S_r)^{\psi}\, P(S_r)^{\psi}}{\sum_{S \in M_r} p_\Lambda(X_r|S)^{\psi}\, P(S)^{\psi}} \right)$$
Optimization:
- Gradient-based methods
- Extended Baum-Welch algorithm; see Kanevsky, 1995
  → see Macherey et al., Eurospeech 2005 for application to MCE
MMI (= cross-entropy) fails to separate:
- Gopalakrishnan et al., ICASSP 1988: "Decoder selection based on cross-entropies"
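A sketch of the bracketed quantity for a single utterance, assuming the numerator and competitor scores are already available as log-domain numbers. With f the identity and ψ = 1 this is the MMI-style term; pushing it through a sigmoid-like f recovers the MCE view:

```python
import numpy as np

def unified_criterion_term(log_p_correct, log_p_competitors, psi=1.0, f=lambda z: z):
    """One term of the Schlueter-style unified criterion (sketch).

    log_p_correct     : log [ p_Lambda(X_r|S_r) P(S_r) ] for the reference S_r
    log_p_competitors : log [ p_Lambda(X_r|S) P(S) ] for each S in the set M_r
    """
    num = psi * log_p_correct
    den_terms = psi * np.asarray(log_p_competitors, dtype=float)
    m = den_terms.max()                              # stable log-sum-exp
    log_den = m + np.log(np.exp(den_terms - m).sum())
    return f(num - log_den)
```

Whether M_r includes the reference string itself (MMI) or excludes it (MCE) is decided by what is passed in as the competitor set.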
Minimum Phone/Word Error
MPE, MWE (Povey, 2002):
$$F(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} \log \frac{\sum_S p_\Lambda(X_r|S)^{\psi}\, P(S)^{\psi}\, G(S, S_r)}{\sum_S p_\Lambda(X_r|S)^{\psi}\, P(S)^{\psi}}$$
cf. the unified framework for MCE & MMI (Schlueter, 1998):
$$F(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} f\!\left( \log \frac{p_\Lambda(X_r|S_r)^{\psi}\, P(S_r)^{\psi}}{\sum_{S \in M_r} p_\Lambda(X_r|S)^{\psi}\, P(S)^{\psi}} \right)$$
See Macherey et al., Eurospeech 2005 for a comparison between MCE, MPE and MMI on the Wall Street Journal task.
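One term of an MPE-style criterion can likewise be sketched as a posterior-weighted average of the raw accuracy G(S, S_r) over the hypothesis set (log-domain hypothesis scores and the accuracy values are assumed given; the outer log in the slide's formula can be applied to the returned value if desired):

```python
import numpy as np

def mpe_term(log_scores, accuracies):
    """Posterior-weighted accuracy for one utterance (MPE-style sketch).

    log_scores : log [ p_Lambda(X_r|S)^psi P(S)^psi ] over the hypotheses S
    accuracies : G(S, S_r), e.g. raw phone/word accuracy of S against S_r
    """
    log_scores = np.asarray(log_scores, dtype=float)
    w = np.exp(log_scores - log_scores.max())        # unnormalized posteriors
    posteriors = w / w.sum()
    return float(np.dot(posteriors, np.asarray(accuracies, dtype=float)))
```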
LVQ = an application of MCE!
[Figure: Bayes decision boundary vs. the actual decision boundary between reference vectors m_i and m_j, with the LVQ2 window around the boundary and discriminant distances d_i(x), d_j(x) for a pattern x.]
When a pattern x(t) is misclassified near the boundary, LVQ2 moves the wrong-class reference vector m_i away from it and the correct-class vector m_j toward it:
$$m_i(t + 1) = m_i(t) - \alpha(t)\,(x(t) - m_i(t))$$
$$m_j(t + 1) = m_j(t) + \alpha(t)\,(x(t) - m_j(t))$$
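A compact sketch of this rule, assuming Euclidean nearest-prototype classification (the window test of full LVQ2 is reduced here to a simple misclassification check):

```python
import numpy as np

def lvq2_update(x, label, prototypes, proto_labels, alpha=0.05):
    """One LVQ2-style update (sketch): push the nearest wrong-class prototype
    m_i away from x and pull the nearest correct-class prototype m_j toward it."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    correct = np.where(proto_labels == label)[0]
    wrong = np.where(proto_labels != label)[0]
    j = correct[np.argmin(dists[correct])]    # nearest correct-class prototype m_j
    i = wrong[np.argmin(dists[wrong])]        # nearest wrong-class prototype m_i
    if dists[i] < dists[j]:                   # x is currently misclassified
        prototypes[i] -= alpha * (x - prototypes[i])
        prototypes[j] += alpha * (x - prototypes[j])
    return prototypes
```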
Gradient along correct & incorrect paths
[Figure: DP segmentation of the correct word (states pau_0/1, a_0/1, d_0/1, i_0/1, a_0/1, pau_0/1) and of the recognized word (states pau_0/1, a_0/1, r_0/1, i_0/1, a_0/1, t_0/1, pau_0/1), each plotted against speech frame (0-200).]
Defining risk in a new domain
[Figure: the original pattern space with class densities P(C)p(x|C), decision regions α(x,Λ) = C_1, C_2, C_3 and the misclassification measure m = d(x,Λ) for a pattern x; and the transformed pattern space with densities P(C)p(m|C) for m_1, m_2, m_3, where d(x,Λ) > 0 corresponds to misclassification.]
Parzen estimation of risk
Estimate of classification risk: a sum of integrals (over m > 0), one for each Parzen kernel.
[Figure: the 0-1 loss as a step function of m; a kernel centered on each transformed data point c = d(x,Λ); the Parzen estimate of p(m|C) is the sum of kernels centered on the (transformed) data points.]
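A sketch of this estimate, assuming Gaussian Parzen kernels: the mass of each kernel in the error region m > 0 is just a Gaussian CDF evaluated at the kernel center over the bandwidth, which makes the per-token contribution a smooth, sigmoid-like 0-1 loss:

```python
import math

def parzen_risk(misclassification_measures, bandwidth=1.0):
    """Parzen-window estimate of classification risk (sketch).

    Each training token contributes the part of its kernel that lies in the
    error region m > 0, with the kernel centered at c = d(x, Lambda).
    """
    def gaussian_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    n = len(misclassification_measures)
    return sum(gaussian_cdf(d / bandwidth) for d in misclassification_measures) / n

# Made-up d(x, Lambda) values: negative = correctly classified.
print(parzen_risk([-3.0, -0.5, 0.2, 1.5], bandwidth=1.0))
```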
WFST-based MCE Training
Flexible transcription model
[Figure: example flexible transcription WFST over states 0-4, with filler arcs [EE]:eps, [AA]:eps, [EETO]:eps, [ANO]:eps, silence sil:eps, and name arcs such as [...]:NAOMI and [...]:SUZUKI/19.21.]
- Define the desired output as a set of strings, rather than a single string.
- Regular grammar model, represented as a WFST.
- Use for MCE training:
  - Correct string S_k → correct string set K
  - Decoder finds the best string within set K (with score and segmentation); see the sketch below.
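A toy sketch of that last point, assuming the decoder output is available as a scored N-best list (real systems do this selection by WFST composition; the list-based version below only illustrates the bookkeeping):

```python
def split_hypotheses(nbest, correct_set):
    """Pick the best hypothesis inside the correct-string set K and the best
    competitor outside it, as needed for string-level MCE (sketch).

    nbest       : list of (string, score) pairs, higher score = better
    correct_set : set K of strings accepted by the flexible transcription model
    """
    in_k = [(s, sc) for s, sc in nbest if s in correct_set]
    out_k = [(s, sc) for s, sc in nbest if s not in correct_set]
    best_correct = max(in_k, key=lambda p: p[1]) if in_k else None
    best_incorrect = max(out_k, key=lambda p: p[1]) if out_k else None
    return best_correct, best_incorrect

# Hypothetical example:
nbest = [("NAOMI SUZUKI", -19.2), ("NAOMI SUZUKE", -20.5), ("MAOMI SUZUKI", -21.0)]
K = {"NAOMI SUZUKI", "[EETO] NAOMI SUZUKI"}
print(split_hypotheses(nbest, K))
```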
Telephone-Based Name Recognition (McDermott et al., ICASSP 2000, 2005)
- Task: telephone-based, speaker-independent, open-vocabulary name recognition
- Approx. 22,000 family & given names modeled
- Database: > 35,000 training utterances (> 39 hours of audio)
- Evaluated:
  1. ML / Viterbi training vs. MCE training
  2. Use of lattice-derived WFSTs to speed up training
  3. "Strict" vs. "Flexible" transcription WFSTs
ML vs. MCE performance
[Figure: % correct (roughly 24-44) vs. number of Gaussians (up to about 3 × 10^4) for MLE / Viterbi training, MCE / Triphone Loop and MCE / LM Lattice.]
- Efficient use of parameters via MCE training
- Huge gains in system compactness
MCE for JUPITER/SUMMIT system
- Few MCE results for large, real-world tasks [McDermott, ICASSP 2000] (but MMI results for SWITCHBOARD [Woodland, 2002]).
- McDermott & Hazen, ICASSP 2004: evaluated the application of MCE to MIT's online weather information system, JUPITER, based on the SUMMIT recognition system.
- Basic finding: for a fixed real-time factor of 1.0, small models trained with MCE yielded a 20% relative reduction in word error on the in-vocabulary test set.
ML vs. MCE experiments on JUPITER
[Figure: word error rate (%) vs. computation (real-time factor, roughly 0.8-1.6) for ML-75, ML-15, MCE-75 and MCE-15 models.]
MCE training of large/small models: 41,777 Gaussian pdfs (MCE-75) vs. 15,245 Gaussian pdfs (MCE-15); comparison with the corresponding ML models.
Corpus of Spontaneous Japanese
- Task: lecture speech transcription (Kawahara, 2003)
- > 180,000 training utterances (≈ 230 hours of audio); ≈ 84,000,000 training vectors of 39 dimensions
- 10 test speeches (≈ 2 hours of audio)
- Le Roux & McDermott, Eurospeech 2005: evaluated different optimization methods (Quickprop, Rprop, BFGS, Probabilistic Descent)
- More recent work:
  - MCE training with 68K unigram WFST (no lattices)
  - MCE training with 100K trigram WFST
  - Testing using 100K trigram LM
  - Relative word error rate reduction ≈ 9-12%
Course of training - 100k words
[Figure: course of training over 15 MCE iterations, with curves for the training set and the test set; the y-axis (% word accuracy) runs from roughly 76 to 88.]
Optimization via Rprop
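Since Rprop is the optimizer used here, a minimal sketch of its sign-based update (the iRprop- variant with the usual η+ = 1.2 and η− = 0.5 factors; whether the CSJ experiments used exactly this variant is not stated on the slide):

```python
import numpy as np

def rprop_step(params, grad, prev_grad, step_sizes,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=1.0):
    """One iRprop- style update (sketch): adapt a per-parameter step size from
    the sign agreement of successive gradients, then step against the gradient."""
    sign_change = grad * prev_grad
    step_sizes = np.where(sign_change > 0,
                          np.minimum(step_sizes * eta_plus, step_max), step_sizes)
    step_sizes = np.where(sign_change < 0,
                          np.maximum(step_sizes * eta_minus, step_min), step_sizes)
    # Where the gradient changed sign, skip this update (iRprop- convention).
    effective_grad = np.where(sign_change < 0, 0.0, grad)
    params = params - np.sign(effective_grad) * step_sizes
    return params, step_sizes, effective_grad   # effective_grad is the next prev_grad
```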
Recent CSJ results - 1
- MCE training with 68k unigram
- Testing with 30k word trigram
- Evaluate the use of different HMM topologies

Word error rate (%):
  # States   # Gaussians   ML-v30k   MCE-v30k
    2000          16         23.4       22.3
    2000          32         22.4       21.0
    3000           8         24.1       22.5
    3000          16         23.1       20.8
    4000          16         22.8       20.8
Recent CSJ results - 2
Same as before, but test with a 100k word trigram

Word error rate (%):
  # States   # Gaussians   ML-v100k   MCE-v100k
    2000          16          23.0        21.1
    2000          32          21.7        20.5
    3000           8            —         21.1
    3000          16          22.4        20.5
    4000          16          22.1        20.1
Recent CSJ results - 3
- Now train with a 100k trigram LM
- Note: training and testing LMs are now matched

Word error rate (%):
  # States   # Gaussians   ML-v100k   MCE-v100k
    2000          16          23.0        20.2
    3000          16          22.4        20.5
    4000          16          22.1        20.0
MCE for TIMIT phone classification
[Figure: % phone classification error on the training set (TR) and the test set (TS) over 45 iterations, comparing Online Probabilistic Descent, Semi-batch Probabilistic Descent, Batch Probabilistic Descent, Quickprop and Rprop.]
Sensitivity to Quickprop learning rate?
[Figure: % phone classification error on the training set (TR) and the test set (TS) over 45 iterations for Quickprop learning rates LR = 0.025 to 2.5.]
Summary
- MCE incorporates classification performance itself into a differentiable functional form.
- By directly attacking the problem of interest, parameters are used efficiently.
- Large gains in performance and model compactness on challenging speech recognition tasks:
  - Telephone-based name recognition
  - MIT JUPITER weather information
  - Corpus of Spontaneous Japanese lecture speech transcription