Discriminative training for Automatic Speech Recognition using the Minimum Classification Error framework
Erik McDermott
NTT Communication Science Labs, NTT Corporation
[email protected]

Motivation & Overview
Need for discriminative training: overcome modeling limitations; focus on directly improving recognition performance.
Overview:
MCE fundamentals
Smoothed error rate → parallel with large margin training
MCE vs. Maximum Likelihood results for large-scale speech recognition tasks


MCE training for generic models
A training pattern x from category k is fed to the recognition system (parameters: Λ), which produces a score for every category: score(x, Λ, category 1), ..., score(x, Λ, category i), ..., score(x, Λ, category M).
The decision is the best-scoring category; the raw error is 0 if the best category = k, 1 otherwise.
MCE smooths this:
d_k(x, Λ) = best incorrect score − correct score
loss = sigmoid(d_k(x, Λ))
Use d loss / d Λ to define ∆Λ and update: new Λ = Λ + ∆Λ.
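To make the flow concrete, here is a minimal sketch (not from the slides) of one such update for a toy linear scorer g_i(x, Λ) = Λ_i · x; the scorer itself, the learning rate eta and the smoothing constant alpha are illustrative assumptions:

```python
import numpy as np

def mce_step(x, k, Lam, alpha=1.0, eta=0.1):
    """One MCE update for a toy linear scorer g_i(x, Lam) = Lam[i] . x."""
    scores = Lam @ x                          # score(x, Lam, category i) for all i
    rivals = scores.copy()
    rivals[k] = -np.inf
    j = int(np.argmax(rivals))                # best incorrect category
    d = scores[j] - scores[k]                 # d_k(x, Lam): best incorrect - correct
    loss = 1.0 / (1.0 + np.exp(-alpha * d))   # smoothed 0/1 error
    dloss_dd = alpha * loss * (1.0 - loss)    # chain rule through the sigmoid
    Lam = Lam.copy()
    Lam[k] += eta * dloss_dd * x              # raise the correct category's score
    Lam[j] -= eta * dloss_dd * x              # lower the best incorrect score
    return Lam, loss

# Example: 3 categories, 4-dimensional patterns
rng = np.random.default_rng(0)
Lam = rng.normal(size=(3, 4))
x, k = rng.normal(size=4), 1
Lam, loss = mce_step(x, k, Lam)
```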

Searching for the Bayes classifier
Bayes decision rule: decide C_i if P(C_i|x) > P(C_j|x) for all j ≠ i.
In principle, the same optimal error can be attained using discriminant functions: decide C_i if g_i(x, Λ) > g_j(x, Λ) for all j ≠ i.


Maximum Likelihood fails to separate!
[Figure: estimated pdfs for two uniformly distributed categories; joint probability density (0 to 0.25) plotted over the observation space (0 to 15).]

MCE Misclassification Measure
The misclassification measure compares the correct and best incorrect categories:
$$ d_k(x_1^T, \Lambda) = -g_k(x_1^T, \Lambda) + \max_{j \neq k}\, g_j(x_1^T, \Lambda) $$
d_k(x_1^T, Λ) < 0 → correct classification; d_k(x_1^T, Λ) ≥ 0 → incorrect classification.
This is a special case of the continuous definition (Chou, 1992):
$$ d_k(x_1^T, \Lambda) = -g_k(x_1^T, \Lambda) + \log \left[ \frac{1}{M-1} \sum_{j \neq k} e^{g_j(x_1^T, \Lambda)\,\psi} \right]^{1/\psi} $$
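A small sketch of both variants of the measure, assuming the per-category discriminant scores g_j have already been computed; the log-sum-exp arrangement is only for numerical stability:

```python
import numpy as np

def misclassification_measure(g, k, psi=None):
    """d_k = -g_k + (smoothed) max over the incorrect-category scores g_j, j != k."""
    g = np.asarray(g, dtype=float)
    rivals = np.delete(g, k)
    if psi is None:                                  # hard max: the basic definition
        anti = rivals.max()
    else:                                            # continuous definition (Chou, 1992)
        a = psi * rivals - np.log(len(rivals))       # the 1/(M-1) weighting
        m = a.max()
        anti = (m + np.log(np.exp(a - m).sum())) / psi
    return anti - g[k]

# As psi grows, the smoothed version approaches the hard max:
g = [2.0, 5.0, 4.5]
print(misclassification_measure(g, k=1), misclassification_measure(g, k=1, psi=10.0))
```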


Discriminant function for HMMs
Defined using the best Viterbi path Θ_j:
$$ g_j(x_1^T, \Lambda) = \log P(S_j) + \sum_{t=1}^{T} \log a_{\theta_{t-1}^j \theta_t^j} + \sum_{t=1}^{T} \log b_{\theta_t^j}(x_t) $$
Input: sequence of feature vectors, x_1^T = (x_1, ..., x_T).
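A minimal sketch of accumulating this score along a fixed path, assuming the Viterbi state sequence, the log transition matrix and the per-frame log emission scores are already available (their computation, and the entry transition, are glossed over here):

```python
import numpy as np

def hmm_path_score(log_prior, log_A, log_B, path):
    """g_j(x_1^T, Lambda): log P(S_j) plus transition and emission log terms
    accumulated along the best path theta_1..theta_T of hypothesis j.

    log_prior : scalar, log P(S_j)
    log_A     : (S, S) array of log a[s, s'] transition log-probabilities
    log_B     : (T, S) array of log b_s(x_t) frame emission log-likelihoods
    path      : length-T sequence of states theta_t (e.g. from Viterbi DP);
                the t = 1 transition from the entry state is folded into log_prior
    """
    score = log_prior + log_B[0, path[0]]
    for t in range(1, len(path)):
        score += log_A[path[t - 1], path[t]] + log_B[t, path[t]]
    return score
```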


String-level HMM discriminant function
[Figure: feature vectors x_t = f(y_t, Φ) are calculated from the speech samples y_t with a shifting time window; the state sequence θ_t for the hypothesized string (DP-hypothesized segmentation of the phones m-a-e-d-a) is determined by a DP procedure over the entire string, and the frame terms sum into the string-level discriminant function: ... + log b_{a_2}(x_8) + ... + log b_{θ_t}(x_t) + ... = g_j(x_1^T, Λ).]


MCE loss function
Reflects classification success/failure:
$$ \ell(d_k(x_1^T, \Lambda)) = \frac{1}{1 + e^{-\alpha\, d_k(x_1^T, \Lambda)}} $$
[Figure: the sigmoid MCE loss rising from 0 to 1 as d_k goes from about −10 to 10; to the left the correct score dominates (−g, correct), to the right the best incorrect score dominates (+g, best incorrect).]
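Since the loss is a sigmoid of the misclassification measure, its gradient has the standard closed form (stated here for completeness; it is what shapes ∆Λ):

$$ \frac{\partial \ell}{\partial d_k} = \alpha\, \ell(d_k)\,\bigl(1 - \ell(d_k)\bigr), $$

so tokens far from the decision boundary (large |d_k|) contribute almost nothing, and training effort concentrates on near-misses and near-errors.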


Overall loss and optimization
A practical definition of overall loss is the Average Empirical Cost:
$$ L(\Lambda) = \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} \ell_k(x_{ik}, \Lambda) $$
E.g., Quickprop (Fahlman, 1988) can be seen as a modified Newton's method: use a second-order Taylor expansion to model L(Λ),
$$ L(\Lambda + s) \approx M(\Lambda + s) = L(\Lambda) + \nabla L(\Lambda)^t s + \frac{1}{2}\, s^t \nabla^2 L(\Lambda)\, s, $$
then calculate the step size that moves to the minimum of the model.
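A rough per-parameter sketch of the Quickprop idea, assuming the gradients from the current and previous epochs are available; the fallback learning rate and growth cap are illustrative choices, not values from the slides:

```python
import numpy as np

def quickprop_step(grad, prev_grad, prev_step, fallback_lr=0.01, max_growth=1.75):
    """Per-parameter Quickprop-style step: fit a parabola to the last two
    gradients (a secant estimate of curvature) and jump toward its minimum."""
    grad, prev_grad, prev_step = (np.asarray(a, dtype=float)
                                  for a in (grad, prev_grad, prev_step))
    denom = prev_grad - grad
    usable = (np.abs(denom) > 1e-12) & (np.abs(prev_step) > 0)
    newton = np.divide(grad, np.where(usable, denom, 1.0)) * prev_step
    step = np.where(usable, newton, -fallback_lr * grad)   # plain gradient step otherwise
    cap = max_growth * np.abs(prev_step)                   # limit growth vs. previous step
    return np.where(cap > 0, np.clip(step, -cap, cap), step)
```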

Actual classification error
[Figure: the raw empirical cost (0 to 1) plotted as a surface over two model parameters, m1 × 0.01 and m2 × 0.01 (each 0 to 100).]


Smoothed MCE loss function
[Figure: the smoothed MCE empirical cost (0 to 1) plotted over the same two model parameters, m1 × 0.01 and m2 × 0.01.]


MCE & MMI
Unified framework for MCE & MMI (Schlueter, 1998):
$$ F(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} f\!\left( \log \frac{p_\Lambda(X_r|S_r)^\psi\, P(S_r)^\psi}{\sum_{S \in M_r} p_\Lambda(X_r|S)^\psi\, P(S)^\psi} \right) $$
Optimization: gradient-based methods; Extended Baum-Welch algorithm (see Kanevsky, 1995) → see Macherey et al., Eurospeech 2005 for application to MCE.
MMI (= cross-entropy) fails to separate: Gopalakrishnan et al., ICASSP 1988, "Decoder selection based on cross-entropies".
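A sketch of one utterance's contribution to this unified criterion, assuming the joint log scores log[p_Λ(X_r|S) P(S)] for the reference and for the competitor set M_r are already available; the smoothing function f (identity for an MMI-like criterion, a sigmoid for an MCE-like one) follows the slide, everything else is illustrative:

```python
import numpy as np

def unified_term(log_ref, log_competitors, psi=1.0, f=lambda z: z):
    """f( log[ p(X_r|S_r)^psi P(S_r)^psi / sum_{S in M_r} p(X_r|S)^psi P(S)^psi ] ).

    log_ref         : log [ p_Lambda(X_r|S_r) P(S_r) ] for the reference string S_r
    log_competitors : log [ p_Lambda(X_r|S) P(S) ] for the strings in M_r
                      (whether S_r itself belongs to M_r depends on the criterion;
                      here it is included, MMI-style)
    """
    a = psi * np.concatenate(([log_ref], np.asarray(log_competitors, dtype=float)))
    m = a.max()
    log_denominator = m + np.log(np.exp(a - m).sum())   # stable log-sum-exp
    return f(psi * log_ref - log_denominator)
```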

Minimum Phone/Word Error
MPE, MWE (Povey, 2002):
$$ F(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} \log \frac{\sum_{S} p_\Lambda(X_r|S)^\psi\, P(S)^\psi\, G(S, S_r)}{\sum_{S} p_\Lambda(X_r|S)^\psi\, P(S)^\psi} $$
cf. the unified framework for MCE & MMI (Schlueter, 1998):
$$ F(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} f\!\left( \log \frac{p_\Lambda(X_r|S_r)^\psi\, P(S_r)^\psi}{\sum_{S \in M_r} p_\Lambda(X_r|S)^\psi\, P(S)^\psi} \right) $$
See Macherey et al., Eurospeech 2005 for a comparison between MCE, MPE and MMI on the Wall Street Journal task.

LVQ = an application of MCE!
[Figure: reference vectors m_i and m_j on either side of the actual decision boundary, near the Bayes decision boundary; a sample x with distances d_i(x), d_j(x) falling inside the LVQ2 window triggers the update.]
m_i(t+1) = m_i(t) − α(t)(x(t) − m_i(t))
m_j(t+1) = m_j(t) + α(t)(x(t) − m_j(t))
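A minimal sketch of these LVQ2-style updates, using the usual symmetric window test around the decision boundary; the window width and learning rate are illustrative assumptions:

```python
import numpy as np

def lvq2_update(x, m_i, m_j, alpha=0.05, window=0.3):
    """m_i: nearest wrong-class reference vector; m_j: nearest correct-class one.
    The slide's update rules are applied only when x falls inside the LVQ2 window."""
    d_i = np.linalg.norm(x - m_i)
    d_j = np.linalg.norm(x - m_j)
    if min(d_i / d_j, d_j / d_i) > (1 - window) / (1 + window):
        m_i = m_i - alpha * (x - m_i)     # push the wrong-class reference away
        m_j = m_j + alpha * (x - m_j)     # pull the correct-class reference closer
    return m_i, m_j
```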

Gradient along correct & incorrect paths
[Figure: DP segmentation of the correct word (HMM states pau_0/1, a_0/1, d_0/1, i_0/1, a_0/1, pau_0/1) and of the recognized word (pau, a, r, i, a, t, pau states) plotted against the speech frame index (0 to about 200).]


Defining risk in a new domain
[Figure: in the original pattern space, class densities P(C)p(x|C) for C_1, C_2, C_3 and the corresponding decision regions α(x,Λ) = C_1, C_2, C_3; the misclassification measure m = d(x,Λ) maps each sample into a transformed pattern space with class-conditional densities P(C)p(m|C) (m_1, m_2, m_3), where d(x,Λ) > 0 corresponds to classification errors.]


Parzen estimation of risk
Parzen estimate of p(m|C): a sum of kernels centered on the (transformed) data points c = d(x,Λ).
Estimate of classification risk: the sum, over kernels, of each kernel's integral over m > 0 (the region where the 0-1 loss equals 1).
[Figure: kernels centered on transformed data points along the m = d(x,Λ) axis; the mass of each kernel beyond m = 0 contributes to the risk estimate.]
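A small sketch of this estimate, assuming the transformed data points c = d(x, Λ) have already been computed and using a logistic kernel, whose mass in m > 0 is exactly a sigmoid (so the estimate coincides with the averaged smoothed MCE loss):

```python
import numpy as np

def parzen_risk(d_values, h=1.0):
    """Parzen estimate of classification risk from misclassification measures.

    Each token contributes the mass of its kernel lying in m > 0; with a logistic
    kernel of bandwidth h that mass is 1 / (1 + exp(-d/h)), the MCE sigmoid loss.
    """
    d = np.asarray(d_values, dtype=float)
    return float((1.0 / (1.0 + np.exp(-d / h))).mean())
```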


WFST-based MCE Training


Flexible transcription model
[Figure: a WFST over states 0-4 with arcs such as [...]:SUZUKI/19.21, [EE]:eps, [AA]:eps, [EETO]:eps, [ANO]:eps, sil:eps and [...]:NAOMI, encoding the set of permissible transcriptions.]
Define the desired output as a set of strings, rather than a single string.
Regular grammar model, represented as a WFST.
Use for MCE training:
Correct string S_k → correct string set K
The decoder finds the best string within the set K (with score and segmentation).

Telephone-Based Name Recognition (McDermott et al., ICASSP 2000, 2005)
Task: telephone-based, speaker-independent, open-vocabulary name recognition
Approx. 22,000 family & given names modeled
Database: > 35,000 training utterances (> 39 hours of audio)
Evaluated:
1. ML / Viterbi training vs. MCE training
2. Use of lattice-derived WFSTs to speed up training
3. "Strict" vs. "flexible" transcription WFSTs


ML vs. MCE performance
[Figure: % Correct (24 to 44) as a function of the number of Gaussians (0 to 30,000) for MLE / Viterbi training, MCE / Triphone Loop, and MCE / LM Lattice.]
Efficient use of parameters via MCE training
Huge gains in system compactness

MCE for JUPITER/SUMMIT system
Few MCE results exist for large, real-world tasks [McDermott, ICASSP 2000] (though there are MMI results for SWITCHBOARD [Woodland, 2002]).
McDermott & Hazen, ICASSP 2004: evaluated the application of MCE to MIT's online weather information system, JUPITER, based on the SUMMIT recognition system.
Basic finding: for a fixed real-time factor of 1.0, small models trained with MCE yielded a 20% relative reduction in word error on the in-vocabulary test set.


ML vs. MCE experiments on JUPITER
[Figure: Word Error Rate (%) (12 to 20) vs. computation (real-time factor, 0.8 to 1.6) for the ML-75, ML-15, MCE-75 and MCE-15 models.]
MCE training of large/small models: 41,777 Gaussian pdfs (MCE-75) vs. 15,245 Gaussian pdfs (MCE-15); comparison with the corresponding ML models.

Corpus of Spontaneous Japanese
Task: lecture speech transcription (Kawahara, 2003)
> 180,000 training utterances (≈ 230 hours of audio)
≈ 84,000,000 training vectors of 39 dimensions
10 test speeches (≈ 2 hours of audio)
Le Roux & McDermott, Eurospeech 2005: evaluated different optimization methods (Quickprop, Rprop, BFGS, Probabilistic Descent)
More recent work:
MCE training with 68K unigram WFST (no lattices)
MCE training with 100K trigram WFST
Testing using 100K trigram LM
Relative word error rate reduction ≈ 9-12 %

Course of training - 100k words
[Figure: word accuracy (%) on the training set and test set (roughly 76 to 88) over 15 MCE iterations.]
Optimization via Rprop

Recent CSJ results - 1
MCE training with 68k unigram; testing with 30k word trigram
Evaluate use of different HMM topologies
Word error rates (%):

# States   # Gaussians   ML-v30k   MCE-v30k
2000       16            23.4      22.3
2000       32            22.4      21.0
3000        8            24.1      22.5
3000       16            23.1      20.8
4000       16            22.8      20.8


Recent CSJ results - 2
Same as before, but testing with a 100k word trigram
Word error rates (%):

# States   # Gaussians   ML-v100k   MCE-v100k
2000       16            23.0       21.1
2000       32            21.7       20.5
3000        8            21.1*
3000       16            22.4       20.5
4000       16            22.1       20.1

* Only one of the two values for this row survives in the source.


Recent CSJ results - 3
Now train with the 100k trigram LM
Note: training and testing LMs are now matched
Word error rates (%):

# States   # Gaussians   ML-v100k   MCE-v100k
2000       16            23.0       20.2
3000       16            22.4       20.5
4000       16            22.1       20.0


MCE for TIMIT phone classification
[Figure: phone classification error (%) on the training set (left) and test set (right) over 45 iterations, comparing Online Probabilistic Descent, Semi-batch Probabilistic Descent, Batch Probabilistic Descent, Quickprop and Rprop.]


Sensitivity to Quickprop learning rate?
[Figure: phone classification error (%) on the training set (top) and test set (bottom) over 45 iterations for Quickprop learning rates LR = 2.5, 1.0, 0.5, 0.25, 0.1, 0.05 and 0.025.]

Summary
MCE incorporates classification performance itself into a differentiable functional form.
By directly attacking the problem of interest, parameters are used efficiently.
Large gains in performance and model compactness on challenging speech recognition tasks:
Telephone-based name recognition
MIT JUPITER weather information
Corpus of Spontaneous Japanese lecture speech transcription
