Deep Neural SVM for Speech Recognition


Shi-Xiong Zhang, Chaojun Liu, Kaisheng Yao and Yifan Gong
Microsoft, Redmond

ICASSP, April 21, 2015

Outline
• Motivation
  – Neural Networks
  – Support Vector Machines
• Deep Neural SVM
  – Max-Margin Training (frame-level & sequence-level)
  – Decoding
• Experiments

Motivation


Support Vector Machines (binary / multiclass / sequence)

SVMs generate linear boundaries in a feature space $\boldsymbol{x} \mapsto \boldsymbol{\phi}(\boldsymbol{x})$:

$$y = \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\boldsymbol{x}) = \sum_i \alpha_i\, k(\boldsymbol{x}, \boldsymbol{x}_i)$$
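A minimal numpy sketch of this decision function (the RBF kernel, `gamma`, and `bias` are illustrative choices, not from the slides; the `alphas` are assumed to already absorb the labels and training-time multipliers):

```python
import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    """k(x, x_i) = exp(-gamma * ||x - x_i||^2), one common kernel choice."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def svm_decision(x, support_vectors, alphas, bias=0.0):
    """y = sum_i alpha_i k(x, x_i) + b; only the support vectors contribute."""
    return sum(a * rbf_kernel(x, xi) for a, xi in zip(alphas, support_vectors)) + bias
```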

Support Vector Machines vs. Neural Networks

[Figure: an SVM computing $y = \sum_i \alpha_i\, k(\boldsymbol{x}, \boldsymbol{x}_i)$ from kernel units $k(\boldsymbol{x}, \boldsymbol{x}_i)$, beside a neural network computing $y$ from hidden units $\boldsymbol{h}(\boldsymbol{x})$]

• Neural Networks: too many local minima; tend to overfit; fixed model size. But: deep!
• SVMs: convex training; max margin; automatic model size (the support vectors). But: deep?

NN ⇒ Deep NN
SVM ⇒ Deep SVM
Combining both ⇒ Deep Neural SVM

Softmax layer of DNN

Output of the softmax:

$$P(s_t \mid \boldsymbol{o}_t) = \frac{\exp(\mathbf{w}_{s_t}^{\mathsf T}\boldsymbol{h}_t)}{\sum_{s'=1}^{N}\exp(\mathbf{w}_{s'}^{\mathsf T}\boldsymbol{h}_t)}$$

The normalization term is independent of the states.

Softmax in DNN vs. classification in SVM

A DNN classifies a frame by picking the state with the largest linear score, $\arg\max_j \mathbf{w}_j^{\mathsf T}\boldsymbol{h}_t$ (the shared normalizer does not change the winner), exactly as a multiclass SVM picks $\arg\max_j \mathbf{w}_j^{\mathsf T}\boldsymbol{\phi}(\boldsymbol{o}_t)$: the top hidden layer $\boldsymbol{h}_t$ plays the role of the SVM feature space $\boldsymbol{\phi}(\boldsymbol{o}_t)$.
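A small numpy check of that equivalence (the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(183, 512))   # one weight vector per state (e.g. 61 phones x 3 states)
h_t = rng.normal(size=512)        # top hidden-layer activations for frame t

scores = W @ h_t                                   # w_j^T h_t for every state j
posterior = np.exp(scores) / np.exp(scores).sum()  # softmax output

# The normalizer is shared by all states, so the decision is unchanged:
assert scores.argmax() == posterior.argmax()
```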

The architecture of Deep Neural SVMs

[Figure: a DNN whose top hidden layer feeds, per frame, a multiclass SVM; unrolled over the time sequence, the same scores feed a sequence SVM]

Frame-level Max-Margin Training

[Figure: the DNN's top hidden layer $\boldsymbol{h}_t$ feeding a multiclass SVM, one frame at a time]

1st Step: training the last layer ≡ Multiclass SVM

• Maximize the margin subject to: the score of the ground-truth state $s_t$ ≥ the score of every competing state $\bar{s}_t$.
• With the hidden activations $\boldsymbol{h}_t$ held fixed, this is exactly a multiclass SVM trained on the features $\boldsymbol{h}_t$.
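As a concrete sketch, the per-frame objective (written out on the complementary slides) can be evaluated directly in numpy; the optimizer and bias handling are omitted, and the array shapes are assumptions:

```python
import numpy as np

def frame_objective(W, H, labels, C=1.0):
    """F = 0.5 ||W||^2 + C sum_t [1 + max_{s != s_t} w_s^T h_t - w_{s_t}^T h_t]_+^2
    W: (num_states, dim) last-layer weights, H: (T, dim) hidden
    activations, labels: (T,) ground-truth states."""
    scores = H @ W.T                         # (T, num_states)
    t_idx = np.arange(len(labels))
    true_scores = scores[t_idx, labels]      # w_{s_t}^T h_t (copy)
    scores[t_idx, labels] = -np.inf          # mask the true state
    slack = np.maximum(0.0, 1.0 + scores.max(axis=1) - true_scores)
    return 0.5 * np.sum(W ** 2) + C * np.sum(slack ** 2)
```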

2nd Step: training the previous layers

$$\frac{\partial \mathcal{F}}{\partial n} = \left(\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t}\right)^{\mathsf T} \frac{\partial \boldsymbol{h}_t}{\partial n}$$

• Important: the factor $\partial \boldsymbol{h}_t / \partial n$ (for any parameter $n$ below the top layer) is the same as in standard DNN back-propagation.
• Only the support vectors have a gradient: frames that already satisfy the margin contribute $\partial \mathcal{F} / \partial \boldsymbol{h}_t = 0$.
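A sketch of $\partial\mathcal{F}/\partial\boldsymbol{h}_t$ for a single frame, using the explicit form given on the complementary slides (variable names are illustrative):

```python
import numpy as np

def grad_wrt_hidden(W, h_t, s_t, C=1.0):
    """dF/dh_t = 2C [1 + w_sbar^T h_t - w_s^T h_t]_+ (w_sbar - w_s),
    where sbar is the most competitive wrong state for this frame."""
    scores = W @ h_t
    comp = scores.copy()
    comp[s_t] = -np.inf
    s_bar = int(comp.argmax())
    slack = max(0.0, 1.0 + scores[s_bar] - scores[s_t])
    # Frames satisfying the margin (slack == 0) get a zero gradient:
    # only the support vectors drive back-propagation below h_t.
    return 2.0 * C * slack * (W[s_bar] - W[s_t])
```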

Sequence-level Max-Margin Training

[Figure: the DNN's per-frame scores, unrolled over the time sequence, feeding a sequence SVM]

1st Step: training the last layer

• Maximize the margin in $\log P(\mathbf{S}_{1:T} \mid \mathbf{O}_{1:T})$ between the reference state sequence $\mathbf{S}$ and every competing state sequence $\bar{\mathbf{S}}$:

$$\mathcal{F} = \min_{\bar{\mathbf{S}}} \log\frac{P(\mathbf{S}\mid\mathbf{O})}{P(\bar{\mathbf{S}}\mid\mathbf{O})} = \min_{\bar{\mathbf{S}}} \log\frac{P(\mathbf{O}\mid\mathbf{S})\,P(\mathbf{S})}{P(\mathbf{O}\mid\bar{\mathbf{S}})\,P(\bar{\mathbf{S}})}$$

• Up to terms that are independent of the state sequence, each log path score decomposes over frames:

$$\log P(\mathbf{O}\mid\mathbf{S})\,P(\mathbf{S}) \;\propto\; \sum_{t=1}^{T}\Big(\mathbf{w}_{s_t}^{\mathsf T}\boldsymbol{h}_t - \log P(s_t) + \log P(s_t \mid s_{t-1})\Big)$$
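A direct transcription of that decomposition (the array shapes and the handling of the first frame's transition are assumptions):

```python
import numpy as np

def path_score(W, H, states, log_prior, log_trans):
    """sum_t ( w_{s_t}^T h_t - log P(s_t) + log P(s_t | s_{t-1}) ).
    W: (N, dim) last-layer weights, H: (T, dim) hidden activations,
    states: (T,) one state path, log_prior: (N,),
    log_trans[i, j] = log P(s_t = j | s_{t-1} = i)."""
    score = 0.0
    prev = None
    for t, s in enumerate(states):
        score += W[s] @ H[t] - log_prior[s]
        if prev is not None:
            score += log_trans[prev, s]
        prev = s
    return score
```

The margin objective compares this score for the reference path against the best competing path.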

1st Step: training the last layer ≡ Struct SVM

[Figure: a trellis of states over time; node $j$ at time $t$ carries the score $\mathbf{w}_j^{\mathsf T}\boldsymbol{h}_t$ and arc $(i, j)$ the score $w_{ij}\log P(s_t \mid s_{t-1})$]

• Each path through the trellis defines a point in a joint feature space $\boldsymbol{\phi}(\mathbf{O}, \mathbf{S})$, so maximizing the sequence margin is exactly a structured SVM.
• Weighting the transition scores with learnable $w_{ij}$ lets the transition/language model be trained jointly with the acoustic model.

2nd Step: training the previous layers

$$\frac{\partial \mathcal{F}}{\partial n} = \left(\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t}\right)^{\mathsf T} \frac{\partial \boldsymbol{h}_t}{\partial n}, \qquad \mathcal{F} = \min_{\bar{\mathbf{S}}} \log\frac{P(\mathbf{S}\mid\mathbf{O})}{P(\bar{\mathbf{S}}\mid\mathbf{O})}$$

• Important: $\partial \boldsymbol{h}_t / \partial n$ is the same as in standard DNN back-propagation.
• Only the support vectors (sequences that violate the margin) have a gradient.

Decoding


Decoding of Deep Neural SVMs

[Figure: search over the trellis; node $j$ at time $t$ scores $\mathbf{w}_j^{\mathsf T}\boldsymbol{h}_t$ and arc $(i, j)$ scores $w_{ij}\log a_{ij}$]

Decoding finds the best state path under the learned state scores $\mathbf{w}_j^{\mathsf T}\boldsymbol{h}_t$ and the weighted transition scores $w_{ij}\log a_{ij}$, i.e. a Viterbi search over the trellis.
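A compact Viterbi sketch over this trellis; for simplicity a single scalar `trans_weight` stands in for the learned per-arc weights $w_{ij}$:

```python
import numpy as np

def viterbi(W, H, log_trans, trans_weight=1.0):
    """Best state path under node scores w_j^T h_t and weighted
    transition scores. W: (N, dim), H: (T, dim), log_trans: (N, N)."""
    node = H @ W.T                       # (T, N) per-frame state scores
    T, N = node.shape
    delta = node[0].copy()               # best score ending in each state
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans_weight * log_trans  # (N, N): i -> j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + node[t]
    path = [int(delta.argmax())]         # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```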

Experiments


TIMIT: continuous phone recognition (3 states for each of the 61 monophones). Error rates:

GMM (ML): 31.0%
DNN (CE): 22.9%
DNSVM, frame-level max margin: 22.0%
DNSVM, sequence-level max margin: 21.0%

An 8% relative improvement over the DNN.

DNSVM (frame-level): last layer only 22.03%; with previous layers 21.95%.
• Most of the gains come from the last layer.

DNSVM (sequence-level): acoustic model only 21.38%; jointly learned with the LM 21.04%.
• Joint learning with the LM yields a 1.6% relative improvement.

Conclusion

Future work: Deep Neural SVM.

Complementary Slides

Support Vector Machines (binary / multiclass / sequence)

Kernel function: $k(\boldsymbol{x}, \boldsymbol{x}_i) = \boldsymbol{\phi}(\boldsymbol{x})^{\mathsf T}\boldsymbol{\phi}(\boldsymbol{x}_i)$

$$y = \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\boldsymbol{x}) = \sum_i \alpha_i\, k(\boldsymbol{x}, \boldsymbol{x}_i)$$

The architecture of Deep Neural SVM

[Figure: a DNN feeding per-frame state scores into a sequence-SVM trellis of states over time]

Frame-level max-margin training (previous layers)

$$\mathcal{F} = \frac{1}{2}\lVert\mathbf{w}\rVert_2^2 + C\sum_{t=1}^{T}\Big[1 + \max_{\bar{s}_t \neq s_t} \mathbf{w}_{\bar{s}_t}^{\mathsf T}\boldsymbol{h}_t - \mathbf{w}_{s_t}^{\mathsf T}\boldsymbol{h}_t\Big]_+^2$$

$$\frac{\partial \mathcal{F}}{\partial n} = \left(\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t}\right)^{\mathsf T} \frac{\partial \boldsymbol{h}_t}{\partial n}$$

• The key new quantity is $\partial \mathcal{F} / \partial \boldsymbol{h}_t$; the factor $\partial \boldsymbol{h}_t / \partial n$ is the same as in standard DNN back-propagation.
• Only the support vectors have a gradient.

2nd Step: training the previous layers

For each frame, the gradient with respect to the hidden activations is

$$\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t} = 2C\,\Big[1 + \mathbf{w}_{\bar{s}_t}^{\mathsf T}\boldsymbol{h}_t - \mathbf{w}_{s_t}^{\mathsf T}\boldsymbol{h}_t\Big]_+ \big(\mathbf{w}_{\bar{s}_t} - \mathbf{w}_{s_t}\big)$$

• $\partial \boldsymbol{h}_t / \partial n$ is the same as in standard DNN back-propagation.
• Only the support vectors have a gradient.
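A quick finite-difference check of this gradient for a single frame (the regularizer is omitted; the shapes and seed are arbitrary):

```python
import numpy as np

def frame_term(W, h, s, C=1.0):
    """C [1 + max_{sbar != s} w_sbar^T h - w_s^T h]_+^2 for one frame."""
    scores = W @ h
    comp = np.delete(scores, s)
    return C * max(0.0, 1.0 + comp.max() - scores[s]) ** 2

def frame_grad(W, h, s, C=1.0):
    """The analytic gradient w.r.t. h from the slide above."""
    scores = W @ h
    comp = scores.copy()
    comp[s] = -np.inf
    s_bar = int(comp.argmax())
    slack = max(0.0, 1.0 + scores[s_bar] - scores[s])
    return 2.0 * C * slack * (W[s_bar] - W[s])

rng = np.random.default_rng(1)
W, h, s = rng.normal(size=(5, 8)), rng.normal(size=8), 2
numeric = np.array([(frame_term(W, h + e, s) - frame_term(W, h - e, s)) / 2e-6
                    for e in 1e-6 * np.eye(8)])
assert np.allclose(numeric, frame_grad(W, h, s), atol=1e-4)
```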

Sequence-level max-margin training (last layer)

• Maximize the margin subject to: the score of the reference sequence $\mathbf{S}_{1:T}$ ≥ the score of every competing sequence $\bar{\mathbf{S}}_{1:T}$.

2nd Step: training the previous layers

$$\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t} = 2C\,\big[\mathcal{L} + \mathrm{PathScore}_{\bar{\mathbf{S}}} - \mathrm{PathScore}_{\mathbf{S}}\big]_+\,\big(\mathbf{w}_{\bar{s}_t} - \mathbf{w}_{s_t}\big)$$

where $\mathcal{L}$ is the margin term and $s_t$, $\bar{s}_t$ are the states at time $t$ on the reference and competing paths.

• $\partial \boldsymbol{h}_t / \partial n$ is the same as in standard DNN back-propagation.
• Only the support vectors have a gradient.

Sequence-level max-margin training (last layer) ≡ Struct SVM

$$\mathcal{F} = \min_{\bar{\mathbf{S}} \neq \mathbf{S}} \log\frac{P(\mathbf{S}\mid\mathbf{O})}{P(\bar{\mathbf{S}}\mid\mathbf{O})}, \qquad \log P(\mathbf{O}\mid\mathbf{S})\,P(\mathbf{S}) \;\propto\; \sum_{t=1}^{T}\Big(\mathbf{w}_{s_t}^{\mathsf T}\boldsymbol{h}_t - \log P(s_t) + \log P(s_t \mid s_{t-1})\Big)$$

The path score is linear in a joint feature space:

$$\mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{O}, \mathbf{S}) = \sum_{t=1}^{T} \begin{bmatrix} \mathbf{w}_1 \\ \vdots \\ \mathbf{w}_N \\ -1 \\ +1 \end{bmatrix}^{\mathsf T} \begin{bmatrix} \delta(s_t = 1)\,\boldsymbol{h}_t \\ \vdots \\ \delta(s_t = N)\,\boldsymbol{h}_t \\ \log P(s_t) \\ \log P(s_t \mid s_{t-1}) \end{bmatrix}$$
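A sketch of that joint feature map (the shapes and the handling of the first frame's transition are assumptions):

```python
import numpy as np

def joint_features(H, states, log_prior, log_trans, N):
    """phi(O, S): per-state blocks sum_t delta(s_t = j) h_t, plus the
    accumulated log prior and log transition scores.
    H: (T, dim), states: (T,), log_prior: (N,), log_trans: (N, N)."""
    T, dim = H.shape
    phi = np.zeros(N * dim + 2)
    prev = None
    for t, s in enumerate(states):
        phi[s * dim:(s + 1) * dim] += H[t]          # delta(s_t = j) h_t block
        phi[N * dim] += log_prior[s]                # paired with weight -1
        if prev is not None:
            phi[N * dim + 1] += log_trans[prev, s]  # paired with weight +1
        prev = s
    return phi
```

With $\mathbf{w} = [\mathbf{w}_1; \dots; \mathbf{w}_N; -1; +1]$, the inner product $\mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{O}, \mathbf{S})$ reproduces the path score above, so training the last layer at the sequence level is a structured SVM over these features.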