Deep Neural SVM for Speech Recognition


Shi-Xiong Zhang, Chaojun Liu, Kaisheng Yao and Yifan Gong
Microsoft, Redmond

ICASSP, April 21, 2015

Outline
• Motivation
  – Neural Networks
  – Support Vector Machines
• Deep Neural SVM
  – Max-Margin Training (frame-level & sequence-level)
  – Decoding
• Experiments

Motivation


Support Vector Machines (binary / multiclass / sequence)

SVMs generate linear boundaries in a feature space $\boldsymbol{x} \mapsto \boldsymbol{\phi}(\boldsymbol{x})$:

$$y = \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\boldsymbol{x}) = \sum_i \alpha_i\, k(\boldsymbol{x}, \boldsymbol{x}_i)$$
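A minimal numpy sketch of this decision function (the RBF kernel, `gamma`, and `bias` are illustrative choices, not from the slides; the `alphas` are assumed to already absorb the labels and training-time multipliers):

```python
import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    """k(x, x_i) = exp(-gamma * ||x - x_i||^2), one common kernel choice."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def svm_decision(x, support_vectors, alphas, bias=0.0):
    """y = sum_i alpha_i k(x, x_i) + b; only the support vectors contribute."""
    return sum(a * rbf_kernel(x, xi) for a, xi in zip(alphas, support_vectors)) + bias
```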

Support Vector Machines vs. Neural Networks

[Figure: an SVM computing $y = \sum_i \alpha_i\, k(\boldsymbol{x}, \boldsymbol{x}_i)$ from kernel units $k(\boldsymbol{x}, \boldsymbol{x}_i)$, beside a neural network computing $y$ from hidden units $\boldsymbol{h}(\boldsymbol{x})$]

• Neural Networks: too many local minima; tend to overfit; fixed model size. But: deep!
• SVMs: convex training; max margin; automatic model size (the support vectors). But: deep?

NN ⇒ Deep NN
SVM ⇒ Deep SVM
Combining both ⇒ Deep Neural SVM

Softmax layer of DNN

Output of the softmax:

$$P(s_t \mid \boldsymbol{o}_t) = \frac{\exp(\mathbf{w}_{s_t}^{\mathsf T}\boldsymbol{h}_t)}{\sum_{s'=1}^{N}\exp(\mathbf{w}_{s'}^{\mathsf T}\boldsymbol{h}_t)}$$

The normalization term is independent of the states.

Softmax in DNN vs. classification in SVM

A DNN classifies a frame by picking the state with the largest linear score, $\arg\max_j \mathbf{w}_j^{\mathsf T}\boldsymbol{h}_t$ (the shared normalizer does not change the winner), exactly as a multiclass SVM picks $\arg\max_j \mathbf{w}_j^{\mathsf T}\boldsymbol{\phi}(\boldsymbol{o}_t)$: the top hidden layer $\boldsymbol{h}_t$ plays the role of the SVM feature space $\boldsymbol{\phi}(\boldsymbol{o}_t)$.
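A small numpy check of that equivalence (the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(183, 512))   # one weight vector per state (e.g. 61 phones x 3 states)
h_t = rng.normal(size=512)        # top hidden-layer activations for frame t

scores = W @ h_t                                   # w_j^T h_t for every state j
posterior = np.exp(scores) / np.exp(scores).sum()  # softmax output

# The normalizer is shared by all states, so the decision is unchanged:
assert scores.argmax() == posterior.argmax()
```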

The architecture of Deep Neural SVMs

[Figure: a DNN whose top hidden layer feeds, per frame, a multiclass SVM; unrolled over the time sequence, the same scores feed a sequence SVM]

Frame-level Max-Margin Training

[Figure: the DNN's top hidden layer $\boldsymbol{h}_t$ feeding a multiclass SVM, one frame at a time]

1st Step: training the last layer ≡ Multiclass SVM

• Maximize the margin subject to: the score of the ground-truth state $s_t$ ≥ the score of every competing state $\bar{s}_t$.
• With the hidden activations $\boldsymbol{h}_t$ held fixed, this is exactly a multiclass SVM trained on the features $\boldsymbol{h}_t$.
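As a concrete sketch, the per-frame objective (written out on the complementary slides) can be evaluated directly in numpy; the optimizer and bias handling are omitted, and the array shapes are assumptions:

```python
import numpy as np

def frame_objective(W, H, labels, C=1.0):
    """F = 0.5 ||W||^2 + C sum_t [1 + max_{s != s_t} w_s^T h_t - w_{s_t}^T h_t]_+^2
    W: (num_states, dim) last-layer weights, H: (T, dim) hidden
    activations, labels: (T,) ground-truth states."""
    scores = H @ W.T                         # (T, num_states)
    t_idx = np.arange(len(labels))
    true_scores = scores[t_idx, labels]      # w_{s_t}^T h_t (copy)
    scores[t_idx, labels] = -np.inf          # mask the true state
    slack = np.maximum(0.0, 1.0 + scores.max(axis=1) - true_scores)
    return 0.5 * np.sum(W ** 2) + C * np.sum(slack ** 2)
```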

2nd Step: training the previous layers

$$\frac{\partial \mathcal{F}}{\partial n} = \left(\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t}\right)^{\mathsf T} \frac{\partial \boldsymbol{h}_t}{\partial n}$$

• Important: the factor $\partial \boldsymbol{h}_t / \partial n$ (for any parameter $n$ below the top layer) is the same as in standard DNN back-propagation.
• Only the support vectors have a gradient: frames that already satisfy the margin contribute $\partial \mathcal{F} / \partial \boldsymbol{h}_t = 0$.
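A sketch of $\partial\mathcal{F}/\partial\boldsymbol{h}_t$ for a single frame, using the explicit form given on the complementary slides (variable names are illustrative):

```python
import numpy as np

def grad_wrt_hidden(W, h_t, s_t, C=1.0):
    """dF/dh_t = 2C [1 + w_sbar^T h_t - w_s^T h_t]_+ (w_sbar - w_s),
    where sbar is the most competitive wrong state for this frame."""
    scores = W @ h_t
    comp = scores.copy()
    comp[s_t] = -np.inf
    s_bar = int(comp.argmax())
    slack = max(0.0, 1.0 + scores[s_bar] - scores[s_t])
    # Frames satisfying the margin (slack == 0) get a zero gradient:
    # only the support vectors drive back-propagation below h_t.
    return 2.0 * C * slack * (W[s_bar] - W[s_t])
```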

Sequence-level Max-Margin Training

[Figure: the DNN's per-frame scores, unrolled over the time sequence, feeding a sequence SVM]

1st Step: training the last layer

• Maximize the margin in $\log P(\mathbf{S}_{1:T} \mid \mathbf{O}_{1:T})$ between the reference state sequence $\mathbf{S}$ and every competing state sequence $\bar{\mathbf{S}}$:

$$\mathcal{F} = \min_{\bar{\mathbf{S}}} \log\frac{P(\mathbf{S}\mid\mathbf{O})}{P(\bar{\mathbf{S}}\mid\mathbf{O})} = \min_{\bar{\mathbf{S}}} \log\frac{P(\mathbf{O}\mid\mathbf{S})\,P(\mathbf{S})}{P(\mathbf{O}\mid\bar{\mathbf{S}})\,P(\bar{\mathbf{S}})}$$

• Up to terms that are independent of the state sequence, each log path score decomposes over frames:

$$\log P(\mathbf{O}\mid\mathbf{S})\,P(\mathbf{S}) \;\propto\; \sum_{t=1}^{T}\Big(\mathbf{w}_{s_t}^{\mathsf T}\boldsymbol{h}_t - \log P(s_t) + \log P(s_t \mid s_{t-1})\Big)$$
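A direct transcription of that decomposition (the array shapes and the handling of the first frame's transition are assumptions):

```python
import numpy as np

def path_score(W, H, states, log_prior, log_trans):
    """sum_t ( w_{s_t}^T h_t - log P(s_t) + log P(s_t | s_{t-1}) ).
    W: (N, dim) last-layer weights, H: (T, dim) hidden activations,
    states: (T,) one state path, log_prior: (N,),
    log_trans[i, j] = log P(s_t = j | s_{t-1} = i)."""
    score = 0.0
    prev = None
    for t, s in enumerate(states):
        score += W[s] @ H[t] - log_prior[s]
        if prev is not None:
            score += log_trans[prev, s]
        prev = s
    return score
```

The margin objective compares this score for the reference path against the best competing path.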

1st Step: training the last layer ≡ Struct SVM

[Figure: a trellis of states over time; node $j$ at time $t$ carries the score $\mathbf{w}_j^{\mathsf T}\boldsymbol{h}_t$ and arc $(i, j)$ the score $w_{ij}\log P(s_t \mid s_{t-1})$]

• Each path through the trellis defines a point in a joint feature space $\boldsymbol{\phi}(\mathbf{O}, \mathbf{S})$, so maximizing the sequence margin is exactly a structured SVM.
• Weighting the transition scores with learnable $w_{ij}$ lets the transition/language model be trained jointly with the acoustic model.

2nd Step: training the previous layers

$$\frac{\partial \mathcal{F}}{\partial n} = \left(\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t}\right)^{\mathsf T} \frac{\partial \boldsymbol{h}_t}{\partial n}, \qquad \mathcal{F} = \min_{\bar{\mathbf{S}}} \log\frac{P(\mathbf{S}\mid\mathbf{O})}{P(\bar{\mathbf{S}}\mid\mathbf{O})}$$

• Important: $\partial \boldsymbol{h}_t / \partial n$ is the same as in standard DNN back-propagation.
• Only the support vectors (sequences that violate the margin) have a gradient.

Decoding


Decoding of Deep Neural SVMs

[Figure: search over the trellis; node $j$ at time $t$ scores $\mathbf{w}_j^{\mathsf T}\boldsymbol{h}_t$ and arc $(i, j)$ scores $w_{ij}\log a_{ij}$]

Decoding finds the best state path under the learned state scores $\mathbf{w}_j^{\mathsf T}\boldsymbol{h}_t$ and the weighted transition scores $w_{ij}\log a_{ij}$, i.e. a Viterbi search over the trellis.
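A compact Viterbi sketch over this trellis; for simplicity a single scalar `trans_weight` stands in for the learned per-arc weights $w_{ij}$:

```python
import numpy as np

def viterbi(W, H, log_trans, trans_weight=1.0):
    """Best state path under node scores w_j^T h_t and weighted
    transition scores. W: (N, dim), H: (T, dim), log_trans: (N, N)."""
    node = H @ W.T                       # (T, N) per-frame state scores
    T, N = node.shape
    delta = node[0].copy()               # best score ending in each state
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans_weight * log_trans  # (N, N): i -> j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + node[t]
    path = [int(delta.argmax())]         # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```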

Experiments


TIMIT: continuous phone recognition (3 states for each of the 61 monophones). Error rates:

GMM (ML): 31.0%
DNN (CE): 22.9%
DNSVM, frame-level max margin: 22.0%
DNSVM, sequence-level max margin: 21.0%

An 8% relative improvement over the DNN.

DNSVM (frame-level): last layer only 22.03%; with previous layers 21.95%.
• Most of the gains come from the last layer.

DNSVM (sequence-level): acoustic model only 21.38%; jointly learned with the LM 21.04%.
• Joint learning with the LM yields a 1.6% relative improvement.

Conclusion

Future work: Deep Neural SVM.

Complementary Slides

Support Vector Machines (binary / multiclass / sequence)

Kernel function: $k(\boldsymbol{x}, \boldsymbol{x}_i) = \boldsymbol{\phi}(\boldsymbol{x})^{\mathsf T}\boldsymbol{\phi}(\boldsymbol{x}_i)$

$$y = \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\boldsymbol{x}) = \sum_i \alpha_i\, k(\boldsymbol{x}, \boldsymbol{x}_i)$$

The architecture of Deep Neural SVM

[Figure: a DNN feeding per-frame state scores into a sequence-SVM trellis of states over time]

Frame-level max-margin training (previous layers)

$$\mathcal{F} = \frac{1}{2}\lVert\mathbf{w}\rVert_2^2 + C\sum_{t=1}^{T}\Big[1 + \max_{\bar{s}_t \neq s_t} \mathbf{w}_{\bar{s}_t}^{\mathsf T}\boldsymbol{h}_t - \mathbf{w}_{s_t}^{\mathsf T}\boldsymbol{h}_t\Big]_+^2$$

$$\frac{\partial \mathcal{F}}{\partial n} = \left(\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t}\right)^{\mathsf T} \frac{\partial \boldsymbol{h}_t}{\partial n}$$

• The key new quantity is $\partial \mathcal{F} / \partial \boldsymbol{h}_t$; the factor $\partial \boldsymbol{h}_t / \partial n$ is the same as in standard DNN back-propagation.
• Only the support vectors have a gradient.

2nd Step: training the previous layers

For each frame, the gradient with respect to the hidden activations is

$$\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t} = 2C\,\Big[1 + \mathbf{w}_{\bar{s}_t}^{\mathsf T}\boldsymbol{h}_t - \mathbf{w}_{s_t}^{\mathsf T}\boldsymbol{h}_t\Big]_+ \big(\mathbf{w}_{\bar{s}_t} - \mathbf{w}_{s_t}\big)$$

• $\partial \boldsymbol{h}_t / \partial n$ is the same as in standard DNN back-propagation.
• Only the support vectors have a gradient.
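A quick finite-difference check of this gradient for a single frame (the regularizer is omitted; the shapes and seed are arbitrary):

```python
import numpy as np

def frame_term(W, h, s, C=1.0):
    """C [1 + max_{sbar != s} w_sbar^T h - w_s^T h]_+^2 for one frame."""
    scores = W @ h
    comp = np.delete(scores, s)
    return C * max(0.0, 1.0 + comp.max() - scores[s]) ** 2

def frame_grad(W, h, s, C=1.0):
    """The analytic gradient w.r.t. h from the slide above."""
    scores = W @ h
    comp = scores.copy()
    comp[s] = -np.inf
    s_bar = int(comp.argmax())
    slack = max(0.0, 1.0 + scores[s_bar] - scores[s])
    return 2.0 * C * slack * (W[s_bar] - W[s])

rng = np.random.default_rng(1)
W, h, s = rng.normal(size=(5, 8)), rng.normal(size=8), 2
numeric = np.array([(frame_term(W, h + e, s) - frame_term(W, h - e, s)) / 2e-6
                    for e in 1e-6 * np.eye(8)])
assert np.allclose(numeric, frame_grad(W, h, s), atol=1e-4)
```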

Sequence-level max-margin training (last layer)

• Maximize the margin subject to: the score of the reference sequence $\mathbf{S}_{1:T}$ ≥ the score of every competing sequence $\bar{\mathbf{S}}_{1:T}$.

2nd Step: training the previous layers

$$\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t} = 2C\,\big[\mathcal{L} + \mathrm{PathScore}_{\bar{\mathbf{S}}} - \mathrm{PathScore}_{\mathbf{S}}\big]_+\,\big(\mathbf{w}_{\bar{s}_t} - \mathbf{w}_{s_t}\big)$$

where $\mathcal{L}$ is the margin term and $s_t$, $\bar{s}_t$ are the states at time $t$ on the reference and competing paths.

• $\partial \boldsymbol{h}_t / \partial n$ is the same as in standard DNN back-propagation.
• Only the support vectors have a gradient.

Sequence-level max-margin training (last layer) ≡ Struct SVM

$$\mathcal{F} = \min_{\bar{\mathbf{S}} \neq \mathbf{S}} \log\frac{P(\mathbf{S}\mid\mathbf{O})}{P(\bar{\mathbf{S}}\mid\mathbf{O})}, \qquad \log P(\mathbf{O}\mid\mathbf{S})\,P(\mathbf{S}) \;\propto\; \sum_{t=1}^{T}\Big(\mathbf{w}_{s_t}^{\mathsf T}\boldsymbol{h}_t - \log P(s_t) + \log P(s_t \mid s_{t-1})\Big)$$

The path score is linear in a joint feature space:

$$\mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{O}, \mathbf{S}) = \sum_{t=1}^{T} \begin{bmatrix} \mathbf{w}_1 \\ \vdots \\ \mathbf{w}_N \\ -1 \\ +1 \end{bmatrix}^{\mathsf T} \begin{bmatrix} \delta(s_t = 1)\,\boldsymbol{h}_t \\ \vdots \\ \delta(s_t = N)\,\boldsymbol{h}_t \\ \log P(s_t) \\ \log P(s_t \mid s_{t-1}) \end{bmatrix}$$
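A sketch of that joint feature map (the shapes and the handling of the first frame's transition are assumptions):

```python
import numpy as np

def joint_features(H, states, log_prior, log_trans, N):
    """phi(O, S): per-state blocks sum_t delta(s_t = j) h_t, plus the
    accumulated log prior and log transition scores.
    H: (T, dim), states: (T,), log_prior: (N,), log_trans: (N, N)."""
    T, dim = H.shape
    phi = np.zeros(N * dim + 2)
    prev = None
    for t, s in enumerate(states):
        phi[s * dim:(s + 1) * dim] += H[t]          # delta(s_t = j) h_t block
        phi[N * dim] += log_prior[s]                # paired with weight -1
        if prev is not None:
            phi[N * dim + 1] += log_trans[prev, s]  # paired with weight +1
        prev = s
    return phi
```

With $\mathbf{w} = [\mathbf{w}_1; \dots; \mathbf{w}_N; -1; +1]$, the inner product $\mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{O}, \mathbf{S})$ reproduces the path score above, so training the last layer at the sequence level is a structured SVM over these features.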