Deep Neural SVM for Speech Recognition
Shi-Xiong Zhang, Chaojun Liu, Kaisheng Yao and Yifan Gong
Microsoft, Redmond
April 21, 2015, ICASSP
Outline
• Motivation
• Neural Networks
• Support Vector Machines
• Deep Neural SVM
  • Max-Margin Training (Frame-level & Sequence-level)
  • Decoding
• Experiments
Motivation
Support Vector Machines (binary / multiclass / sequence)
SVMs generate linear boundaries in a feature space $\boldsymbol{x} \mapsto \boldsymbol{\phi}(\boldsymbol{x})$:
$$y = \mathbf{w}^{\mathsf{T}}\boldsymbol{\phi}(\boldsymbol{x}) = \sum_i \alpha_i\, k(\boldsymbol{x}, \boldsymbol{x}_i)$$
Support Vector Machines vs. Neural Networks
[Figure: side-by-side diagrams; the SVM computes $y = \sum_i \alpha_i\, k(\boldsymbol{x}, \boldsymbol{x}_i)$ from kernel units $k(\boldsymbol{x}, \boldsymbol{x}_i)$, while the neural network computes $y$ from hidden layers $\boldsymbol{h}(\boldsymbol{x})$.]
Neural Networks: too many local minima; tend to overfit; fixed model size; but deep.
Support Vector Machines: convex training; max margin; automatic model size (the support vectors); but can they be deep?
NN ⇒ Deep NN
SVM ⇒ Deep SVM
Combining both ⇒ Deep Neural SVM
Softmax layer of DNN
Output of the softmax:
$$P(s_t \mid \boldsymbol{o}_t) = \frac{\exp(\mathbf{w}_{s_t}^{\mathsf{T}}\boldsymbol{h}_t)}{\sum_{s=1}^{N} \exp(\mathbf{w}_{s}^{\mathsf{T}}\boldsymbol{h}_t)}$$
The normalization term is independent of the states.
Softmax in DNN vs. Classification in SVM
Classification in a DNN picks $\arg\max_j \mathbf{w}_j^{\mathsf{T}}\boldsymbol{h}_t$; classification in a multiclass SVM picks $\arg\max_j \mathbf{w}_j^{\mathsf{T}}\boldsymbol{\phi}(\boldsymbol{o}_t)$. Taking $\boldsymbol{\phi}(\boldsymbol{o}_t) = \boldsymbol{h}_t$, the two decision rules coincide.
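Because the normalization term is shared by all states, the DNN's decision rule and the multiclass SVM's decision rule are the same argmax over linear scores. A minimal numpy sketch of this equivalence (all names and sizes here are illustrative, not from the paper):

```python
import numpy as np

np.random.seed(0)
N, D = 10, 64                     # number of states, hidden-layer size (illustrative)
W = np.random.randn(N, D)         # last-layer weights, one row per state
h_t = np.random.randn(D)          # last hidden-layer activations for frame t

scores = W @ h_t                  # linear scores w_j^T h_t
softmax = np.exp(scores - scores.max())
softmax /= softmax.sum()          # P(s_t | o_t)

# The normalization is the same for every state, so both rules agree:
assert np.argmax(softmax) == np.argmax(scores)
```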
The architecture of Deep Neural SVMs
[Figure: a DNN whose last hidden layer feeds a Multiclass SVM (frame by frame) or a Sequence SVM (over the whole time sequence).]
Frame-level Max-Margin Training
[Figure: DNN hidden layers feeding a Multiclass SVM on the last layer.]
1st Step: training the last layer ≡ Multiclass SVM
Maximize the margin, subject to: the score of the ground-truth state $s_t$ is at least the score of every competing state $\bar{s}_t$, computed on the last hidden-layer activations $\boldsymbol{h}_t$.
2nd Step: training the previous layers
By the chain rule, for any lower-layer parameter $n$:
$$\frac{\partial \mathcal{F}}{\partial n} = \left(\frac{\partial \boldsymbol{h}_t}{\partial n}\right)^{\mathsf{T}} \frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t}$$
The factor $\partial \boldsymbol{h}_t / \partial n$ is the same as in standard DNN back-propagation; the important new factor is $\partial \mathcal{F} / \partial \boldsymbol{h}_t$. Only the support vectors have gradient!
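As a sketch of that last step, here is the frame-level gradient with respect to $\boldsymbol{h}_t$ rendered in numpy, using the squared-hinge objective and the explicit gradient formula given in the complementary slides (function and variable names are my own):

```python
import numpy as np

def grad_wrt_h(W, h_t, s_t, C=1.0):
    """Gradient of the squared-hinge frame objective w.r.t. h_t:

        F_t = C * [1 + max_{s != s_t} w_s^T h_t - w_{s_t}^T h_t]_+^2

    Frames that satisfy the margin (non-support-vectors) get zero gradient.
    """
    scores = W @ h_t
    competing = np.delete(scores, s_t)      # scores of all states s != s_t
    s_bar = int(np.argmax(competing))
    if s_bar >= s_t:                        # undo the index shift from np.delete
        s_bar += 1
    slack = 1.0 + scores[s_bar] - scores[s_t]
    if slack <= 0.0:                        # margin satisfied: not a support vector
        return np.zeros_like(h_t)
    return 2.0 * C * slack * (W[s_bar] - W[s_t])
```

Frames that already satisfy the margin contribute nothing, which is exactly the "only the support vectors have gradient" property; the returned vector is then back-propagated through the lower layers as in a standard DNN.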
Sequence-level Max-Margin Training
[Figure: DNN hidden layers feeding a Sequence SVM over the time sequence.]
1st Step: training the last layer
Maximize the margin in $\log P(\boldsymbol{S}_{1:T} \mid \mathbf{O}_{1:T})$ between the reference state sequence $\boldsymbol{S}$ and every competing state sequence $\bar{\boldsymbol{S}}$:
$$\mathcal{F} = \min_{\bar{\boldsymbol{S}}} \log \frac{P(\boldsymbol{S} \mid \mathbf{O})}{P(\bar{\boldsymbol{S}} \mid \mathbf{O})} = \min_{\bar{\boldsymbol{S}}} \log \frac{P(\mathbf{O} \mid \boldsymbol{S})\,P(\boldsymbol{S})}{P(\mathbf{O} \mid \bar{\boldsymbol{S}})\,P(\bar{\boldsymbol{S}})}$$
where the log score of each sequence decomposes over frames as
$$\sum_{t=1}^{T} \left[\mathbf{w}_{s_t}^{\mathsf{T}}\boldsymbol{h}_t - \log P(s_t) + \log P(s_t \mid s_{t-1})\right]$$
1st Step: training the last layer ≡ Structural SVM
[Figure: a trellis of states over time; each node at time $t$, state $j$ carries the acoustic score $\mathbf{w}_j^{\mathsf{T}}\boldsymbol{h}_t$, each arc carries the weighted transition score $w_{ij}\log P(s_t \mid s_{t-1})$, and each path through the trellis defines a feature space $\boldsymbol{\phi}(\mathbf{O}, \boldsymbol{S})$.]
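Since the score of a path is a sum of per-frame terms, it is linear in the last-layer weights, which is what makes the sequence problem a structural SVM. A small illustrative sketch of the path score from the frame-wise decomposition above (names are my own, not from the paper):

```python
import numpy as np

def path_score(W, H, S, log_prior, log_trans):
    """Score of one state path S = (s_1, ..., s_T) through the trellis:

        sum_t [ w_{s_t}^T h_t - log P(s_t) + log P(s_t | s_{t-1}) ]

    W: (N, D) last-layer weights;  H: (T, D) hidden activations;
    log_prior: (N,) log P(s);      log_trans: (N, N) log P(s' | s).
    """
    score, prev = 0.0, None
    for t, s in enumerate(S):
        score += W[s] @ H[t] - log_prior[s]   # acoustic score minus state prior
        if prev is not None:
            score += log_trans[prev, s]       # transition score
        prev = s
    return score
```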
2nd Step: training the previous layers
The same chain rule applies with the sequence-level objective $\mathcal{F}$:
$$\frac{\partial \mathcal{F}}{\partial n} = \left(\frac{\partial \boldsymbol{h}_t}{\partial n}\right)^{\mathsf{T}} \frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t}$$
Again $\partial \boldsymbol{h}_t / \partial n$ is the same as in DNN back-propagation, the important factor is the margin-based $\partial \mathcal{F} / \partial \boldsymbol{h}_t$, and only the support vectors have gradient!
Decoding
Decoding of Deep Neural SVMs
[Figure: decoding searches the state trellis for the best path, combining the per-frame acoustic scores $\mathbf{w}_j^{\mathsf{T}}\boldsymbol{h}_t$ with the weighted transition scores $w_{ij}\log a_{ij}$.]
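Decoding thus reduces to a standard Viterbi search over the trellis, with the SVM's linear scores $\mathbf{w}_j^{\mathsf{T}}\boldsymbol{h}_t$ in place of log-likelihoods. A minimal sketch, assuming the weighted transition scores $w_{ij}\log a_{ij}$ are precomputed into a single `log_trans` matrix:

```python
import numpy as np

def viterbi(W, H, log_trans):
    """Best state path: argmax over paths of sum_t [w_{s_t}^T h_t + weighted transition]."""
    T, N = H.shape[0], W.shape[0]
    acoustic = H @ W.T                        # (T, N) frame scores w_j^T h_t
    delta = acoustic[0].copy()                # best score ending in each state
    back = np.zeros((T, N), dtype=int)        # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_trans     # (N, N): previous state -> next state
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + acoustic[t]
    path = [int(delta.argmax())]              # backtrace from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```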
Experiments
Experiments
TIMIT: continuous phone recognition (3 states for each of the 61 monophones). Phone error rates:

System   Training criterion           PER
GMM      ML                           31.0%
DNN      CE                           22.9%
DNSVM    Frame-level max margin       22.0%
DNSVM    Sequence-level max margin    21.0%

The DNSVM gives an 8% relative improvement over the DNN.
Experiments

DNSVM (frame-level)
  Last layer only      22.03%
  + previous layers    21.95%
• Most of the gains come from the last layer.

DNSVM (sequence-level)
  Acoustic model only          21.38%
  Jointly learned with LM      21.04%
• Joint learning with the LM yields a 1.6% relative improvement.
Conclusion
Future work: Deep Neural SVM
Complementary
Support Vector Machines (binary / multiclass / sequence)
Kernel function: $k(\boldsymbol{x}, \boldsymbol{x}_i) = \boldsymbol{\phi}(\boldsymbol{x})^{\mathsf{T}} \boldsymbol{\phi}(\boldsymbol{x}_i)$, so the decision function can be written either in the feature space or in kernel form:
$$y = \mathbf{w}^{\mathsf{T}}\boldsymbol{\phi}(\boldsymbol{x}) = \sum_i \alpha_i\, k(\boldsymbol{x}, \boldsymbol{x}_i)$$
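For concreteness, a tiny sketch of the kernel-form decision function; the RBF kernel here is one common choice of $k$ and is my own illustrative assumption, not taken from the slides:

```python
import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    # k(x, x_i) = exp(-gamma * ||x - x_i||^2), one valid choice of phi(x)^T phi(x_i)
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def svm_decision(x, support_vectors, alphas, gamma=0.5):
    # y = sum_i alpha_i * k(x, x_i); only the support vectors contribute
    return sum(a * rbf_kernel(x, xi, gamma)
               for a, xi in zip(alphas, support_vectors))
```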
The architecture of Deep Neural SVM
[Figure: DNN hidden layers below a state trellis over time; the last hidden layer feeds the Sequence SVM.]
Frame-level max-margin training (previous layers)
The frame-level objective is
$$\mathcal{F} = \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{t=1}^{T} \left[1 + \max_{\bar{s}_t \neq s_t} \mathbf{w}_{\bar{s}_t}^{\mathsf{T}}\boldsymbol{h}_t - \mathbf{w}_{s_t}^{\mathsf{T}}\boldsymbol{h}_t\right]_+^2$$
and the previous layers are trained through the chain rule
$$\frac{\partial \mathcal{F}}{\partial n} = \left(\frac{\partial \boldsymbol{h}_t}{\partial n}\right)^{\mathsf{T}} \frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t}$$
where $\partial \boldsymbol{h}_t / \partial n$ is the same as in DNN back-propagation and the key term is $\partial \mathcal{F} / \partial \boldsymbol{h}_t$. Only the support vectors have gradient!
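A numpy rendering of this objective, as a sketch (regularizing the last-layer weights only, with unit margin and squared hinge as in the formula; names are illustrative):

```python
import numpy as np

def frame_objective(W, H, labels, C=1.0):
    """F = 0.5 * ||W||^2 + C * sum_t [1 + max_{s != s_t} w_s^T h_t - w_{s_t}^T h_t]_+^2

    W: (N, D) last-layer weights; H: (T, D) hidden activations;
    labels: (T,) ground-truth state per frame.
    """
    scores = H @ W.T                            # (T, N) frame scores
    T = H.shape[0]
    correct = scores[np.arange(T), labels]      # score of the true state
    masked = scores.copy()
    masked[np.arange(T), labels] = -np.inf      # exclude the true state from the max
    slack = np.maximum(0.0, 1.0 + masked.max(axis=1) - correct)
    return 0.5 * np.sum(W ** 2) + C * np.sum(slack ** 2)
```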
2nd Step: training the previous layers
The gradient of the frame-level objective with respect to the hidden activations is
$$\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t} = 2C \left[1 + \mathbf{w}_{\bar{s}_t}^{\mathsf{T}}\boldsymbol{h}_t - \mathbf{w}_{s_t}^{\mathsf{T}}\boldsymbol{h}_t\right]_+ (\mathbf{w}_{\bar{s}_t} - \mathbf{w}_{s_t})$$
which is then back-propagated through the lower layers exactly as in a DNN. Only the support vectors have gradient!
Sequence-level max-margin training (last layer)
Maximize the margin, subject to: the score of the reference sequence $\boldsymbol{S}_{1:T}$ is at least the score of every competing sequence $\bar{\boldsymbol{S}}_{1:T}$.
2nd Step: training the previous layers (sequence level)
The chain rule is as before; the gradient with respect to $\boldsymbol{h}_t$ now involves the margin between path scores,
$$\frac{\partial \mathcal{F}}{\partial \boldsymbol{h}_t} = 2C \left[\mathcal{L} + \mathrm{PathScore}_{\bar{\boldsymbol{S}}} - \mathrm{PathScore}_{\boldsymbol{S}}\right]_+ (\mathbf{w}_{\bar{s}_t} - \mathbf{w}_{s_t})$$
and again only the support vectors have gradient!
Sequence-level max-margin training (last layer) ≡ Structural SVM
$$\mathcal{F} = \min_{\bar{\boldsymbol{S}} \neq \boldsymbol{S}} \log \frac{P(\boldsymbol{S} \mid \mathbf{O})}{P(\bar{\boldsymbol{S}} \mid \mathbf{O})} = \min_{\bar{\boldsymbol{S}} \neq \boldsymbol{S}} \log \frac{P(\mathbf{O} \mid \boldsymbol{S})\,P(\boldsymbol{S})}{P(\mathbf{O} \mid \bar{\boldsymbol{S}})\,P(\bar{\boldsymbol{S}})}$$
with the per-sequence score $\sum_{t=1}^{T} [\mathbf{w}_{s_t}^{\mathsf{T}}\boldsymbol{h}_t - \log P(s_t) + \log P(s_t \mid s_{t-1})]$, which is linear in a joint feature space:
$$\mathbf{w}^{\mathsf{T}}\boldsymbol{\phi}(\mathbf{O}, \boldsymbol{S}) = \sum_{t=1}^{T} \begin{bmatrix}\mathbf{w}_1 \\ \vdots \\ \mathbf{w}_N \\ -1 \\ +1\end{bmatrix}^{\mathsf{T}} \begin{bmatrix}\delta(s_t = 1)\,\boldsymbol{h}_t \\ \vdots \\ \delta(s_t = N)\,\boldsymbol{h}_t \\ \log P(s_t) \\ \log P(s_t \mid s_{t-1})\end{bmatrix}$$
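A sketch of this joint feature map in numpy (illustrative names; the fixed $-1$ and $+1$ weights multiply the accumulated log-prior and log-transition blocks, as in the equation above):

```python
import numpy as np

def joint_feature(H, S, log_prior, log_trans, N):
    """phi(O, S): for each state j, accumulate delta(s_t = j) * h_t over t,
    plus the summed log-prior and log-transition scores as two extra entries.

    H: (T, D) hidden activations; S: state path; N: number of states.
    """
    T, D = H.shape
    phi = np.zeros(N * D + 2)
    prev = None
    for t, s in enumerate(S):
        phi[s * D:(s + 1) * D] += H[t]            # delta(s_t = j) h_t block
        phi[N * D] += log_prior[s]                # paired with the fixed weight -1
        if prev is not None:
            phi[N * D + 1] += log_trans[prev, s]  # paired with the fixed weight +1
        prev = s
    return phi
```

With $\mathbf{w} = [\mathbf{w}_1; \dots; \mathbf{w}_N; -1; +1]$, the inner product $\mathbf{w}^{\mathsf{T}}\boldsymbol{\phi}(\mathbf{O}, \boldsymbol{S})$ reproduces the per-sequence score above, which is what lets standard structural-SVM machinery train the last layer.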