... Speech Recognition. S.-X. Austin Zhang and Mark Gales. 31 August 2011. Cambridge University Engineering Department. Interspeech 2011 Florence ...
Structured Support Vector Machines for Noise Robust Continuous Speech Recognition S.-X. Austin Zhang and Mark Gales 31 August 2011
Cambridge University Engineering Department
Interspeech 2011 Florence
Structured SVMs for Noise Robust CSR
SVMs for Continuous Speech Recognition • SVMs generate boundaries between classes – For isolated digit recognition, no problem (10 classes) sequence kernels: map to fixed length – Continuous speech? too many classes! (e.g., 6 digits ⇒ 106 classes) • Simplest approach:
FOUR
ONE
SEVEN
ONE
ONE
ONE
ZERO SIL
ZERO SIL
ZERO SIL
1. Using HMM-based Segmentation 2. For each segment, isolated classification
Problem: Restricted to one fixed segmentation • Alternatively, incorporate structures into SVM classes – Structured SVM ! Cambridge University Engineering Department
Interspeech 2011 Florence
1
Structured SVMs for Noise Robust CSR
Structured SVMs for CSR • Model whole utterance – observations O, word sequence w – joint features φ(O, w). How to build? generative model to extract features ⇒ model compensation l
s s
e
d
o
m
i
t
v
e
r
a
e n
e g
” E
N
O
λ
“
"
! #! ! ! #!
+
"
λ
”
“
O
T
W
: .
7
0
.
6
.
0
5
0
.
4
0
.
3
0
2
.
0
.
1
0
0
”
“
”
“
”
“
”
“
E
N
O
T
O
W
:
W
E
N
O
r
e
e
T
h
• Target: Learn discriminative parameters α Decoding by match score: arg max αTφ(O, w) w
Cambridge University Engineering Department
Interspeech 2011 Florence
2
Structured SVMs for Noise Robust CSR
Structured SVM Training • Training α, so that match score of correct reference wref ≥ all competing w
”
“ 0
0
0
0
0
, ”
“
1
e
l
m
S
a
p 1
,
”
“
3
2
1
n
i
t
e
o
m
c
l
t
o
a
g
p
1 )
(
1 )
(
f
r
e
”
“
9
9
9
, ”
“
0
0
0
,
e
n
l
m
S
a
”
“
p 1
0
0
,
”
“
6
5
4
n
i
t
e
c
o
m
l
t
a
o
g
p
n
n )
(
)
(
f
r
e
”
“
“
9
9
9
,
– Hard-margin, “+1” means “0-1 loss” Cambridge University Engineering Department
Interspeech 2011 Florence
3
Structured SVMs for Noise Robust CSR
Structured SVM Training • To generalize the training – Replace “0-1 loss” as L(wref, w) – Introduce slack variable ξi for linearly nonseparable cases
”
“
3
3
linear
2
1
α
2
1
”
“
• Unconstrained form min Fssvm(α)
(ξi is the part in · +) convex
}| {i z }| { z n h n o X 1 (i) (i) ||α||22 + C − αTφ(O(i), wref) + max L(w, wref) + αTφ(O(i), w) w6=wref 2 + r=1 e
n
o
e
n
o
,
Cambridge University Engineering Department
Interspeech 2011 Florence
4
Structured SVMs for Noise Robust CSR
Structured SVM ≡ Large Margin Log Linear Model • Example of log-linear models 1 T P (w|O; α) = exp α φ(O, w) Z • Margin: minimum distance between correct wref and competing w P (wref|O; α) min log w6=wref P (w|O; α) • Regularised, large margin criterion ≡ Structured SVM n h n oi X 1 (i) (i) ||α||22 +C −αTφ(O(i), wref)+ max L(w, wref) + αTφ(O(i), w) w6=wref 2 + r=1
Cambridge University Engineering Department
Interspeech 2011 Florence
5
Structured SVMs for Noise Robust CSR
Best Segmentation • Features depend on the segmentation θ
+
• For efficiency, consider one “best” segmentation – How to define “best”? – Previously, φ(O, w) = φ(O, w; θhmm) Problem: Best segmentation in HMM may not the best in SSVM – What we want: max αTφ(O, w; θ) θ
Cambridge University Engineering Department
Interspeech 2011 Florence
6
Structured SVMs for Noise Robust CSR
Can we optimize segmentation θ in Structured SVMs?
• How to learn α and θ jointly in training? • How to find w and θ in decoding?
Cambridge University Engineering Department
Interspeech 2011 Florence
7
Structured SVMs for Noise Robust CSR
Overall Training with Optimal Segmentation • Previously, no variable θ, minα (Convex Optimization) convex
linear }| {i z }| { z h n o n (i) T (i) 1 ||α||2 +C P −αT φ(O(i) , w(i) ) + max L(w, w ) + α φ(O , w) 2 ref ref 2 w6=wref
i=1
+
• Now, Optimizing θ with α (Nonconvex Optimization) concave
convex
z
}| }| { n o{ i hz (i) (i) −max αTφ(O(i), wref, θ) + max L(w, wref) + αTφ(O(i), w, θ) w6=wref ,θ
θ
+
e
n
o o
e
n
o
,
Cambridge University Engineering Department
Interspeech 2011 Florence
8
Structured SVMs for Noise Robust CSR
Overall Training with Optimal Segmentation • Previously, no variable θ, minα (Convex Optimization) convex
linear }| {i z }| { z h n o n (i) T (i) 1 ||α||2 +C P −αT φ(O(i) , w(i) ) + max L(w, w ) + α φ(O , w) 2 ref ref 2 w6=wref
i=1
+
• Now, Optimizing θ with α (Nonconvex Optimization) linear
convex
z
}| }| { hz n o{ i (i) (i) L(w, wref) + αTφ(O(i), w, θ) −αTφ(O(i), wref, θref) + max w6=wref ,θ
Cambridge University Engineering Department
Interspeech 2011 Florence
+
9
Structured SVMs for Noise Robust CSR
Overall Training with Optimal Segmentation • Algorithm (EM-style, guaranteed to converge): (i)
(i)
θref ← arg max αTφ(O(i), wref, θ), ∀i
➀ Find reference segmentation:
θ
(i)
➁ Original SSVM training:
α ← min Fssvm(α|θref)
• Illustration of Algorithm (Concave-Convex Procedure): n
i
n
m
n
r
r
i
F
n
t
o
t
a
e
e
s
e
c
e
e
e
d
g
f
1 e
a
c
v
o
C
n
x
v
n
e
o
C
r
i
L
n
a
e
2 i
n
i
n
r t
a
S
S
a
O
g
g
Interspeech 2011 Florence
M
V
l
i
n
i
r
Cambridge University Engineering Department
10
Structured SVMs for Noise Robust CSR
Decoding with Optimal Segmentation • Decoding with optimal segmentation θ arg max αTφ(O, w, θ) w,θ
• In general, very difficult. – For our features, efficient search algorithm exists T
Interspeech 2011 Florence
T
T
Cambridge University Engineering Department
11
Structured SVMs for Noise Robust CSR
Decoding with Optimal Segmentation • Search optimal position of “black nodes”, and the label between them.
• Viterbi-style search, ψ(t) = max ψ(tst) + tst ,w
w
α
w
α
“zero P”
k=“one”
(w)
αk HMMk
– Related to Factorial HMM inference Cambridge University Engineering Department
Interspeech 2011 Florence
12
Structured SVMs for Noise Robust CSR
Implementation • Overall framework:
2GXMK3GXMOT:XGOTOTM t
d e
a
d
U
p
i
i
)
)
(
(
g
f
r
e
n
1
i
t
n
a
r
e
n
e
G
i
g
m
n
i
o
t
a
t
e
n
e
m
s
e
c
e
n
r
e
e
h
r
h
r
a
c
e
c
a
r
e
S
r
t
a
o
r
e
m
u
N
g
f
m
i
)
(
a
e
i
t
c
t
L
a
r g o r P c
t
a
d e
d
U
i
p 2 t a r
i
t
n
a
r
e
n
e
G
d
g a
u
t
r
o
i
a
n
m
n
o
e
D
i
n
t
o
t
a
e
n
e
m
s
d
n
a
i
n
t
e
o
m
c
S
g
g
p
Q
e
i
t
c
t
L
a
:XGOTOTM6NGYK *KIUJOTM6NGYK
^
XMOT:XGOTOTM
i
t
n
a
r
e
n
e
G
g ^
t
r
o
i
a
n
m
n
o
e
D
i
n
t
o
t
a
e
n
s
e
m
d
a
n
t
s
e
b
r
c
h
S
a
e
g
i
t
e
t
c
L
a
d
i
l
z
e
l
l
e
r
a
:
a
P
2
– Loop
is parallelized training of original structured SVM
Cambridge University Engineering Department
Interspeech 2011 Florence
13
Structured SVMs for Noise Robust CSR
AURORA 2 Results • Noise corrupted continuous digit task. SNRs from 0dB–20dB. Model HMM
Training Decoding Avg. all — — 9.5 θhmm θhmm 7.8 Structured θhmm opt θ 7.6 SVM opt θ opt θ 7.4
• Structured SVMs with optimal segmentation: – Gain over VTS-compensated HMMs: 22% – Gain over Multi-Class SVMs: 10% – Gain over original Structured SVMs: 4% Cambridge University Engineering Department
Interspeech 2011 Florence
14
Structured SVMs for Noise Robust CSR
Conclusion • Examined Structured SVMs for noise robust speech recognition – equivalent to Large Margin Log Linear Models • Enabled optimising segmentation in Structured SVMs training – use concave-convex procedure • Enabled optimising segmentation in Structured SVMs decoding – proposed Viterbi-style efficient search • Good Performance
Cambridge University Engineering Department
Interspeech 2011 Florence
15