Generalization Bounds and Consistency for Latent-Structural Probit and Ramp Loss
David McAllester TTI-Chicago
Motivation
• Systems are often evaluated with some quantitative performance metric or task loss.
• Here we consider consistency in the sense of minimizing task loss. SVMs are not consistent in this sense.
• Here we show that, unlike hinge loss or log loss, latent-structural probit loss and ramp loss are both consistent.
• We also give finite-sample generalization bounds and point to a variety of supporting empirical work.
Why Surrogate Loss Functions

We consider an arbitrary input space X and a finite label space Y, a source probability distribution over pairs (x, y), and a task loss L with L(y, \hat{y}) \in [0, 1]. We will use a linear classifier with parameter vector w:

\hat{y}_w(x) = \mathrm{argmax}_y \; w^\top \phi(x, y)

We would like

w^* = \mathrm{argmin}_w \; E_{x,y}\left[ L(y, \hat{y}_w(x)) \right]

We get

\hat{w} = \mathrm{argmin}_w \left( \sum_{i=1}^n L_s(w, x_i, y_i) + \frac{\lambda}{2} \|w\|^2 \right)
The surrogate loss L_s must be scale-sensitive and hence different from the task loss L.
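The sketch below (Python; the names data, surrogate, and lam are illustrative, not from the poster) shows the generic shape of this regularized training objective, where surrogate can be any of the scale-sensitive losses defined next.

def regularized_objective(w, data, surrogate, lam):
    # data: list of (x, y) pairs; surrogate(w, x, y): any scale-sensitive surrogate loss L_s
    # returns sum_i L_s(w, x_i, y_i) + (lam / 2) * ||w||^2
    reg = 0.5 * lam * sum(wi * wi for wi in w)
    return sum(surrogate(w, x, y) for x, y in data) + reg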
Standard Surrogate Loss Functions (Binary Case)

y \in \{-1, 1\}

\hat{y}_w(x) = \mathrm{sign}(w^\top \Phi(x)), \qquad \Phi(x, y) = \tfrac{1}{2}\, y\, \Phi(x), \qquad m = y\, w^\top \Phi(x)

L_{\mathrm{log}}(w, x, y) = \ln(1 + e^{-m})
L_{\mathrm{hinge}}(w, x, y) = \max(0, 1 - m)
L_{\mathrm{ramp}}(w, x, y) = \min(1, \max(0, 1 - m))
L_{\mathrm{probit}}(w, x, y) = P_{\epsilon \sim N(0,1)}[\epsilon \ge m] \quad (\text{assuming } \|\Phi(x)\| = 1)
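A minimal numeric sketch of these four binary surrogates (Python; the probit case uses the closed-form standard normal tail, which matches the definition under the stated assumption ||Phi(x)|| = 1):

import math

def margin(w, phi_x, y):
    # m = y * w^T Phi(x)
    return y * sum(wi * xi for wi, xi in zip(w, phi_x))

def log_loss(m):
    return math.log(1.0 + math.exp(-m))

def hinge_loss(m):
    return max(0.0, 1.0 - m)

def ramp_loss(m):
    return min(1.0, max(0.0, 1.0 - m))

def probit_loss(m):
    # P_{eps ~ N(0,1)}[eps >= m]
    return 0.5 * math.erfc(m / math.sqrt(2.0))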
Latent Labels
We now assume a finite set Z of “latent labels”.
\hat{s}_w(x) = \mathrm{argmax}_{(y,z)} \; w^\top \phi(x, y, z)

w^* = \mathrm{argmin}_w \; E_{x,y}\left[ L(y, \hat{s}_w(x)) \right], \qquad L(y, (\hat{y}, \hat{z})) = L(y, \hat{y})
Surrogate Loss Functions

L_{\mathrm{log}}(w, x, y) = \ln \frac{1}{P_w(y \mid x)} = \ln Z_w(x) - \ln Z_w(x, y)

Z_w(x) = \sum_{y,z} \exp(w^\top \phi(x, y, z)), \qquad Z_w(x, y) = \sum_{z} \exp(w^\top \phi(x, y, z))

L_{\mathrm{hinge}}(w, x, y) = \max_s \left( w^\top \phi(x, s) + L(y, s) \right) - \max_z w^\top \phi(x, (y, z))

L_{\mathrm{ramp}}(w, x, y) = \max_s \left( w^\top \phi(x, s) + L(y, s) \right) - \max_s w^\top \phi(x, s)

L_{\mathrm{probit}}(w, x, y) = E_\epsilon\left[ L(y, \hat{s}_{w+\epsilon}(x)) \right], \qquad \epsilon \sim N(0, I)
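As a concrete illustration (a sketch, not the poster's implementation; all names are hypothetical), the Python below evaluates these surrogates over an explicit finite set S of (y, z) pairs, with probit loss estimated by Monte Carlo over Gaussian perturbations.

import math, random

def score(w, feat):
    return sum(wi * fi for wi, fi in zip(w, feat))

def struct_log_loss(w, feats, y, S):
    # feats[s] = phi(x, s) for each s = (y, z) in S; no overflow guards, illustration only
    log_Zx  = math.log(sum(math.exp(score(w, feats[s])) for s in S))
    log_Zxy = math.log(sum(math.exp(score(w, feats[s])) for s in S if s[0] == y))
    return log_Zx - log_Zxy

def struct_hinge_loss(w, feats, y, S, task_loss):
    aug  = max(score(w, feats[s]) + task_loss(y, s[0]) for s in S)
    best = max(score(w, feats[s]) for s in S if s[0] == y)
    return aug - best

def struct_ramp_loss(w, feats, y, S, task_loss):
    aug  = max(score(w, feats[s]) + task_loss(y, s[0]) for s in S)
    best = max(score(w, feats[s]) for s in S)
    return aug - best

def struct_probit_loss(w, feats, y, S, task_loss, n_samples=1000):
    # Monte Carlo estimate of E_eps[ L(y, s_hat_{w+eps}(x)) ] with eps ~ N(0, I)
    total = 0.0
    for _ in range(n_samples):
        wp = [wi + random.gauss(0.0, 1.0) for wi in w]
        s_hat = max(S, key=lambda s: score(wp, feats[s]))
        total += task_loss(y, s_hat[0])
    return total / n_samples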
Citations
L_log: CRFs: J. Lafferty, A. McCallum, and F. Pereira, 2001. Hidden: A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell, 2007.
L_hinge: Struct: B. Taskar, C. Guestrin, and D. Koller, 2003; I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, 2004. Hidden: Chun-Nam John Yu and T. Joachims, 2009.
L_ramp: Binary: R. Collobert, F. H. Sinz, J. Weston, and L. Bottou, 2006. Struct: Chuong B. Do, Quoc Le, Choon Hui Teo, Olivier Chapelle, and Alex Smola, 2008.
L_probit: Joseph Keshet, David McAllester, and Tamir Hazan, ICASSP 2011.
Tighter Upper Bound on Task Loss
Lemma: L_{\mathrm{hinge}}(w, x, y) \ge L_{\mathrm{ramp}}(w, x, y) \ge L(w, x, y), where L(w, x, y) denotes the task loss L(y, \hat{s}_w(x)).

L_{\mathrm{hinge}}(w, x, y) \ge L_{\mathrm{ramp}}(w, x, y) is immediate from the definitions. Furthermore:

L_{\mathrm{ramp}}(w, x, y) = \max_s \left( w^\top \phi(x, s) + L(y, s) \right) - w^\top \phi(x, \hat{s}_w(x))
\ge w^\top \phi(x, \hat{s}_w(x)) + L(y, \hat{s}_w(x)) - w^\top \phi(x, \hat{s}_w(x)) = L(y, \hat{s}_w(x))
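A quick randomized sanity check of the lemma (a sketch reusing the hypothetical struct_* helpers above; a 0-1 task loss on the y-component is assumed):

import random

def task_loss(y, y_hat):
    return 0.0 if y == y_hat else 1.0

random.seed(0)
Y_SET, Z_SET = [0, 1, 2], [0, 1]
S = [(y, z) for y in Y_SET for z in Z_SET]
for _ in range(1000):
    w = [random.gauss(0.0, 1.0) for _ in range(4)]
    feats = {s: [random.gauss(0.0, 1.0) for _ in range(4)] for s in S}
    y = random.choice(Y_SET)
    h = struct_hinge_loss(w, feats, y, S, task_loss)
    r = struct_ramp_loss(w, feats, y, S, task_loss)
    s_hat = max(S, key=lambda s: score(w, feats[s]))
    assert h >= r - 1e-9 and r >= task_loss(y, s_hat[0]) - 1e-9   # hinge >= ramp >= task loss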
Ramp Loss (or Direct Loss) Perceptron Updates

Optimizing L_ramp by subgradient descent yields the following update rule (here we ignore regularization); see the sketch below.

\Delta w \propto \phi(x, \hat{s}_w(x)) - \phi(x, \hat{s}^+_w(x, y)), \qquad \hat{s}^+_w(x, y) = \mathrm{argmax}_s \left( w^\top \phi(x, s) + L(y, s) \right)

Under mild conditions on the probability distribution over pairs (x, y) we have the following [McAllester, Hazan, Keshet, 2010]:

\nabla_w L(w) = \lim_{\alpha \to \infty} \alpha \, E_{x,y}\left[ \phi(x, \hat{s}^+_{\alpha w}(x, y)) - \phi(x, \hat{s}_{\alpha w}(x)) \right]
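A single unregularized update step might look like this sketch (hypothetical names, reusing score, feats, and task_loss from above; eta is a step size):

def ramp_update(w, feats, y, S, task_loss, eta=0.1):
    # s_hat: current argmax s_hat_w(x); s_plus: loss-augmented argmax s_hat^+_w(x, y)
    s_hat  = max(S, key=lambda s: score(w, feats[s]))
    s_plus = max(S, key=lambda s: score(w, feats[s]) + task_loss(y, s[0]))
    # w <- w + eta * ( phi(x, s_hat_w(x)) - phi(x, s_hat^+_w(x, y)) )
    return [wi + eta * (a - b) for wi, a, b in zip(w, feats[s_hat], feats[s_plus])]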
Empirical Results

Methods closely related to ramp loss updates with early stopping have been shown to be effective in machine translation.
• P. Liang, A. Bouchard-Côté, D. Klein, and B. Taskar. COLING/ACL, 2006.
• D. Chiang, K. Knight, and W. Wang. NAACL, 2009.
Ramp loss updates regularized with early stopping (direct loss) have been shown to give improvements over hinge loss in phoneme alignment and phoneme recognition.
• McAllester, Hazan, and Keshet. NIPS, 2010.
• Keshet, Cheng, Stoehr, and McAllester, to appear at Interspeech 2011.
Probit loss has been shown to give an improvement over hinge loss for phoneme recognition.
• Keshet, McAllester, and Hazan. ICASSP, 2011.
Some Notation
L(w) = E_{x,y}\left[ L(w, x, y) \right]

L^* = \inf_w L(w)

\hat{L}^n(w) = \frac{1}{n} \sum_{i=1}^n L(w, x_i, y_i)
Consistency of Probit Loss
We consider the following learning rule, where λ_n is some given function of n.

\hat{w}_n = \mathrm{argmin}_w \left( \hat{L}^n_{\mathrm{probit}}(w) + \frac{\lambda_n}{2n} \|w\|^2 \right)

If
• λ_n increases without bound
• (λ_n \ln n)/n converges to zero
then

\lim_{n \to \infty} L_{\mathrm{probit}}(\hat{w}_n) = L^*
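For example, one schedule satisfying both conditions (not a choice made in the poster):

\lambda_n = \sqrt{n}: \qquad \lambda_n \to \infty, \qquad \frac{\lambda_n \ln n}{n} = \frac{\ln n}{\sqrt{n}} \to 0.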
PAC-Bayesian Bounds [Catoni 07], [Germain, Lacasse, Laviolette, Marchand 09]

For any fixed prior distribution P and fixed λ > 1/2, with probability at least 1 − δ over the draw of the training data, the following holds simultaneously for all Q:

L(Q) \le \frac{1}{1 - \frac{1}{2\lambda}} \left( \hat{L}^n(Q) + \frac{\lambda}{n} \left( \mathrm{KL}(Q, P) + \ln \frac{1}{\delta} \right) \right)

Corollary:

L_{\mathrm{probit}}(w) \le \frac{1}{1 - \frac{1}{2\lambda_n}} \left( \hat{L}^n_{\mathrm{probit}}(w) + \frac{\lambda_n}{n} \left( \frac{\|w\|^2}{2} + \ln \frac{1}{\delta} \right) \right)
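A worked evaluation of the corollary's right-hand side with made-up numbers (purely illustrative values, not results from the poster):

import math

def probit_bound(emp_probit_loss, norm_w_sq, n, lam, delta):
    # right-hand side of the corollary above
    slack = (lam / n) * (norm_w_sq / 2.0 + math.log(1.0 / delta))
    return (emp_probit_loss + slack) / (1.0 - 1.0 / (2.0 * lam))

# e.g. empirical probit loss 0.20, ||w||^2 = 100, n = 10000, lambda = 5, delta = 0.05
print(probit_bound(0.20, 100.0, 10000, 5.0, 0.05))  # roughly 0.25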
Consistency of Ramp Loss
Now we consider the following ramp loss training equation:

\hat{w}_n = \mathrm{argmin}_w \left( \hat{L}^n_{\mathrm{ramp}}(w) + \frac{\gamma_n}{2n} \|w\|^2 \right) \qquad (1)

If
• γ_n / \ln^2 n increases without bound
• γ_n / (n \ln n) converges to zero
then

\lim_{n \to \infty} L_{\mathrm{probit}}((\ln n)\, \hat{w}_n) = L^*
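For example, one schedule satisfying both conditions (again not taken from the poster):

\gamma_n = \ln^3 n: \qquad \frac{\gamma_n}{\ln^2 n} = \ln n \to \infty, \qquad \frac{\gamma_n}{n \ln n} = \frac{\ln^2 n}{n} \to 0.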
Main Lemma
\lim_{\sigma \to 0} L_{\mathrm{probit}}(w/\sigma, x, y) \le L(w, x, y) \le L_{\mathrm{ramp}}(w, x, y)

L_{\mathrm{probit}}(w/\sigma, x, y) \le L_{\mathrm{ramp}}(w, x, y) + \sigma + \sigma \sqrt{8 \ln \frac{|S|}{\sigma}}
Proof of Main Lemma Part I
L_{\mathrm{probit}}(w/\sigma, x, y) \le \sigma + \max_{s:\, m(s) \le M} L(y, s)

where

m(s) = w^\top \Delta\phi(s), \qquad \Delta\phi(s) = \phi(x, \hat{s}_w(x)) - \phi(x, s), \qquad M = \sigma \sqrt{8 \ln \frac{|S|}{\sigma}}

Proof: note that \hat{s}_{w/\sigma + \epsilon}(x) = \hat{s}_{w + \sigma\epsilon}(x), since rescaling the weight vector does not change the argmax. For m(s) > M we have the following.

P[\hat{s}_{w+\sigma\epsilon}(x) = s] \le P[(w + \sigma\epsilon)^\top \Delta\phi(s) \le 0]
= P[-\epsilon^\top \Delta\phi(s) \ge m(s)/\sigma]
\le P_{\epsilon \sim N(0,1)}\left[ \epsilon \ge \frac{M}{2\sigma} \right]
\le \exp\left( -\frac{M^2}{8\sigma^2} \right) = \frac{\sigma}{|S|}

Taking a union bound over the at most |S| such s,

E_\epsilon\left[ L(y, \hat{s}_{w+\sigma\epsilon}(x)) \right] \le P[\exists s:\, m(s) > M,\ \hat{s}_{w+\sigma\epsilon}(x) = s] + \max_{s:\, m(s) \le M} L(y, s)
\le \sigma + \max_{s:\, m(s) \le M} L(y, s)
Proof of Main Lemma Part II
L_{\mathrm{probit}}(w/\sigma, x, y) \le \sigma + \max_{s:\, m(s) \le M} L(y, s)
\le \sigma + \max_{s:\, m(s) \le M} \left( L(y, s) - m(s) + M \right)
\le \sigma + \max_s \left( L(y, s) - m(s) + M \right)
= \sigma + L_{\mathrm{ramp}}(w, x, y) + M
Using the Main Lemma
L_{\mathrm{probit}}(w/\sigma) \le \frac{1}{1 - \frac{1}{2\lambda_n}} \left( \hat{L}^n_{\mathrm{ramp}}(w) + \sigma + \sigma \sqrt{8 \ln \frac{|S|}{\sigma}} + \frac{\lambda_n}{n} \left( \frac{\|w\|^2}{2\sigma^2} + \ln \frac{1}{\delta} \right) \right)

Now take \sigma_n = 1/\ln n and \lambda_n = \gamma_n / \ln^2 n.
A Comparison of Convergence Rates
Optimizing σ as a function of λ_n, \|w\| and n we get (approximately)

\sigma = \left( \frac{\lambda_n \|w\|^2}{n} \right)^{1/3}

which gives

L_{\mathrm{probit}}(w/\sigma) \le \frac{1}{1 - \frac{1}{2\lambda_n}} \left( \hat{L}^n_{\mathrm{ramp}}(w) + \left( \frac{\lambda_n \|w\|^2}{n} \right)^{1/3} \left( \frac{3}{2} + \sqrt{8 \ln \frac{|S|}{\sigma}} \right) + \frac{\lambda_n}{n} \ln \frac{1}{\delta} \right)

which should be contrasted with

L_{\mathrm{probit}}(w) \le \frac{1}{1 - \frac{1}{2\lambda_n}} \left( \hat{L}^n_{\mathrm{probit}}(w) + \frac{\lambda_n}{n} \left( \frac{\|w\|^2}{2} + \ln \frac{1}{\delta} \right) \right)
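A rough arithmetic illustration of the gap in rates, with fixed made-up values of λ_n and ||w||^2 (the n^{-1/3} term from the ramp-based bound versus the n^{-1} term from the probit bound):

lam, norm_w_sq = 2.0, 100.0
for n in (10**3, 10**4, 10**5, 10**6):
    ramp_term   = (lam * norm_w_sq / n) ** (1.0 / 3.0)   # shrinks like n^(-1/3)
    probit_term = (lam / n) * (norm_w_sq / 2.0)          # shrinks like 1/n
    print(n, round(ramp_term, 4), round(probit_term, 6))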
Summary
• Well-known surrogate loss functions have natural generalizations to the latent-structural setting.
• Convex surrogate loss functions, such as hinge and log loss, are not consistent for task loss.
• Probit and ramp loss are consistent but seem significantly different in the latent-structural setting.
Future Work
Early empirical evidence suggests that “direct loss” early stopping (without other regularization) works significantly better than L2 regularization for ramp loss.
The direct loss theorem holds for the “toward better” variant of ramp loss. This variant works better in practice.
This provides both empirical and theoretical evidence that the analysis presented here is too weak.