Generalization Bounds and Consistency for Latent-Structural Probit and Ramp Loss
David McAllester TTI-Chicago
Motivation
• Systems are often evaluated with some quantitative performance metric or task loss.
• Here we consider consistency in the sense of minimizing task loss. SVMs are not consistent in this sense.
• Here we show that, unlike hinge loss or log loss, latent-structural probit loss and ramp loss are both consistent.
• We also give finite-sample generalization bounds and point to a variety of supporting empirical work.
Why Surrogate Loss Functions

We consider an arbitrary input space X and a finite label space Y, a source probability distribution over pairs (x, y), and a task loss L with L(y, \hat{y}) \in [0, 1]. We will use a linear classifier with parameter vector w:

\hat{y}_w(x) = \mathrm{argmax}_y \; w^\top \phi(x, y)

We would like

w^* = \mathrm{argmin}_w \; E_{x,y}\left[ L(y, \hat{y}_w(x)) \right]

We get

\hat{w} = \mathrm{argmin}_w \left( \sum_{i=1}^n L_s(w, x_i, y_i) + \frac{\lambda}{2} \|w\|^2 \right)
The surrogate loss L_s must be scale-sensitive and hence different from the task loss L.
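The sketch below (Python; the names data, surrogate, and lam are illustrative, not from the poster) shows the generic shape of this regularized training objective, where surrogate can be any of the scale-sensitive losses defined next.

def regularized_objective(w, data, surrogate, lam):
    # data: list of (x, y) pairs; surrogate(w, x, y): any scale-sensitive surrogate loss L_s
    # returns sum_i L_s(w, x_i, y_i) + (lam / 2) * ||w||^2
    reg = 0.5 * lam * sum(wi * wi for wi in w)
    return sum(surrogate(w, x, y) for x, y in data) + reg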
Standard Surrogate Loss Functions (Binary Case)

y \in \{-1, 1\}

\hat{y}_w(x) = \mathrm{sign}(w^\top \Phi(x)), \qquad \Phi(x, y) = \tfrac{1}{2}\, y\, \Phi(x), \qquad m = y\, w^\top \Phi(x)

L_{\mathrm{log}}(w, x, y) = \ln(1 + e^{-m})
L_{\mathrm{hinge}}(w, x, y) = \max(0, 1 - m)
L_{\mathrm{ramp}}(w, x, y) = \min(1, \max(0, 1 - m))
L_{\mathrm{probit}}(w, x, y) = P_{\epsilon \sim N(0,1)}[\epsilon \ge m] \quad (\text{assuming } \|\Phi(x)\| = 1)
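A minimal numeric sketch of these four binary surrogates (Python; the probit case uses the closed-form standard normal tail, which matches the definition under the stated assumption ||Phi(x)|| = 1):

import math

def margin(w, phi_x, y):
    # m = y * w^T Phi(x)
    return y * sum(wi * xi for wi, xi in zip(w, phi_x))

def log_loss(m):
    return math.log(1.0 + math.exp(-m))

def hinge_loss(m):
    return max(0.0, 1.0 - m)

def ramp_loss(m):
    return min(1.0, max(0.0, 1.0 - m))

def probit_loss(m):
    # P_{eps ~ N(0,1)}[eps >= m]
    return 0.5 * math.erfc(m / math.sqrt(2.0))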
Latent Labels
We now assume a finite set Z of “latent labels”.
\hat{s}_w(x) = \mathrm{argmax}_{(y,z)} \; w^\top \phi(x, y, z)

w^* = \mathrm{argmin}_w \; E_{x,y}\left[ L(y, \hat{s}_w(x)) \right], \qquad L(y, (\hat{y}, \hat{z})) = L(y, \hat{y})
Surrogate Loss Functions

L_{\mathrm{log}}(w, x, y) = \ln \frac{1}{P_w(y \mid x)} = \ln Z_w(x) - \ln Z_w(x, y)

Z_w(x) = \sum_{y,z} \exp(w^\top \phi(x, y, z)), \qquad Z_w(x, y) = \sum_{z} \exp(w^\top \phi(x, y, z))

L_{\mathrm{hinge}}(w, x, y) = \max_s \left( w^\top \phi(x, s) + L(y, s) \right) - \max_z w^\top \phi(x, (y, z))

L_{\mathrm{ramp}}(w, x, y) = \max_s \left( w^\top \phi(x, s) + L(y, s) \right) - \max_s w^\top \phi(x, s)

L_{\mathrm{probit}}(w, x, y) = E_\epsilon\left[ L(y, \hat{s}_{w+\epsilon}(x)) \right], \qquad \epsilon \sim N(0, I)
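As a concrete illustration (a sketch, not the poster's implementation; all names are hypothetical), the Python below evaluates these surrogates over an explicit finite set S of (y, z) pairs, with probit loss estimated by Monte Carlo over Gaussian perturbations.

import math, random

def score(w, feat):
    return sum(wi * fi for wi, fi in zip(w, feat))

def struct_log_loss(w, feats, y, S):
    # feats[s] = phi(x, s) for each s = (y, z) in S; no overflow guards, illustration only
    log_Zx  = math.log(sum(math.exp(score(w, feats[s])) for s in S))
    log_Zxy = math.log(sum(math.exp(score(w, feats[s])) for s in S if s[0] == y))
    return log_Zx - log_Zxy

def struct_hinge_loss(w, feats, y, S, task_loss):
    aug  = max(score(w, feats[s]) + task_loss(y, s[0]) for s in S)
    best = max(score(w, feats[s]) for s in S if s[0] == y)
    return aug - best

def struct_ramp_loss(w, feats, y, S, task_loss):
    aug  = max(score(w, feats[s]) + task_loss(y, s[0]) for s in S)
    best = max(score(w, feats[s]) for s in S)
    return aug - best

def struct_probit_loss(w, feats, y, S, task_loss, n_samples=1000):
    # Monte Carlo estimate of E_eps[ L(y, s_hat_{w+eps}(x)) ] with eps ~ N(0, I)
    total = 0.0
    for _ in range(n_samples):
        wp = [wi + random.gauss(0.0, 1.0) for wi in w]
        s_hat = max(S, key=lambda s: score(wp, feats[s]))
        total += task_loss(y, s_hat[0])
    return total / n_samples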
Citations
L_log: CRFs: J. Lafferty, A. McCallum, and F. Pereira, 2001. Hidden: A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell, 2007.
L_hinge: Struct: B. Taskar, C. Guestrin, and D. Koller, 2003; I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, 2004. Hidden: Chun-Nam John Yu and T. Joachims, 2009.
L_ramp: Binary: R. Collobert, F. H. Sinz, J. Weston, and L. Bottou, 2006. Struct: Chuong B. Do, Quoc Le, Choon Hui Teo, Olivier Chapelle, and Alex Smola, 2008.
L_probit: Joseph Keshet, David McAllester, and Tamir Hazan, ICASSP 2011.
Tighter Upper Bound on Task Loss
Lemma: L_{\mathrm{hinge}}(w, x, y) \ge L_{\mathrm{ramp}}(w, x, y) \ge L(w, x, y), where L(w, x, y) denotes the task loss L(y, \hat{s}_w(x)).

L_{\mathrm{hinge}}(w, x, y) \ge L_{\mathrm{ramp}}(w, x, y) is immediate from the definitions. Furthermore:

L_{\mathrm{ramp}}(w, x, y) = \max_s \left( w^\top \phi(x, s) + L(y, s) \right) - w^\top \phi(x, \hat{s}_w(x))
\ge w^\top \phi(x, \hat{s}_w(x)) + L(y, \hat{s}_w(x)) - w^\top \phi(x, \hat{s}_w(x)) = L(y, \hat{s}_w(x))
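A quick randomized sanity check of the lemma (a sketch reusing the hypothetical struct_* helpers above; a 0-1 task loss on the y-component is assumed):

import random

def task_loss(y, y_hat):
    return 0.0 if y == y_hat else 1.0

random.seed(0)
Y_SET, Z_SET = [0, 1, 2], [0, 1]
S = [(y, z) for y in Y_SET for z in Z_SET]
for _ in range(1000):
    w = [random.gauss(0.0, 1.0) for _ in range(4)]
    feats = {s: [random.gauss(0.0, 1.0) for _ in range(4)] for s in S}
    y = random.choice(Y_SET)
    h = struct_hinge_loss(w, feats, y, S, task_loss)
    r = struct_ramp_loss(w, feats, y, S, task_loss)
    s_hat = max(S, key=lambda s: score(w, feats[s]))
    assert h >= r - 1e-9 and r >= task_loss(y, s_hat[0]) - 1e-9   # hinge >= ramp >= task loss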
Ramp Loss (or Direct Loss) Perceptron Updates

Optimizing L_ramp by subgradient descent yields the following update rule (here we ignore regularization); see the sketch below.

\Delta w \propto \phi(x, \hat{s}_w(x)) - \phi(x, \hat{s}^+_w(x, y)), \qquad \hat{s}^+_w(x, y) = \mathrm{argmax}_s \left( w^\top \phi(x, s) + L(y, s) \right)

Under mild conditions on the probability distribution over pairs (x, y) we have the following [McAllester, Hazan, Keshet, 2010]:

\nabla_w L(w) = \lim_{\alpha \to \infty} \alpha \, E_{x,y}\left[ \phi(x, \hat{s}^+_{\alpha w}(x, y)) - \phi(x, \hat{s}_{\alpha w}(x)) \right]
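A single unregularized update step might look like this sketch (hypothetical names, reusing score, feats, and task_loss from above; eta is a step size):

def ramp_update(w, feats, y, S, task_loss, eta=0.1):
    # s_hat: current argmax s_hat_w(x); s_plus: loss-augmented argmax s_hat^+_w(x, y)
    s_hat  = max(S, key=lambda s: score(w, feats[s]))
    s_plus = max(S, key=lambda s: score(w, feats[s]) + task_loss(y, s[0]))
    # w <- w + eta * ( phi(x, s_hat_w(x)) - phi(x, s_hat^+_w(x, y)) )
    return [wi + eta * (a - b) for wi, a, b in zip(w, feats[s_hat], feats[s_plus])]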
Empirical Results

Methods closely related to ramp loss updates with early stopping have been shown to be effective in machine translation.
• P. Liang, A. Bouchard-Côté, D. Klein, and B. Taskar. COLING/ACL, 2006.
• D. Chiang, K. Knight, and W. Wang. NAACL, 2009.
Ramp loss updates regularized with early stopping (direct loss) have been shown to give improvements over hinge loss in phoneme alignment and phoneme recognition.
• McAllester, Hazan, and Keshet. NIPS, 2010.
• Keshet, Cheng, Stoehr, and McAllester, to appear at Interspeech 2011.
Probit loss has been shown to give an improvement over hinge loss for phoneme recognition.
• Keshet, McAllester, and Hazan. ICASSP, 2011.
Some Notation
L(w) = E_{x,y}\left[ L(w, x, y) \right]

L^* = \inf_w L(w)

\hat{L}^n(w) = \frac{1}{n} \sum_{i=1}^n L(w, x_i, y_i)
Consistency of Probit Loss
We consider the following learning rule, where λ_n is some given function of n.

\hat{w}_n = \mathrm{argmin}_w \left( \hat{L}^n_{\mathrm{probit}}(w) + \frac{\lambda_n}{2n} \|w\|^2 \right)

If
• λ_n increases without bound
• (λ_n \ln n)/n converges to zero
then

\lim_{n \to \infty} L_{\mathrm{probit}}(\hat{w}_n) = L^*
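For example, one schedule satisfying both conditions (not a choice made in the poster):

\lambda_n = \sqrt{n}: \qquad \lambda_n \to \infty, \qquad \frac{\lambda_n \ln n}{n} = \frac{\ln n}{\sqrt{n}} \to 0.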
PAC-Bayesian Bounds [Catoni 07], [Germain, Lacasse, Laviolette, Marchand 09]

For any fixed prior distribution P and fixed λ > 1/2, with probability at least 1 − δ over the draw of the training data, the following holds simultaneously for all Q:

L(Q) \le \frac{1}{1 - \frac{1}{2\lambda}} \left( \hat{L}^n(Q) + \frac{\lambda}{n} \left( \mathrm{KL}(Q, P) + \ln \frac{1}{\delta} \right) \right)

Corollary:

L_{\mathrm{probit}}(w) \le \frac{1}{1 - \frac{1}{2\lambda_n}} \left( \hat{L}^n_{\mathrm{probit}}(w) + \frac{\lambda_n}{n} \left( \frac{\|w\|^2}{2} + \ln \frac{1}{\delta} \right) \right)
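A worked evaluation of the corollary's right-hand side with made-up numbers (purely illustrative values, not results from the poster):

import math

def probit_bound(emp_probit_loss, norm_w_sq, n, lam, delta):
    # right-hand side of the corollary above
    slack = (lam / n) * (norm_w_sq / 2.0 + math.log(1.0 / delta))
    return (emp_probit_loss + slack) / (1.0 - 1.0 / (2.0 * lam))

# e.g. empirical probit loss 0.20, ||w||^2 = 100, n = 10000, lambda = 5, delta = 0.05
print(probit_bound(0.20, 100.0, 10000, 5.0, 0.05))  # roughly 0.25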
Consistency of Ramp Loss
Now we consider the following ramp loss training equation:

\hat{w}_n = \mathrm{argmin}_w \left( \hat{L}^n_{\mathrm{ramp}}(w) + \frac{\gamma_n}{2n} \|w\|^2 \right) \qquad (1)

If
• γ_n / \ln^2 n increases without bound
• γ_n / (n \ln n) converges to zero
then

\lim_{n \to \infty} L_{\mathrm{probit}}((\ln n)\, \hat{w}_n) = L^*
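For example, one schedule satisfying both conditions (again not taken from the poster):

\gamma_n = \ln^3 n: \qquad \frac{\gamma_n}{\ln^2 n} = \ln n \to \infty, \qquad \frac{\gamma_n}{n \ln n} = \frac{\ln^2 n}{n} \to 0.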
Main Lemma
\lim_{\sigma \to 0} L_{\mathrm{probit}}(w/\sigma, x, y) \le L(w, x, y) \le L_{\mathrm{ramp}}(w, x, y)

L_{\mathrm{probit}}(w/\sigma, x, y) \le L_{\mathrm{ramp}}(w, x, y) + \sigma + \sigma \sqrt{8 \ln \frac{|S|}{\sigma}}
Proof of Main Lemma Part I
L_{\mathrm{probit}}(w/\sigma, x, y) \le \sigma + \max_{s:\, m(s) \le M} L(y, s)

where

m(s) = w^\top \Delta\phi(s), \qquad \Delta\phi(s) = \phi(x, \hat{s}_w(x)) - \phi(x, s), \qquad M = \sigma \sqrt{8 \ln \frac{|S|}{\sigma}}

Proof: note that \hat{s}_{w/\sigma + \epsilon}(x) = \hat{s}_{w + \sigma\epsilon}(x), since rescaling the weight vector does not change the argmax. For m(s) > M we have the following.

P[\hat{s}_{w+\sigma\epsilon}(x) = s] \le P[(w + \sigma\epsilon)^\top \Delta\phi(s) \le 0]
= P[-\epsilon^\top \Delta\phi(s) \ge m(s)/\sigma]
\le P_{\epsilon \sim N(0,1)}\left[ \epsilon \ge \frac{M}{2\sigma} \right]
\le \exp\left( -\frac{M^2}{8\sigma^2} \right) = \frac{\sigma}{|S|}

Taking a union bound over the at most |S| such s,

E_\epsilon\left[ L(y, \hat{s}_{w+\sigma\epsilon}(x)) \right] \le P[\exists s:\, m(s) > M,\ \hat{s}_{w+\sigma\epsilon}(x) = s] + \max_{s:\, m(s) \le M} L(y, s)
\le \sigma + \max_{s:\, m(s) \le M} L(y, s)
Proof of Main Lemma Part II
L_{\mathrm{probit}}(w/\sigma, x, y) \le \sigma + \max_{s:\, m(s) \le M} L(y, s)
\le \sigma + \max_{s:\, m(s) \le M} \left( L(y, s) - m(s) + M \right)
\le \sigma + \max_s \left( L(y, s) - m(s) + M \right)
= \sigma + L_{\mathrm{ramp}}(w, x, y) + M
Using the Main Lemma
L_{\mathrm{probit}}(w/\sigma) \le \frac{1}{1 - \frac{1}{2\lambda_n}} \left( \hat{L}^n_{\mathrm{ramp}}(w) + \sigma + \sigma \sqrt{8 \ln \frac{|S|}{\sigma}} + \frac{\lambda_n}{n} \left( \frac{\|w\|^2}{2\sigma^2} + \ln \frac{1}{\delta} \right) \right)

Now take \sigma_n = 1/\ln n and \lambda_n = \gamma_n / \ln^2 n.
A Comparison of Convergence Rates
Optimizing σ as a function of λ_n, \|w\| and n we get (approximately)

\sigma = \left( \frac{\lambda_n \|w\|^2}{n} \right)^{1/3}

which gives

L_{\mathrm{probit}}(w/\sigma) \le \frac{1}{1 - \frac{1}{2\lambda_n}} \left( \hat{L}^n_{\mathrm{ramp}}(w) + \left( \frac{\lambda_n \|w\|^2}{n} \right)^{1/3} \left( \frac{3}{2} + \sqrt{8 \ln \frac{|S|}{\sigma}} \right) + \frac{\lambda_n}{n} \ln \frac{1}{\delta} \right)

which should be contrasted with

L_{\mathrm{probit}}(w) \le \frac{1}{1 - \frac{1}{2\lambda_n}} \left( \hat{L}^n_{\mathrm{probit}}(w) + \frac{\lambda_n}{n} \left( \frac{\|w\|^2}{2} + \ln \frac{1}{\delta} \right) \right)
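A rough arithmetic illustration of the gap in rates, with fixed made-up values of λ_n and ||w||^2 (the n^{-1/3} term from the ramp-based bound versus the n^{-1} term from the probit bound):

lam, norm_w_sq = 2.0, 100.0
for n in (10**3, 10**4, 10**5, 10**6):
    ramp_term   = (lam * norm_w_sq / n) ** (1.0 / 3.0)   # shrinks like n^(-1/3)
    probit_term = (lam / n) * (norm_w_sq / 2.0)          # shrinks like 1/n
    print(n, round(ramp_term, 4), round(probit_term, 6))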
Summary
• Well-known surrogate loss functions have natural generalizations to the latent-structural setting.
• Convex surrogate loss functions, such as hinge and log loss, are not consistent for task loss.
• Probit and ramp loss are consistent but seem significantly different in the latent-structural setting.
Future Work
Early empirical evidence suggests that “direct loss” early stopping (without other regularization) works significantly better than L2 regularization for ramp loss.
The direct loss theorem holds for the “toward better” variant of ramp loss. This variant works better in practice.
This provides both empirical and theoretical evidence that the analysis presented here is too weak.