Examples with importance weights

Sometimes some examples are more important. Importance weights pop up in boosting, differing train/test distributions, active learning, etc. John can reduce everything to importance weighted binary classification.

Principle

Having an example with importance weight h should be equivalent to having the example h times in the dataset.
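To make the principle concrete, here is a minimal sketch (not from the slides; plain C, squared loss, toy values chosen only for illustration) contrasting the naive approach of scaling a single gradient step by h with actually presenting the example h times, recomputing the gradient each time. With a large h the scaled step badly overshoots the label, while the h-copies sequence settles near it; the figures that follow make the same point.

#include <stdio.h>

#define D 3 /* toy dimensionality, an assumption for illustration */

/* One SGD step for squared loss l(p,y) = (y - p)^2 with p = w^T x:
   w <- w - eta * dl/dp * x, where dl/dp = 2*(p - y). */
static void sgd_step(double w[D], const double x[D], double y, double eta) {
    double p = 0.0;
    for (int i = 0; i < D; i++) p += w[i] * x[i];
    double g = 2.0 * (p - y); /* gradient recomputed at the current weights */
    for (int i = 0; i < D; i++) w[i] -= eta * g * x[i];
}

static double predict(const double w[D], const double x[D]) {
    double p = 0.0;
    for (int i = 0; i < D; i++) p += w[i] * x[i];
    return p;
}

int main(void) {
    const double x[D] = {1.0, 2.0, -1.0}; /* so x^T x = 6 */
    const double y = 1.0, eta = 0.1;
    const int h = 6; /* importance weight */

    /* (a) naive: one step with the gradient scaled by h */
    double wa[D] = {0.0, 0.0, 0.0};
    double g = 2.0 * (predict(wa, x) - y);
    for (int i = 0; i < D; i++) wa[i] -= h * eta * g * x[i];

    /* (b) the principle: the example appears h times in the stream */
    double wb[D] = {0.0, 0.0, 0.0};
    for (int k = 0; k < h; k++) sgd_step(wb, x, y, eta);

    printf("label               %.3f\n", y);
    printf("naive scaled step   %.3f\n", predict(wa, x)); /* overshoots */
    printf("h separate steps    %.3f\n", predict(wb, x)); /* close to y */
    return 0;
}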

Learning with importance weights

[Figure sequence: the prediction w_t^T x and the label y lie on the loss curve. A standard gradient step changes the prediction by -η(∇ℓ)^T x. Naively scaling the step by an importance weight of 6, i.e. -6η(∇ℓ)^T x, overshoots y, and it is unclear where w_{t+1}^T x should end up. The importance-aware update instead moves the prediction by s(h)||x||^2, landing between w_t^T x and y without overshooting.]

What is s(·)?

Losses for linear models: $\ell(w^\top x, y)$, so $\nabla_w \ell = \frac{\partial \ell(p,y)}{\partial p}\, x$.

The update must be of the form $w_{t+1} = w_t - s(h)\,x$, where s(h) satisfies

$s(h+\epsilon) = s(h) + \epsilon\,\eta\,\frac{\partial \ell(p,y)}{\partial p}\Big|_{p=(w_t - s(h)x)^\top x}$

i.e. $s'(h) = \eta\,\frac{\partial \ell(p,y)}{\partial p}\Big|_{p=(w_t - s(h)x)^\top x}$

and finally $s(0) = 0$.
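The closed forms on the next slide come from solving this ODE exactly for each loss. As a minimal sketch (not from the slides; the logistic loss and toy values are assumptions), s(h) can also be approximated for any differentiable loss by integrating the ODE with small Euler steps:

#include <math.h>
#include <stdio.h>

/* dl/dp for logistic loss l(p,y) = log(1 + exp(-y*p)), y in {-1,+1}. */
static double dloss_dp(double p, double y) {
    return -y / (1.0 + exp(y * p));
}

/* Approximate s(h) by Euler integration of
   s'(h) = eta * dl/dp evaluated at p = (w_t - s(h) x)^T x = p0 - s(h)*xx,
   where p0 = w_t^T x and xx = x^T x, with s(0) = 0. */
static double s_of_h(double h, double eta, double p0, double xx, double y, int steps) {
    double s = 0.0, dh = h / steps;
    for (int k = 0; k < steps; k++)
        s += dh * eta * dloss_dp(p0 - s * xx, y);
    return s;
}

int main(void) {
    /* toy values (assumptions for illustration) */
    double p0 = 0.0, xx = 6.0, eta = 0.1, y = 1.0, h = 6.0;
    double s = s_of_h(h, eta, p0, xx, y, 10000);
    /* importance-aware update: w_{t+1} = w_t - s(h) x,
       so the prediction moves by -s(h) * x^T x */
    printf("s(h)           = %f\n", s);
    printf("new prediction = %f\n", p0 - s * xx);
    return 0;
}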

Many loss functions

Squared: $\ell(p,y) = (y-p)^2$
  $s(h) = \frac{p-y}{x^\top x}\left(1 - e^{-h\eta x^\top x}\right)$

Logistic: $\ell(p,y) = \log(1 + e^{-yp})$, $y \in \{-1,1\}$
  $s(h) = \frac{W\!\left(e^{h\eta x^\top x + yp + e^{yp}}\right) - h\eta x^\top x - e^{yp}}{y\,x^\top x}$  (W is the Lambert W function)

Exponential: $\ell(p,y) = e^{-yp}$, $y \in \{-1,1\}$
  $s(h) = \frac{yp - \log\!\left(h\eta x^\top x + e^{yp}\right)}{y\,x^\top x}$

Logarithmic: $\ell(p,y) = y\log\frac{y}{p} + (1-y)\log\frac{1-y}{1-p}$, $y \in \{0,1\}$
  $s(h) = \frac{p - 1 + \sqrt{(p-1)^2 + 2h\eta x^\top x}}{x^\top x}$ if $y = 0$
  $s(h) = \frac{p - \sqrt{p^2 + 2h\eta x^\top x}}{x^\top x}$ if $y = 1$

Hellinger: $\ell(p,y) = (\sqrt{p} - \sqrt{y})^2 + (\sqrt{1-p} - \sqrt{1-y})^2$, $y \in \{0,1\}$
  $s(h) = \frac{p - 1 + \frac{1}{4}\left(12 h\eta x^\top x + 8(1-p)^{3/2}\right)^{2/3}}{x^\top x}$ if $y = 0$
  $s(h) = \frac{p - \frac{1}{4}\left(12 h\eta x^\top x + 8 p^{3/2}\right)^{2/3}}{x^\top x}$ if $y = 1$

Hinge: $\ell(p,y) = \max(0, 1 - yp)$, $y \in \{-1,1\}$
  $s(h) = -y \min\!\left(h\eta,\ \frac{1 - yp}{x^\top x}\right)$

τ-Quantile: $\ell(p,y) = \tau(y - p)$ if $y > p$, $(1-\tau)(p - y)$ if $y \le p$
  $s(h) = -\tau \min\!\left(h\eta,\ \frac{y - p}{\tau x^\top x}\right)$ if $y > p$
  $s(h) = (1-\tau) \min\!\left(h\eta,\ \frac{p - y}{(1-\tau) x^\top x}\right)$ if $y \le p$
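As a minimal sketch (not the VW source; the helper names and toy values are mine), here are two of the closed forms from the table implemented directly. Each returns s(h), and the weight update is then w_{t+1} = w_t - s(h) x, so the prediction w^T x moves by -s(h) x^T x.

#include <math.h>
#include <stdio.h>

/* Closed-form s(h) for squared loss l(p,y) = (y - p)^2, from the table:
   s(h) = (p - y)/(x^T x) * (1 - exp(-h*eta*x^T x)). */
static double s_squared(double h, double eta, double p, double y, double xx) {
    return (p - y) / xx * (1.0 - exp(-h * eta * xx));
}

/* Closed-form s(h) for hinge loss l(p,y) = max(0, 1 - y*p), y in {-1,+1}:
   s(h) = -y * min(h*eta, (1 - y*p)/(x^T x)). */
static double s_hinge(double h, double eta, double p, double y, double xx) {
    double cap = (1.0 - y * p) / xx;
    return -y * fmin(h * eta, cap);
}

int main(void) {
    double p = 0.0, y = 1.0, xx = 6.0, eta = 0.1, h = 6.0; /* toy values */

    double s1 = s_squared(h, eta, p, y, xx);
    double s2 = s_hinge(h, eta, p, y, xx);

    /* the importance-aware update is w <- w - s(h) x,
       so the prediction w^T x moves by -s(h) * x^T x */
    printf("squared: s(h) = %f, new prediction = %f\n", s1, p - s1 * xx);
    printf("hinge:   s(h) = %f, new prediction = %f\n", s2, p - s2 * xx);
    return 0;
}

With these toy values the hinge update stops exactly at the margin (h*eta exceeds (1 - yp)/x^T x, so the cap is taken), and the squared-loss update approaches the label without crossing it.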

Robust results for unweighted problems

[Four scatter plots comparing the standard update (vertical axis, "standard") against the importance-aware update (horizontal axis, "importance aware"): astro - logistic loss, spam - quantile loss, webspam - hinge loss, rcv1 - squared loss.]

And now something completely different

Adaptive, individual learning rates in VW. It's really gradient descent run separately on each coordinate i, with

$\eta_{t,i} = \frac{1}{\sqrt{\sum_{s=1}^{t} \left(\frac{\partial \ell(w_s^\top x_s, y_s)}{\partial w_{s,i}}\right)^2}}$

Coordinate-wise scaling of the data becomes less of an issue; this can be stated formally (Duchi, Hazan, and Singer / McMahan and Streeter, COLT 2010).
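A minimal sketch of this rule (plain C, dense toy version with squared loss; not the actual VW implementation): each coordinate keeps a running sum of its squared gradients and divides its step by the square root of that sum. A global factor eta is included; setting eta = 1 matches the formula above.

#include <math.h>
#include <stdio.h>

#define D 4 /* toy dimensionality (assumption) */

/* One adaptive step for squared loss l(p,y) = (y - p)^2 with p = w^T x.
   G[i] accumulates the squared per-coordinate gradients, so the effective
   rate on coordinate i is eta / sqrt(G[i]) -- the eta_{t,i} above. */
static void adaptive_step(double w[D], double G[D],
                          const double x[D], double y, double eta) {
    double p = 0.0;
    for (int i = 0; i < D; i++) p += w[i] * x[i];
    double dldp = 2.0 * (p - y);
    for (int i = 0; i < D; i++) {
        double g = dldp * x[i];            /* gradient w.r.t. w_i */
        G[i] += g * g;                     /* "stored near w_i" in VW */
        if (G[i] > 0.0)
            w[i] -= eta * g / sqrt(G[i]);
    }
}

int main(void) {
    double w[D] = {0.0}, G[D] = {0.0};
    /* two toy examples whose coordinates live on very different scales */
    const double x1[D] = {1.0, 100.0, 0.0, 0.5};
    const double x2[D] = {0.5, 200.0, 1.0, 0.0};
    adaptive_step(w, G, x1, 1.0, 0.5);
    adaptive_step(w, G, x2, -1.0, 0.5);
    printf("w = [%f %f %f %f]\n", w[0], w[1], w[2], w[3]);
    return 0;
}

Scaling a feature by a constant scales both its gradients and the accumulated sqrt(G[i]) by the same factor, so the per-coordinate step no longer blows up for large-scale features the way a single global rate would.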

Some tricks involved

Store the sum of squared gradients w.r.t. w_i near w_i.

/* Fast approximate 1/sqrt(x): a bit-level initial guess followed by
   one Newton-Raphson refinement step. */
float InvSqrt(float x){
  float xhalf = 0.5f * x;
  int i = *(int*)&x;            /* reinterpret the float's bits as an int */
  i = 0x5f3759d5 - (i >> 1);    /* magic constant gives the initial estimate */
  x = *(float*)&i;
  x = x*(1.5f - xhalf*x*x);     /* one Newton step: x *= 1.5 - 0.5*a*x^2 */
  return x;
}

The special SSE rsqrt instruction is a little better.
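For reference (not from the slides), the SSE instruction mentioned above is reachable through the rsqrtss intrinsic; a minimal sketch:

#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32 */

/* Approximate 1/sqrt(x) with the hardware rsqrtss instruction
   (about 12 bits of relative precision). */
static inline float rsqrt_sse(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

As with InvSqrt above, one Newton step, x*(1.5f - 0.5f*a*x*x), recovers most of single precision if the raw estimate is not accurate enough.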

Experiments

Raw data:
./vw --adaptive -b 24 --compressed -d tmp/spam_train.gz
  average loss = 0.02878
./vw -b 24 --compressed -d tmp/spam_train.gz -l 100
  average loss = 0.03267

TFIDF-scaled data:
./vw --adaptive -b 24 --compressed -d tmp/rcv1_train.gz -l 1
  average loss = 0.04079
./vw -b 24 --compressed -d tmp/rcv1_train.gz -l 256
  average loss = 0.04465