Examples with importance weights

Sometimes some examples are more important. Importance weights pop up in: boosting, differing train/test distributions, active learning, etc. John can reduce everything to importance weighted binary classification.

Principle

Having an example with importance weight h should be equivalent to having the example h times in the dataset.
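One literal reading of that principle: an example with integer weight h should be processed like h copies seen in sequence, i.e. h successive gradient steps that each re-evaluate the prediction, not a single step with the gradient multiplied by h. A tiny sketch of that reading, with made-up names and squared loss as the running example:

    #include <vector>

    // Naive reading of the principle: treat an example with integer weight h
    // as h copies, i.e. h successive gradient steps, each re-evaluating the
    // prediction. (Illustrative only; the slides derive a closed form s(h)
    // that takes this idea to the continuous limit.)
    void update_h_times(std::vector<double>& w, const std::vector<double>& x,
                        double y, int h, double eta) {
        for (int k = 0; k < h; ++k) {
            double p = 0.0;
            for (size_t i = 0; i < w.size(); ++i) p += w[i] * x[i];
            double dl_dp = 2.0 * (p - y);            // squared loss (y - p)^2
            for (size_t i = 0; i < w.size(); ++i) w[i] -= eta * dl_dp * x[i];
        }
    }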
Learning with importance weights

[Figure: the loss as a function of the prediction $w^\top x$ for a single example with label $y$. One gradient step moves the prediction from $w_t^\top x$ by $-\eta(\nabla\ell)^\top x$ to $w_{t+1}^\top x$. Simply scaling the step by the importance weight, e.g. $-6\eta(\nabla\ell)^\top x$, can overshoot the minimum of the loss ($w_{t+1}^\top x$ ??). The importance-aware update instead moves the prediction by $s(h)\,\|x\|^2$.]
What is s(·)?

Losses for linear models: $\ell(w^\top x, y)$, so $\nabla_w \ell = \frac{\partial \ell(p,y)}{\partial p}\,x$.

The update must therefore be of the form
$$w_{t+1} = w_t - s(h)\,x.$$

$s(h)$ must satisfy
$$s(h+\epsilon) = s(h) + \epsilon\,\eta\,\frac{\partial \ell(p,y)}{\partial p}\bigg|_{p=(w_t - s(h)x)^\top x},$$
i.e.
$$s'(h) = \eta\,\frac{\partial \ell(p,y)}{\partial p}\bigg|_{p=(w_t - s(h)x)^\top x}.$$
Finally, $s(0) = 0$.
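To make the recipe concrete, here is a minimal sketch of an importance-aware update, not VW's implementation: the helper names (dloss_dp, s_of_h, update) are invented for illustration, and s(h) is approximated by Euler integration of the ODE above, a stand-in for the closed forms listed in the table below.

    #include <cmath>
    #include <vector>

    // dl/dp for squared loss l(p, y) = (y - p)^2.
    double dloss_dp(double p, double y) { return 2.0 * (p - y); }

    // Approximate s(h) by integrating s'(h) = eta * dl/dp |_{p = (w_t - s(h) x)^T x}
    // with n Euler steps, starting from s(0) = 0.
    double s_of_h(double p0, double xx, double y, double h, double eta, int n = 1000) {
        double s = 0.0, dh = h / n;
        for (int i = 0; i < n; ++i) {
            double p = p0 - s * xx;          // prediction after moving by s along x
            s += dh * eta * dloss_dp(p, y);  // follow the ODE
        }
        return s;
    }

    // Importance-aware update w <- w - s(h) x
    // (versus the naive w <- w - h * eta * dl/dp * x, which can overshoot).
    void update(std::vector<double>& w, const std::vector<double>& x,
                double y, double h, double eta) {
        double p = 0.0, xx = 0.0;
        for (size_t i = 0; i < w.size(); ++i) { p += w[i] * x[i]; xx += x[i] * x[i]; }
        double s = s_of_h(p, xx, y, h, eta);
        for (size_t i = 0; i < w.size(); ++i) w[i] -= s * x[i];
    }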
Many loss functions

For each loss $\ell(p,y)$, the importance-aware update $s(h)$ is:

Squared: $\ell(p,y) = (y-p)^2$
  $s(h) = \frac{p-y}{x^\top x}\left(1 - e^{-h\eta\, x^\top x}\right)$

Logistic ($y \in \{-1,1\}$): $\ell(p,y) = \log(1 + e^{-yp})$
  $s(h) = \frac{W\!\left(e^{h\eta\, x^\top x + yp + e^{yp}}\right) - h\eta\, x^\top x - e^{yp}}{y\, x^\top x}$   ($W$ is the Lambert W function)

Exponential ($y \in \{-1,1\}$): $\ell(p,y) = e^{-yp}$
  $s(h) = \frac{py - \log\!\left(h\eta\, x^\top x + e^{py}\right)}{y\, x^\top x}$

Logarithmic: $\ell(p,y) = y\log\frac{y}{p} + (1-y)\log\frac{1-y}{1-p}$
  $s(h) = \frac{p - 1 + \sqrt{(p-1)^2 + 2h\eta\, x^\top x}}{x^\top x}$ if $y = 0$
  $s(h) = \frac{p - \sqrt{p^2 + 2h\eta\, x^\top x}}{x^\top x}$ if $y = 1$

Hellinger: $\ell(p,y) = (\sqrt{p}-\sqrt{y})^2 + (\sqrt{1-p}-\sqrt{1-y})^2$
  $s(h) = \frac{p - 1 + \frac{1}{4}\left(12h\eta\, x^\top x + 8(1-p)^{3/2}\right)^{2/3}}{x^\top x}$ if $y = 0$
  $s(h) = \frac{p - \frac{1}{4}\left(12h\eta\, x^\top x + 8p^{3/2}\right)^{2/3}}{x^\top x}$ if $y = 1$

Hinge ($y \in \{-1,1\}$): $\ell(p,y) = \max(0, 1 - yp)$
  $s(h) = -y \min\!\left(h\eta,\ \frac{1-yp}{x^\top x}\right)$

$\tau$-Quantile: $\ell(p,y) = \tau(y-p)$ if $y > p$; $(1-\tau)(p-y)$ if $y \le p$
  $s(h) = -\tau \min\!\left(h\eta,\ \frac{y-p}{\tau\, x^\top x}\right)$ if $y > p$
  $s(h) = (1-\tau)\min\!\left(h\eta,\ \frac{p-y}{(1-\tau)\, x^\top x}\right)$ if $y \le p$
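As a sanity check on how these closed forms are used, here is a short sketch of two of the entries above; the function names are illustrative, not VW's, with p = $w_t^\top x$, xx = $x^\top x$, h the importance weight, and eta the learning rate.

    #include <algorithm>
    #include <cmath>

    // Squared loss (y - p)^2: s(h) from the table above.
    double s_squared(double p, double y, double h, double eta, double xx) {
        return (p - y) / xx * (1.0 - std::exp(-h * eta * xx));
    }

    // Hinge loss max(0, 1 - yp), y in {-1, +1}: s(h) from the table above.
    // When the margin is already satisfied (1 - yp <= 0) the gradient is zero,
    // so the update is zero.
    double s_hinge(double p, double y, double h, double eta, double xx) {
        if (1.0 - y * p <= 0.0) return 0.0;
        return -y * std::min(h * eta, (1.0 - y * p) / xx);
    }

Either value is then used exactly as before: $w_{t+1} = w_t - s(h)\,x$.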
Robust results for unweighted problems

[Four scatter plots comparing the standard update (y-axis, "standard") against the importance-aware update (x-axis, "importance aware"), both axes spanning roughly 0.9 to 1.0: astro - logistic loss; spam - quantile loss; rcv1 - squared loss; webspam - hinge loss.]
And now something completely different

Adaptive, individual learning rates in VW. It's really GD separately on each coordinate $i$ with
$$\eta_{t,i} = \frac{1}{\sqrt{\sum_{s=1}^{t} \left(\frac{\partial \ell(w_s^\top x_s,\, y_s)}{\partial w_{s,i}}\right)^2}}.$$
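A minimal sketch of this per-coordinate rule, in the AdaGrad spirit of the references below; the variable names are mine, squared loss is assumed for the gradient, and the small eps guard against division by zero is an addition not on the slide.

    #include <cmath>
    #include <vector>

    // Per-coordinate adaptive learning rates: G[i] accumulates squared gradients
    // for coordinate i, and each coordinate takes a step of size 1/sqrt(G[i]).
    void adaptive_step(std::vector<double>& w, std::vector<double>& G,
                       const std::vector<double>& x, double y, double eps = 1e-8) {
        double p = 0.0;
        for (size_t i = 0; i < w.size(); ++i) p += w[i] * x[i];
        double dl_dp = 2.0 * (p - y);              // dl/dp for squared loss (y - p)^2
        for (size_t i = 0; i < w.size(); ++i) {
            double g = dl_dp * x[i];               // gradient w.r.t. w_i
            G[i] += g * g;                         // running sum of squared gradients
            w[i] -= g / std::sqrt(G[i] + eps);     // eta_{t,i} = 1 / sqrt(sum of g^2)
        }
    }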
Coordinate-wise scaling of the data becomes less of an issue. This can be stated formally (Duchi, Hazan, and Singer / McMahan and Streeter, COLT 2010).
Some tricks involved

Store the sum of squared gradients with respect to $w_i$ near $w_i$.

    // Fast approximate 1/sqrt(x): bit-level initial guess plus one Newton step.
    float InvSqrt(float x) {
        float xhalf = 0.5f * x;
        int i = *(int*)&x;               // reinterpret the float's bits as an int
        i = 0x5f3759d5 - (i >> 1);       // magic constant gives the initial guess
        x = *(float*)&i;
        x = x * (1.5f - xhalf * x * x);  // one Newton-Raphson refinement
        return x;
    }

The special SSE rsqrt instruction is a little better.
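For reference, that SSE alternative is reachable through the _mm_rsqrt_ss intrinsic; a small sketch (not VW's code):

    #include <xmmintrin.h>  // SSE intrinsics

    // Approximate 1/sqrt(x) with the hardware rsqrtss instruction.
    float rsqrt_sse(float x) {
        return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    }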
Experiments

Raw data:

    ./vw --adaptive -b 24 --compressed -d tmp/spam_train.gz
      average loss = 0.02878
    ./vw -b 24 --compressed -d tmp/spam_train.gz -l 100
      average loss = 0.03267

TFIDF scaled data:

    ./vw --adaptive -b 24 --compressed -d tmp/rcv1_train.gz -l 1
      average loss = 0.04079
    ./vw -b 24 --compressed -d tmp/rcv1_train.gz -l 256
      average loss = 0.04465