Foundations of Machine Learning
Lecture 9: Ranking

Mehryar Mohri
Courant Institute and Google Research
[email protected]
Motivation

Very large data sets:
• too large to display or process;
• limited resources, need priorities;
• ranking more desirable than classification.

Applications:
• search engines, information extraction;
• decision making, auctions, fraud detection.

Can we learn to predict rankings accurately?
Related Problem

Rank aggregation: given $n$ candidates and $k$ voters, each providing a ranking of the candidates, find an ordering as close as possible to these rankings.
• closeness measured by the number of pairwise misrankings;
• the problem is NP-hard even for $k = 4$ (Dwork et al., 2001).
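As an illustration of the closeness measure, here is a minimal sketch (not from the slides) that counts pairwise misrankings between two rankings of the same candidates; the function name and the brute-force $O(n^2)$ loop are choices made here.

```python
# Illustrative sketch: counting pairwise misrankings (Kendall tau distance)
# between two rankings given as ordered lists of candidates.
from itertools import combinations

def pairwise_misrankings(ranking_a, ranking_b):
    """Number of candidate pairs ordered differently by the two rankings."""
    pos_a = {c: i for i, c in enumerate(ranking_a)}
    pos_b = {c: i for i, c in enumerate(ranking_b)}
    return sum(
        1
        for u, v in combinations(ranking_a, 2)
        if (pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) < 0
    )

print(pairwise_misrankings(["a", "b", "c", "d"], ["b", "a", "d", "c"]))  # 2
```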
This Talk

• Score-based ranking
• Preference-based ranking
Score-Based Setting

Single stage: the learning algorithm receives a labeled sample of pairwise preferences and returns a scoring function $h \colon U \to \mathbb{R}$.

Drawbacks:
• $h$ induces a linear ordering of the full set $U$;
• does not match a query-based scenario.

Advantages:
• efficient algorithms;
• good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07).
Score-Based Ranking

Training data: sample of i.i.d. labeled pairs drawn from $U \times U$ according to some distribution $D$,
$$S = \big((x_1, x'_1, y_1), \ldots, (x_m, x'_m, y_m)\big) \in (U \times U \times \{-1, 0, +1\})^m,$$
with
$$y_i = \begin{cases} +1 & \text{if } x'_i >_{\text{pref}} x_i \\ 0 & \text{if } x'_i =_{\text{pref}} x_i \text{ or no information} \\ -1 & \text{if } x'_i <_{\text{pref}} x_i. \end{cases}$$

Empirical margin loss for ranking: for $\rho > 0$,
$$\widehat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^m \Phi_\rho\big(y_i\, [h(x'_i) - h(x_i)]\big),$$
which is upper-bounded by
$$\widehat{R}_\rho(h) \le \frac{1}{m} \sum_{i=1}^m 1_{y_i [h(x'_i) - h(x_i)] \le \rho}.$$
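A minimal sketch (not from the slides) of the empirical pairwise margin loss above and its 0-1 upper bound; the helper names and the piecewise form used for $\Phi_\rho$ are assumptions made here, and pairs with $y_i = 0$ are treated literally as in the displayed formula.

```python
# Illustrative sketch: empirical pairwise margin loss for a scoring function h
# on labeled pairs (x_i, x'_i, y_i).
import numpy as np

def phi_rho(z, rho):
    """Margin loss: 1 if z <= 0, 1 - z/rho if 0 <= z <= rho, 0 if z >= rho."""
    return np.clip(1.0 - z / rho, 0.0, 1.0)

def empirical_margin_loss(h, pairs, rho=1.0):
    """pairs: iterable of (x, x_prime, y) with y in {-1, 0, +1}; h: callable score.
    Returns (margin loss, fraction of pairs with margin at most rho)."""
    margins = np.array([y * (h(xp) - h(x)) for x, xp, y in pairs])
    return phi_rho(margins, rho).mean(), float((margins <= rho).mean())
```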
Marginal Rademacher Complexities

Distributions:
• $D_1$: marginal distribution with respect to the first element of the pairs;
• $D_2$: marginal distribution with respect to the second element of the pairs.

Samples:
$$S_1 = \big((x_1, y_1), \ldots, (x_m, y_m)\big), \qquad S_2 = \big((x'_1, y_1), \ldots, (x'_m, y_m)\big).$$

Marginal Rademacher complexities:
$$\mathfrak{R}_m^{D_1}(H) = \operatorname{E}\big[\widehat{\mathfrak{R}}_{S_1}(H)\big], \qquad \mathfrak{R}_m^{D_2}(H) = \operatorname{E}\big[\widehat{\mathfrak{R}}_{S_2}(H)\big].$$
Ranking Margin Bound
(Boyd, Cortes, MM, and Radovanovic, 2012; MM, Rostamizadeh, and Talwalkar, 2012)

Theorem: let $H$ be a family of real-valued functions. Fix $\rho > 0$; then, for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample of size $m$, the following holds for all $h \in H$:
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\Big(\mathfrak{R}_m^{D_1}(H) + \mathfrak{R}_m^{D_2}(H)\Big) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$
Ranking with SVMs
(see for example Joachims, 2002)

Optimization problem: direct application of SVMs,
$$\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i$$
subject to: $y_i\, w \cdot \big(\Phi(x'_i) - \Phi(x_i)\big) \ge 1 - \xi_i$, $\;\xi_i \ge 0$, for all $i \in [1, m]$.

Decision function:
$$h \colon x \mapsto w \cdot \Phi(x) + b.$$
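A minimal sketch of the pairwise-SVM formulation above, assuming the identity feature map $\Phi(x) = x$ and using scikit-learn's LinearSVC as the solver; the helper names are ours, not from the lecture. The bias $b$ cancels in pairwise differences, so no intercept is fit.

```python
# Illustrative sketch: linear RankSVM trained on pairwise difference vectors.
import numpy as np
from sklearn.svm import LinearSVC

def fit_rank_svm(X, X_prime, y, C=1.0):
    """X, X_prime: arrays of shape (m, d); y: labels in {-1, 0, +1}.
    Trains on difference vectors Phi(x') - Phi(x), as in the constraints."""
    X, X_prime, y = np.asarray(X), np.asarray(X_prime), np.asarray(y)
    keep = y != 0                        # drop pairs with no preference
    diffs = X_prime[keep] - X[keep]
    svm = LinearSVC(C=C, fit_intercept=False)
    svm.fit(diffs, y[keep])
    return svm.coef_.ravel()             # weight vector w

def score(w, x):
    """Scoring function h(x) = w . x (higher score means ranked higher)."""
    return float(np.dot(w, x))
```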
Notes

The algorithm coincides with SVMs using the feature mapping
$$\Psi \colon (x, x') \mapsto \Phi(x') - \Phi(x).$$

Can be used with kernels:
$$K'\big((x_i, x'_i), (x_j, x'_j)\big) = \Psi(x_i, x'_i) \cdot \Psi(x_j, x'_j)
= K(x_i, x_j) + K(x'_i, x'_j) - K(x'_i, x_j) - K(x_i, x'_j).$$

The algorithm is directly based on the margin bound.
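A short sketch (not from the slides) of building the pairwise kernel $K'$ from a base kernel $K$, following the identity above; the Gaussian base kernel and the function names are choices made here.

```python
# Illustrative sketch: pairwise kernel K' derived from a base kernel K.
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    return float(np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def pairwise_kernel(pair_i, pair_j, K=gaussian_kernel):
    """K'((x_i, x'_i), (x_j, x'_j)) = K(x_i, x_j) + K(x'_i, x'_j)
                                      - K(x'_i, x_j) - K(x_i, x'_j)."""
    (xi, xpi), (xj, xpj) = pair_i, pair_j
    return K(xi, xj) + K(xpi, xpj) - K(xpi, xj) - K(xi, xpj)
```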
Boosting for Ranking

Use a weak ranking algorithm to create a stronger ranking algorithm.

Ensemble method: combine base rankers returned by the weak ranking algorithm.
• Finding simple, relatively accurate base rankers is often not hard.
• How should base rankers be combined?
CD RankBoost
(Freund et al., 2003; Rudin et al., 2005)

$H \subseteq \{0, 1\}^X$. For $s \in \{-1, 0, +1\}$, define
$$\epsilon_t^s(h) = \Pr_{(x, x') \sim D_t}\Big[\operatorname{sgn}\big(f(x, x')\,(h(x') - h(x))\big) = s\Big],
\qquad \epsilon_t^- + \epsilon_t^0 + \epsilon_t^+ = 1.$$

RankBoost(S = ((x_1, x'_1, y_1), ..., (x_m, x'_m, y_m))):
    for i ← 1 to m do
        D_1(x_i, x'_i) ← 1/m
    for t ← 1 to T do
        h_t ← base ranker in H with smallest ε_t^- − ε_t^+ = −E_{i∼D_t}[ y_i (h_t(x'_i) − h_t(x_i)) ]
        α_t ← (1/2) log(ε_t^+ / ε_t^-)
        Z_t ← ε_t^0 + 2 (ε_t^+ ε_t^-)^{1/2}        ▹ normalization factor
        for i ← 1 to m do
            D_{t+1}(x_i, x'_i) ← D_t(x_i, x'_i) exp(−α_t y_i [h_t(x'_i) − h_t(x_i)]) / Z_t
    f ← Σ_{t=1}^T α_t h_t
    return f
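A minimal sketch (not from the slides) of the RankBoost updates above for $\{0,1\}$-valued threshold base rankers on feature vectors; the helper names, the candidate-threshold scheme, and the epsilon smoothing in the $\alpha_t$ formula are choices made here.

```python
# Illustrative sketch: RankBoost with decision-stump base rankers h(x) = 1[x_j > theta].
import numpy as np

def rankboost(X, X_prime, y, T=20, eps=1e-12):
    """X, X_prime: (m, d) arrays of pair elements; y: (m,) array in {-1, +1}.
    Returns a list of (alpha_t, feature_j, threshold) stumps."""
    X, X_prime, y = np.asarray(X, float), np.asarray(X_prime, float), np.asarray(y)
    m, d = X.shape
    D = np.full(m, 1.0 / m)                       # D_1 uniform over pairs
    stumps = []
    for _ in range(T):
        best = None
        # pick the stump maximizing eps_t^+ - eps_t^- under the current D_t
        for j in range(d):
            for theta in np.unique(np.concatenate([X[:, j], X_prime[:, j]])):
                diff = (X_prime[:, j] > theta).astype(int) - (X[:, j] > theta).astype(int)
                e_plus = D[(y * diff) > 0].sum()   # correctly ranked mass
                e_minus = D[(y * diff) < 0].sum()  # misranked mass
                if best is None or e_plus - e_minus > best[0]:
                    best = (e_plus - e_minus, e_plus, e_minus, j, theta, diff)
        _, e_plus, e_minus, j, theta, diff = best
        alpha = 0.5 * np.log((e_plus + eps) / (e_minus + eps))
        D = D * np.exp(-alpha * y * diff)          # increase weight of misranked pairs
        D /= D.sum()                               # normalization (the role of Z_t)
        stumps.append((alpha, j, theta))
    return stumps

def rank_score(stumps, x):
    """Final scoring function f(x) = sum_t alpha_t h_t(x)."""
    return sum(alpha * float(x[j] > theta) for alpha, j, theta in stumps)
```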
Notes

Distributions $D_t$ over pairs of sample points:
• originally uniform;
• at each round, the weight of a misclassified example is increased;
• observation: with $f_t = \sum_{s=1}^t \alpha_s h_s$,
$$D_{t+1}(x, x') = \frac{e^{-y [f_t(x') - f_t(x)]}}{|S| \prod_{s=1}^t Z_s},
\quad \text{since} \quad
D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-y \alpha_t [h_t(x') - h_t(x)]}}{Z_t}
= \frac{1}{|S|} \frac{e^{-y \sum_{s=1}^t \alpha_s [h_s(x') - h_s(x)]}}{\prod_{s=1}^t Z_s}.$$

Weight assigned to base classifier $h_t$: $\alpha_t$ directly depends on the accuracy of $h_t$ at round $t$.
Coordinate Descent RankBoost

Objective function: convex and differentiable,
$$F(\alpha) = \sum_{(x, x', y) \in S} e^{-y [f_T(x') - f_T(x)]}
= \sum_{(x, x', y) \in S} \exp\Big(-y \sum_{t=1}^T \alpha_t [h_t(x') - h_t(x)]\Big),$$
with $f_T = \sum_{t=1}^T \alpha_t h_t$.

(Figure: the exponential function upper-bounds the 0-1 pairwise loss.)
Direction: unit vector $e_t$ with
$$e_t = \operatorname*{argmin}_t \left. \frac{dF(\alpha + \eta\, e_t)}{d\eta} \right|_{\eta = 0}.$$

Since $F(\alpha + \eta\, e_t) = \sum_{(x, x', y) \in S} e^{-y \sum_{s=1}^T \alpha_s [h_s(x') - h_s(x)]}\, e^{-\eta\, y [h_t(x') - h_t(x)]}$,
$$\begin{aligned}
\left. \frac{dF(\alpha + \eta\, e_t)}{d\eta} \right|_{\eta = 0}
&= -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \exp\Big(-y \sum_{s=1}^T \alpha_s [h_s(x') - h_s(x)]\Big) \\
&= -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)]\, D_{T+1}(x, x')\, \Big[m \prod_{s=1}^T Z_s\Big] \\
&= -\big[\epsilon_t^+ - \epsilon_t^-\big]\, m \prod_{s=1}^T Z_s.
\end{aligned}$$

Thus, the direction corresponds to the base classifier selected by the algorithm.
Step size: obtained by setting
$$\frac{dF(\alpha + \eta\, e_t)}{d\eta} = 0:$$
$$\begin{aligned}
&-\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \exp\Big(-y \sum_{s=1}^T \alpha_s [h_s(x') - h_s(x)]\Big)\, e^{-\eta\, y [h_t(x') - h_t(x)]} = 0 \\
\Leftrightarrow\ &-\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)]\, D_{T+1}(x, x') \Big[m \prod_{s=1}^T Z_s\Big] e^{-\eta\, y [h_t(x') - h_t(x)]} = 0 \\
\Leftrightarrow\ &-\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)]\, D_{T+1}(x, x')\, e^{-\eta\, y [h_t(x') - h_t(x)]} = 0 \\
\Leftrightarrow\ &-\big[\epsilon_t^+ e^{-\eta} - \epsilon_t^- e^{\eta}\big] = 0 \\
\Leftrightarrow\ &\eta = \frac{1}{2} \log \frac{\epsilon_t^+}{\epsilon_t^-}.
\end{aligned}$$

Thus, the step size matches the base classifier weight used in the algorithm.
Bipartite Ranking

Training data:
• sample of negative points drawn according to $D_-$: $\;S_- = (x_1, \ldots, x_m) \in U^m$;
• sample of positive points drawn according to $D_+$: $\;S_+ = (x'_1, \ldots, x'_{m'}) \in U^{m'}$.

Problem: find a hypothesis $h \colon U \to \mathbb{R}$ in $H$ with small generalization error
$$R_D(h) = \Pr_{x \sim D_-,\; x' \sim D_+}\big[h(x') < h(x)\big].$$
Notes

Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05):
• if a constant base ranker is used;
• relationship between objective functions.

More efficient algorithm in this special case (Freund et al., 2003).

Bipartite ranking results are typically reported in terms of the AUC.
AdaBoost and CD RankBoost

Objective functions: comparison.
$$\begin{aligned}
F_{\text{Ada}}(\alpha) &= \sum_{x_i \in S_- \cup S_+} \exp(-y_i f(x_i)) \\
&= \sum_{x_i \in S_-} \exp(+f(x_i)) + \sum_{x_i \in S_+} \exp(-f(x_i)) = F_-(\alpha) + F_+(\alpha).
\end{aligned}$$
$$\begin{aligned}
F_{\text{Rank}}(\alpha) &= \sum_{(i, j) \in S_- \times S_+} \exp\big(-[f(x_j) - f(x_i)]\big) \\
&= \Big(\sum_{x_i \in S_-} \exp(+f(x_i))\Big)\Big(\sum_{x_j \in S_+} \exp(-f(x_j))\Big) = F_-(\alpha)\, F_+(\alpha).
\end{aligned}$$
AdaBoost and CD RankBoost
(Rudin et al., 2005)

Property: AdaBoost (non-separable case) with the constant base learner $h = 1$ gives an equal contribution of positive and negative points (in the limit).
• consequence: AdaBoost asymptotically achieves the optimum of the CD RankBoost objective.

Observation: if $F_+(\alpha) = F_-(\alpha)$, then
$$d(F_{\text{Rank}}) = F_+\, d(F_-) + F_-\, d(F_+) = F_+ \big[d(F_-) + d(F_+)\big] = F_+\, d(F_{\text{Ada}}).$$
Bipartite RankBoost - Efficiency

Decomposition of the distribution: for $(x, x') \in (S_-, S_+)$, $\;D(x, x') = D_-(x)\, D_+(x')$.

Thus,
$$D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-\alpha_t [h_t(x') - h_t(x)]}}{Z_t}
= \frac{D_{t,-}(x)\, e^{\alpha_t h_t(x)}}{Z_{t,-}} \cdot \frac{D_{t,+}(x')\, e^{-\alpha_t h_t(x')}}{Z_{t,+}},$$
with
$$Z_{t,-} = \sum_{x \in S_-} D_{t,-}(x)\, e^{\alpha_t h_t(x)}, \qquad
Z_{t,+} = \sum_{x' \in S_+} D_{t,+}(x')\, e^{-\alpha_t h_t(x')}.$$
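A minimal sketch (not from the slides) of one round of the bipartite weight update above, maintaining the two marginal distributions separately instead of one weight per pair; the function and argument names are choices made here.

```python
# Illustrative sketch: bipartite RankBoost update using the product decomposition.
import numpy as np

def bipartite_update(D_neg, D_pos, h_neg, h_pos, alpha):
    """D_neg, D_pos: current marginal weights over S_- and S_+ (each summing to 1);
    h_neg, h_pos: base-ranker values h_t(x) on S_- and h_t(x') on S_+, in {0, 1}."""
    new_neg = D_neg * np.exp(+alpha * h_neg)
    new_pos = D_pos * np.exp(-alpha * h_pos)
    Z_neg, Z_pos = new_neg.sum(), new_pos.sum()   # Z_{t,-} and Z_{t,+}
    # the pair weight D_{t+1}(x, x') is the product of the two returned factors
    return new_neg / Z_neg, new_pos / Z_pos

# Cost per round is O(|S_-| + |S_+|) rather than O(|S_-| * |S_+|).
```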
ROC Curve
(Egan, 1975)

Definition: the receiver operating characteristic (ROC) curve is a plot of the true positive rate (TP) vs. the false positive rate (FP), obtained by sweeping a threshold over the sorted scores $h(x)$.
• TP: percentage of positive points correctly labeled positive.
• FP: percentage of negative points incorrectly labeled positive.

(Figure: ROC curve, true positive rate vs. false positive rate, with points obtained by thresholding the sorted scores.)
Area under the ROC Curve (AUC)
(Hanley and McNeil, 1982)

Definition: the AUC is the area under the ROC curve; it is a measure of ranking quality.

(Figure: ROC curve with the shaded area under the curve.)

Equivalently,
$$\mathrm{AUC}(h) = \frac{1}{m\, m'} \sum_{i=1}^{m} \sum_{j=1}^{m'} 1_{h(x'_j) > h(x_i)}
= \Pr_{x \sim \widehat{D}_-,\; x' \sim \widehat{D}_+}\big[h(x') > h(x)\big] = 1 - \widehat{R}(h).$$
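A minimal sketch (not from the slides) of the empirical AUC as the fraction of correctly ordered (negative, positive) pairs, as in the formula above; the $O(m\,m')$ comparison is chosen for clarity rather than efficiency, ties are not handled specially, and the names are ours.

```python
# Illustrative sketch: empirical AUC via pairwise comparisons.
import numpy as np

def auc(h, negatives, positives):
    """negatives, positives: iterables of points; h: scoring function."""
    neg_scores = np.array([h(x) for x in negatives])
    pos_scores = np.array([h(x) for x in positives])
    # fraction of pairs (x, x') with h(x') > h(x)
    return float(np.mean(pos_scores[None, :] > neg_scores[:, None]))

# Example with a 1-D identity score:
print(auc(lambda x: x, negatives=[0.1, 0.4, 0.35], positives=[0.8, 0.7, 0.5]))  # 1.0
```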
This Talk

• Score-based ranking
• Preference-based ranking
Preference-Based Setting

Definitions:
• $U$: universe, full set of objects;
• $V$: finite query subset to rank, $V \subseteq U$;
• $\sigma^*$: target ranking for $V$ (random variable).

Two stages (can be viewed as a reduction):
• learn a preference function $h \colon U \times U \to [0, 1]$;
• given $V$, use $h$ to determine a ranking of $V$.

Running time: measured in terms of the number of calls to $h$.
Preference-Based Ranking Problem

Training data: pairs $(V, \sigma^*)$ sampled i.i.d. according to $D$:
$$(V_1, \sigma_1^*), (V_2, \sigma_2^*), \ldots, (V_m, \sigma_m^*), \qquad V_i \subseteq U,$$
e.g., subsets ranked by different labelers; learn a preference function $h \colon U \times U \to [0, 1]$.

Problem: for any query set $V \subseteq U$, use $h$ to return a ranking $\sigma_{h, V}$ close to the target $\sigma^*$, with small average error
$$R(h, \sigma) = \operatorname{E}_{(V, \sigma^*) \sim D}\big[L(\sigma_{h, V}, \sigma^*)\big].$$
Preference Function

$h(u, v)$ is close to 1 when $u$ is preferred to $v$, close to 0 otherwise. For the analysis, $h(u, v) \in \{0, 1\}$.

Assumed pairwise consistent: $h(u, v) + h(v, u) = 1$.

May be non-transitive, e.g., we may have $h(u, v) = h(v, w) = h(w, u) = 1$.

Output of a classifier or a 'black-box'.
Loss Functions
(for a fixed $(V, \sigma^*)$)

Preference loss:
$$L(h, \sigma^*) = \frac{2}{n(n-1)} \sum_{u \neq v} h(u, v)\, \sigma^*(v, u).$$

Ranking loss:
$$L(\sigma, \sigma^*) = \frac{2}{n(n-1)} \sum_{u \neq v} \sigma(u, v)\, \sigma^*(v, u).$$
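A minimal sketch (not from the slides) of the two losses above, encoding a ranking as a list ordered from most to least preferred and using $\sigma(u, v) = 1$ iff $u$ is ranked above $v$; the helper names and the list encoding are assumptions made here.

```python
# Illustrative sketch: preference loss and ranking loss for a fixed (V, sigma*).
from itertools import permutations

def ranks_above(order):
    """Return the 0/1 pairwise indicator sigma(u, v) for an ordered list."""
    pos = {x: i for i, x in enumerate(order)}
    return lambda u, v: 1 if pos[u] < pos[v] else 0

def preference_loss(h, target_order):
    """L(h, sigma*) = 2/(n(n-1)) * sum_{u != v} h(u, v) sigma*(v, u)."""
    sigma_star = ranks_above(target_order)
    n = len(target_order)
    return 2.0 / (n * (n - 1)) * sum(
        h(u, v) * sigma_star(v, u)
        for u, v in permutations(target_order, 2)
    )

def ranking_loss(order, target_order):
    """L(sigma, sigma*) with sigma given as an ordered list over the same objects."""
    return preference_loss(ranks_above(order), target_order)
```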
(Weak) Regret

Preference regret:
$$R'_{\text{class}}(h) = \operatorname{E}_{V, \sigma^*}\big[L(h|_V, \sigma^*)\big]
- \operatorname{E}_{V}\Big[\min_{\tilde h} \operatorname{E}_{\sigma^*|V}\big[L(\tilde h, \sigma^*)\big]\Big].$$

Ranking regret:
$$R'_{\text{rank}}(A) = \operatorname{E}_{V, \sigma^*, s}\big[L(A_s(V), \sigma^*)\big]
- \operatorname{E}_{V}\Big[\min_{\tilde \sigma \in S(V)} \operatorname{E}_{\sigma^*|V}\big[L(\tilde \sigma, \sigma^*)\big]\Big].$$
Deterministic Algorithm
(Balcan et al., 2007)

Stage one: standard classification. Learn a preference function $h \colon U \times U \to [0, 1]$.

Stage two: sort-by-degree using the comparison function $h$:
• sort by the number of points ranked below;
• quadratic time complexity, $O(n^2)$.
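A minimal sketch (not from the slides) of the sort-by-degree stage: each point is scored by the number of points it is preferred to, which costs $O(n^2)$ calls to $h$; the function name is ours.

```python
# Illustrative sketch: sort-by-degree using a pairwise preference function h.
def sort_by_degree(V, h):
    """V: list of objects; h(u, v) in {0, 1} with h(u, v) = 1 iff u preferred to v."""
    degree = [sum(h(V[i], V[j]) for j in range(len(V)) if j != i) for i in range(len(V))]
    order = sorted(range(len(V)), key=lambda i: degree[i], reverse=True)
    return [V[i] for i in order]
```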
Randomized Algorithm
(Ailon & MM, 2008)

Stage one: standard classification. Learn a preference function $h \colon U \times U \to [0, 1]$.

Stage two: randomized QuickSort (Hoare, 1961) using $h$ as the comparison function:
• the comparison function is non-transitive, unlike in the textbook setting;
• but the expected time complexity is shown to be $O(n \log n)$ in general.
Randomized QS

(Figure: a random pivot $u$ is chosen; elements $v$ with $h(v, u) = 1$ go to the left recursion, and elements with $h(u, v) = 1$ go to the right recursion.)
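A minimal sketch (not from the slides) of stage two: randomized QuickSort driven by a possibly non-transitive preference function $h$; the names and recursion structure are standard QuickSort, chosen here for clarity.

```python
# Illustrative sketch: randomized QuickSort with h as the comparison function.
import random

def quicksort_by_preference(V, h, rng=random):
    """V: list of objects; h(u, v) in {0, 1} with h(u, v) = 1 iff u preferred to v.
    Returns a ranking from most to least preferred."""
    if len(V) <= 1:
        return list(V)
    i = rng.randrange(len(V))
    pivot, rest = V[i], V[:i] + V[i + 1:]
    left = [v for v in rest if h(v, pivot) == 1]    # v preferred to pivot
    right = [v for v in rest if h(pivot, v) == 1]   # pivot preferred to v
    return quicksort_by_preference(left, h, rng) + [pivot] + quicksort_by_preference(right, h, rng)
```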
Deterministic Algo. - Bipartite Case ($V = V_+ \cup V_-$)
(Balcan et al., 2007)

Bounds for the deterministic sort-by-degree algorithm:
• expected loss:
$$\operatorname{E}_{V, \sigma^*}\big[L(A(V), \sigma^*)\big] \le 2\, \operatorname{E}_{V, \sigma^*}\big[L(h, \sigma^*)\big];$$
• regret:
$$R'_{\text{rank}}(A(V)) \le 2\, R'_{\text{class}}(h).$$

Time complexity: $\Omega(|V|^2)$.
Randomized Algo. - Bipartite Case ($V = V_+ \cup V_-$)
(Ailon & MM, 2008; Hoare, 1961)

Bounds for randomized QuickSort:
• expected loss (equality):
$$\operatorname{E}_{V, \sigma^*, s}\big[L(Q^h_s(V), \sigma^*)\big] = \operatorname{E}_{V, \sigma^*}\big[L(h, \sigma^*)\big];$$
• regret:
$$R'_{\text{rank}}(Q^h_s(\cdot)) \le R'_{\text{class}}(h).$$

Time complexity:
• full set: $O(n \log n)$;
• top $k$: $O(n + k \log k)$.
Proof Ideas

QuickSort decomposition:
$$p_{uv} + \frac{1}{3} \sum_{w \notin \{u, v\}} p_{uvw}\, \big[h(u, w)\, h(w, v) + h(v, w)\, h(w, u)\big] = 1.$$

Bipartite property:
$$\sigma^*(u, v) + \sigma^*(v, w) + \sigma^*(w, u) = \sigma^*(v, u) + \sigma^*(w, v) + \sigma^*(u, w).$$

(Figure: a triangle over the elements $u$, $v$, $w$.)
Lower Bound

Theorem: for any deterministic algorithm $A$, there is a bipartite distribution for which
$$R'_{\text{rank}}(A) \ge 2\, R'_{\text{class}}(h).$$
• thus, the factor of 2 is the best possible in the deterministic case;
• randomization is necessary for a better bound.

Proof: take the simple case $U = V = \{u, v, w\}$ and assume that $h$ induces a cycle.
• up to symmetry, $A$ returns $u, v, w$ or $w, v, u$.

(Figure: the preference function $h$ forming a cycle over $u$, $v$, $w$.)
Lower Bound

If $A$ returns $u, v, w$, then choose $\sigma^*$ as: $u, v$ negative, $w$ positive; then
$$L[h, \sigma^*] = \frac{1}{3}, \qquad L[A, \sigma^*] = \frac{2}{3}.$$

If $A$ returns $w, v, u$, then choose $\sigma^*$ as: $w, v$ negative, $u$ positive.

(Figure: the cyclic preference function $h$ over $u$, $v$, $w$ with the corresponding $-$/$+$ labels.)
Guarantees - General Case

Loss bound for QuickSort:
$$\operatorname{E}_{V, \sigma^*, s}\big[L(Q^h_s(V), \sigma^*)\big] \le 2\, \operatorname{E}_{V, \sigma^*}\big[L(h, \sigma^*)\big].$$

Comparison with the optimal ranking (see (CSS 99)):
$$\operatorname{E}_{s}\big[L(Q^h_s(V), \sigma_{\text{optimal}})\big] \le 2\, L(h, \sigma_{\text{optimal}}),
\qquad
\operatorname{E}_{s}\big[L(h, Q^h_s(V))\big] \le 3\, L(h, \sigma_{\text{optimal}}),$$
where $\sigma_{\text{optimal}} = \operatorname*{argmin}_{\sigma} L(h, \sigma)$.
Weight Function

Generalization:
$$\sigma_\omega(u, v) = \sigma(u, v)\; \omega\big(\sigma(u), \sigma(v)\big).$$

Properties (needed for all previous results to hold):
• symmetry: $\omega(i, j) = \omega(j, i)$ for all $i, j$;
• monotonicity: $\omega(i, j), \omega(j, k) \le \omega(i, k)$ for $i < j < k$;
• triangle inequality: $\omega(i, j) \le \omega(i, k) + \omega(k, j)$ for all triplets $i, j, k$.
Weight Function - Examples

Kemeny:
$$\omega(i, j) = 1 \quad \text{for all } i, j.$$

Top-$k$:
$$\omega(i, j) = \begin{cases} 1 & \text{if } i \le k \text{ or } j \le k; \\ 0 & \text{otherwise.} \end{cases}$$

Bipartite:
$$\omega(i, j) = \begin{cases} 1 & \text{if } i \le k \text{ and } j > k; \\ 0 & \text{otherwise.} \end{cases}$$

$k$-partite: can be defined similarly.
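A minimal sketch (not from the slides) of the three example weight functions above, each taking two 1-indexed positions $i, j$; the function names are ours.

```python
# Illustrative sketch: example weight functions over pairs of rank positions.
def kemeny_weight(i, j):
    return 1

def top_k_weight(i, j, k):
    return 1 if i <= k or j <= k else 0

def bipartite_weight(i, j, k):
    return 1 if i <= k and j > k else 0
```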
(Strong) Regret Definitions

Ranking regret:
$$R_{\text{rank}}(A) = \operatorname{E}_{V, \sigma^*, s}\big[L(A_s(V), \sigma^*)\big]
- \min_{\tilde \sigma} \operatorname{E}_{V, \sigma^*}\big[L(\tilde \sigma|_V, \sigma^*)\big].$$

Preference regret:
$$R_{\text{class}}(h) = \operatorname{E}_{V, \sigma^*}\big[L(h|_V, \sigma^*)\big]
- \min_{\tilde h} \operatorname{E}_{V, \sigma^*}\big[L(\tilde h|_V, \sigma^*)\big].$$

All previous regret results hold if, for all $V_1, V_2 \supseteq \{u, v\}$,
$$\operatorname{E}\big[\sigma^*|_{V_1}(u, v)\big] = \operatorname{E}\big[\sigma^*|_{V_2}(u, v)\big]$$
for all $u, v$ (pairwise independence on irrelevant alternatives).
References

• Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. (2005). Generalization bounds for the area under the ROC curve. JMLR 6, 393-425.
• Agarwal, S., and Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. COLT (pp. 32-47).
• Nir Ailon and Mehryar Mohri. An efficient reduction of ranking to classification. In Proceedings of COLT 2008. Helsinki, Finland, July 2008. Omnipress.
• Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., and Sorkin, G. B. (2007). Robust reductions from ranking to classification. In Proceedings of COLT (pp. 604-619). Springer.
• Stephen Boyd, Corinna Cortes, Mehryar Mohri, and Ana Radovanovic (2012). Accuracy at the top. In NIPS 2012.
• Cohen, W. W., Schapire, R. E., and Singer, Y. (1999). Learning to order things. J. Artif. Intell. Res. (JAIR), 10, 243-270.
• Cossock, D., and Zhang, T. (2006). Subset ranking using regression. COLT (pp. 605-619).
• Corinna Cortes and Mehryar Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems (NIPS 2003), 2004. MIT Press.
• Cortes, C., Mohri, M., and Rastogi, A. (2007). An alternative ranking problem for search engines. Proceedings of WEA 2007 (pp. 1-21). Rome, Italy: Springer.
• Corinna Cortes and Mehryar Mohri. Confidence intervals for the area under the ROC curve. In Advances in Neural Information Processing Systems (NIPS 2004), 2005. MIT Press.
• Crammer, K., and Singer, Y. (2001). Pranking with ranking. Proceedings of NIPS 2001 (pp. 641-647). MIT Press.
• Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the Web. WWW 10, 2001. ACM Press.
• J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.
• Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. JMLR 4:933-969, 2003.
• J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.
• Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pages 115-132, 2000.
• Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
• Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies Project, 1998.
• Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. San Francisco, California: Holden-Day.
• Cynthia Rudin, Corinna Cortes, Mehryar Mohri, and Robert E. Schapire. Margin-based ranking meets boosting in the middle. In Proceedings of the 18th Annual Conference on Computational Learning Theory (COLT 2005), pages 63-78, 2005.
• Thorsten Joachims. Optimizing search engines using clickthrough data. Proceedings of the 8th ACM SIGKDD, pages 133-142, 2002.