A Tutorial on Boosting Yoav Freund Rob Schapire
www.research.att.com/~yoav www.research.att.com/~schapire
Example: “How May I Help You?” [Gorin et al.]
goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.)

- "yes I'd like to place a collect call long distance please" (Collect)
- "operator I need to make a call but I need to bill it to my office" (ThirdNumber)
- "yes I'd like to place a call on my master card please" (CallingCard)
- "I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill" (BillingCredit)
observation: easy to find "rules of thumb" that are "often" correct
  e.g.: "IF 'card' occurs in utterance THEN predict 'CallingCard'"
hard to find single highly accurate prediction rule
The Boosting Approach

- select small subset of examples
- derive rough rule of thumb
- examine 2nd set of examples
- derive 2nd rule of thumb
- repeat T times

questions:
- how to choose subsets of examples to examine on each round?
- how to combine all the rules of thumb into single prediction rule?

boosting = general method of converting rough rules of thumb into highly accurate prediction rule
Tutorial outline

first half (Rob): behavior on the training set
- background
- AdaBoost
- analyzing training error
- experiments
- connection to game theory
- confidence-rated predictions
- multiclass problems
- boosting for text categorization

second half (Yoav): understanding AdaBoost's generalization performance
The Boosting Problem

"strong" PAC algorithm:
- for any distribution
- ∀ ε > 0, δ > 0
- given polynomially many random examples
- finds hypothesis with error ≤ ε with probability ≥ 1 − δ

"weak" PAC algorithm: same, but only for ε ≥ 1/2 − γ

[Kearns & Valiant '88]: does weak learnability imply strong learnability?
Early Boosting Algorithms

[Schapire '89]:
- first provable boosting algorithm
- call weak learner three times on three modified distributions
- get slight boost in accuracy
- apply recursively

[Freund '90]:
- "optimal" algorithm that "boosts by majority"

[Drucker, Schapire & Simard '92]:
- first experiments using boosting
- limited by practical drawbacks
AdaBoost

[Freund & Schapire '95]:
- introduced "AdaBoost" algorithm
- strong practical advantages over previous boosting algorithms

experiments using AdaBoost:
[Drucker & Cortes '95]
[Schapire & Singer ’98]
[Jackson & Craven ’96]
[Maclin & Opitz ’97]
[Freund & Schapire ’96]
[Bauer & Kohavi ’97]
[Quinlan ’96]
[Schwenk & Bengio ’98]
[Breiman ’96]
[Dietterich ’98]
...

continuing development of theory and algorithms:
[Schapire, Freund, Bartlett & Lee '97]
[Schapire & Singer '98]
[Breiman '97]
[Mason, Bartlett & Baxter ’98]
[Grove & Schuurmans ’98]
[Friedman, Hastie & Tibshirani ’98]
..
A Formal View of Boosting

given training set (x_1, y_1), ..., (x_m, y_m)
  y_i ∈ {−1, +1} correct label of instance x_i ∈ X

for t = 1, ..., T:
- construct distribution D_t on {1, ..., m}
- find weak hypothesis ("rule of thumb")
    h_t : X → {−1, +1}
  with small error ε_t on D_t:
    ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i]

output final hypothesis H_final
AdaBoost [Freund & Schapire]
constructing D_t:
- D_1(i) = 1/m
- given D_t and h_t:

    D_{t+1}(i) = (D_t(i) / Z_t) × { e^{−α_t}  if y_i = h_t(x_i)
                                  { e^{α_t}   if y_i ≠ h_t(x_i)

               = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t

  where Z_t = normalization constant and

    α_t = (1/2) ln((1 − ε_t) / ε_t) > 0

final hypothesis:

    H_final(x) = sign( Σ_t α_t h_t(x) )
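The update and combination rules above translate almost line-for-line into code. This is a minimal sketch, not the authors' implementation; the threshold-stump weak learner for 1-D data is a hypothetical stand-in, since the slides leave the weak learner abstract:

```python
import math

def adaboost(X, y, weak_learner, T):
    """X: list of instances, y: list of +/-1 labels; weak_learner returns
    a +/-1-valued hypothesis for the current distribution D_t."""
    m = len(X)
    D = [1.0 / m] * m                       # D_1(i) = 1/m
    hyps = []
    for _ in range(T):
        h = weak_learner(X, y, D)           # rule of thumb under D_t
        eps = sum(D[i] for i in range(m) if h(X[i]) != y[i])
        if eps >= 0.5:                      # no edge over random guessing
            break
        eps = max(eps, 1e-12)               # guard against a perfect hypothesis
        alpha = 0.5 * math.log((1 - eps) / eps)
        # multiplicative reweighting: mistakes get e^{+alpha}, correct get e^{-alpha}
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)                          # normalization constant Z_t
        D = [d / Z for d in D]
        hyps.append((alpha, h))
    # final hypothesis: weighted majority vote sign(sum_t alpha_t h_t(x))
    return lambda x: 1 if sum(a * h(x) for a, h in hyps) >= 0 else -1

def stump_learner(X, y, D):
    # exhaustive search over 1-D threshold stumps, minimizing weighted error
    best_err, best_h = 2.0, None
    for theta in set(X):
        for s in (1, -1):
            h = (lambda th, si: (lambda x: si if x > th else -si))(theta, s)
            err = sum(D[i] for i in range(len(X)) if h(X[i]) != y[i])
            if err < best_err:
                best_err, best_h = err, h
    return best_h
```

For example, `adaboost([0, 1, 2, 3, 4, 5], [-1, -1, -1, 1, 1, 1], stump_learner, 5)` returns a classifier that labels all six training points correctly.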
Toy Example
D_1: [figure: ten training points in the plane, initially weighted uniformly]

Round 1: weak hypothesis h_1; ε_1 = 0.30, α_1 = 0.42; updated distribution D_2
Round 2: weak hypothesis h_2; ε_2 = 0.21, α_2 = 0.65; updated distribution D_3
Round 3: weak hypothesis h_3; ε_3 = 0.14, α_3 = 0.92
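The round-by-round quantities can be reproduced from the update formulas; a small numeric check (the α values on the slides are rounded):

```python
import math

eps = [0.30, 0.21, 0.14]     # epsilon_1..epsilon_3 from the three rounds
alphas = [0.5 * math.log((1 - e) / e) for e in eps]

# product bound on training error from the analysis section: prod_t 2*sqrt(eps_t(1-eps_t))
bound = 1.0
for e in eps:
    bound *= 2 * math.sqrt(e * (1 - e))

print([round(a, 2) for a in alphas])   # -> [0.42, 0.66, 0.91]
print(round(bound, 2))                 # -> 0.52
```

So after only three rounds the bound on the training error has already dropped to about 0.52, and it keeps shrinking geometrically with further rounds.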
Final Hypothesis

    H_final = sign( 0.42 h_1 + 0.65 h_2 + 0.92 h_3 )

[figure: the weighted combination of h_1, h_2, h_3 and the resulting decision boundary]
* See demo at www.research.att.com/~yoav/adaboost
Analyzing the Training Error

Theorem: run AdaBoost; let γ_t = 1/2 − ε_t; then

    training error(H_final) ≤ Π_t [ 2 √(ε_t (1 − ε_t)) ]
                            = Π_t √(1 − 4 γ_t²)
                            ≤ exp( −2 Σ_t γ_t² )

so: if ∀t: γ_t ≥ γ > 0 then

    training error(H_final) ≤ e^{−2 γ² T}

adaptive:
- does not need γ or T a priori
- can exploit γ_t ≫ γ
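The chain of inequalities in the theorem can be spot-checked numerically. The first equality is the identity 2√(ε(1−ε)) = √(1−4γ²) with ε = 1/2 − γ, and the final inequality follows from 1 − x ≤ e^{−x} applied with x = 4γ_t²; the sample edge values below are arbitrary:

```python
import math

# identity linking the two product forms: with gamma = 1/2 - eps,
# 2*sqrt(eps*(1-eps)) equals sqrt(1 - 4*gamma^2)
eps = 0.3
g = 0.5 - eps
assert abs(2 * math.sqrt(eps * (1 - eps)) - math.sqrt(1 - 4 * g * g)) < 1e-12

# the product form is never larger than the exponential form,
# since 1 - x <= e^{-x} with x = 4*gamma_t^2
gammas = [0.2, 0.05, 0.1, 0.15]
prod_form = 1.0
for g in gammas:
    prod_form *= math.sqrt(1 - 4 * g * g)
exp_form = math.exp(-2 * sum(g * g for g in gammas))
assert prod_form <= exp_form
```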
Proof

let f(x) = Σ_t α_t h_t(x), so that H_final(x) = sign(f(x))

Step 1: unwrapping the recursion:

    D_final(i) = (1/m) · exp(−y_i Σ_t α_t h_t(x_i)) / Π_t Z_t
               = (1/m) · e^{−y_i f(x_i)} / Π_t Z_t
Step 2: training error(H_final) ≤ Π_t Z_t

Proof: H_final(x) ≠ y ⇒ y f(x) ≤ 0 ⇒ e^{−y f(x)} ≥ 1, so:

    training error(H_final) = (1/m) Σ_i [ 1 if y_i ≠ H_final(x_i) else 0 ]
                            ≤ (1/m) Σ_i e^{−y_i f(x_i)}
                            = Σ_i D_final(i) Π_t Z_t
                            = Π_t Z_t
Step 3: Z_t = 2 √(ε_t (1 − ε_t))

Proof (cont.):

    Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i))
        = Σ_{i : y_i ≠ h_t(x_i)} D_t(i) e^{α_t} + Σ_{i : y_i = h_t(x_i)} D_t(i) e^{−α_t}
        = ε_t e^{α_t} + (1 − ε_t) e^{−α_t}
        = 2 √(ε_t (1 − ε_t))
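The last line holds precisely because α_t = (1/2) ln((1 − ε_t)/ε_t) minimizes Z_t as a function of α; a quick numeric sanity check:

```python
import math

def Z(alpha, eps):
    # Z_t as a function of alpha for a fixed weak-hypothesis error eps:
    # eps * e^{alpha} + (1 - eps) * e^{-alpha}
    return eps * math.exp(alpha) + (1 - eps) * math.exp(-alpha)

eps = 0.3
alpha_star = 0.5 * math.log((1 - eps) / eps)   # AdaBoost's choice of alpha_t
z_min = 2 * math.sqrt(eps * (1 - eps))         # claimed minimum value of Z_t

# the chosen alpha attains the closed-form minimum...
assert abs(Z(alpha_star, eps) - z_min) < 1e-12
# ...and perturbing alpha in either direction only increases Z
assert Z(alpha_star + 0.01, eps) > z_min
assert Z(alpha_star - 0.01, eps) > z_min
```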
UCI Experiments [Freund & Schapire]
tested AdaBoost on UCI benchmarks

used:
- C4.5 (Quinlan's decision tree algorithm)
- "decision stumps": very simple rules of thumb that test on single attributes

    eye color = brown ?           height > 5 feet ?
      yes: predict +1               yes: predict -1
      no:  predict -1               no:  predict +1

[scatter plots: test error (%) of C4.5 (y-axis, 0-30) against boosting stumps and against boosting C4.5 (x-axis, 0-30) on the UCI benchmarks]
Game Theory

game defined by matrix M:

                Rock  Paper  Scissors
    Rock         1/2    1       0
    Paper         0    1/2      1
    Scissors      1     0      1/2

- row player chooses row i
- column player chooses column j (simultaneously)
- row player's goal: minimize loss M(i, j)

usually allow randomized play:
- players choose distributions P and Q over rows and columns
- learner's (expected) loss = Σ_{i,j} P(i) M(i, j) Q(j)
                            = P^T M Q
                            ≡ M(P, Q)
The Minmax Theorem

von Neumann's minmax theorem:

    min_P max_Q M(P, Q) = max_Q min_P M(P, Q) = v

v = "value" of game M

in words:
- v = min max means: row player has strategy P* such that ∀ column strategy Q, loss M(P*, Q) ≤ v
- v = max min means: this is optimal in the sense that the column player has strategy Q* such that ∀ row strategy P, loss M(P, Q*) ≥ v
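For the Rock-Paper-Scissors matrix above, the value is v = 1/2, achieved by uniform play; a numeric check (the mixed column strategy tested last is an arbitrary example):

```python
# Loss matrix for the row player: 1/2 on a tie, 1 on a loss, 0 on a win
M = [[0.5, 1.0, 0.0],     # Rock    vs  Rock, Paper, Scissors
     [0.0, 0.5, 1.0],     # Paper
     [1.0, 0.0, 0.5]]     # Scissors

def loss(P, Q):
    # expected loss M(P, Q) = sum_ij P(i) M(i, j) Q(j)
    return sum(P[i] * M[i][j] * Q[j] for i in range(3) for j in range(3))

# uniform play guarantees the row player expected loss exactly 1/2
# against every column strategy, matching the game value v = 1/2
P_uniform = [1/3, 1/3, 1/3]
for Q in ([1, 0, 0], [0, 1, 0], [0, 0, 1], [0.2, 0.5, 0.3]):
    assert abs(loss(P_uniform, Q) - 0.5) < 1e-12
```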
The Boosting Game

- row player ↔ booster
- column player ↔ weak learner

matrix M:
- row ↔ example (x_i, y_i)
- column ↔ weak hypothesis h

    M(i, h) = { 1 if y_i = h(x_i)
              { 0 else
Boosting and the Minmax Theorem

if: ∀ distributions over examples, ∃h with accuracy ≥ 1/2 + γ

then: min_P max_h M(P, h) ≥ 1/2 + γ

by minmax theorem: max_Q min_i M(i, Q) ≥ 1/2 + γ > 1/2

which means: ∃ weighted majority of hypotheses which correctly classifies all examples
AdaBoost and Game Theory [Freund & Schapire]
AdaBoost is special case of general algorithm for solving games through repeated play

can show:
- distribution over examples converges to (approximate) minmax strategy for boosting game
- weights on weak hypotheses converge to (approximate) maxmin strategy

different instantiation of game-playing algorithm gives on-line learning algorithms (such as weighted majority algorithm)
Confidence-rated Predictions [Schapire & Singer]
useful to allow weak hypotheses to assign confidences to predictions

formally, allow h_t : X → R:
- sign(h_t(x)) = prediction
- |h_t(x)| = "confidence"

use identical update:

    D_{t+1}(i) = (D_t(i) / Z_t) exp(−α_t y_i h_t(x_i))

and identical rule for combining weak hypotheses

questions:
- how to choose h_t's (specifically, how to assign confidences to predictions)
- how to choose α_t's
Confidence-rated Predictions (cont.)

Theorem: training error(H_final) ≤ Π_t Z_t

Proof: same as before

therefore, on each round t, should choose h_t and α_t to minimize:

    Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i))

given h_t, can find α_t which minimizes Z_t:
- analytically (sometimes)
- numerically (in general)

should design weak learner to minimize Z_t
- e.g.: for decision trees, criterion gives:
  - splitting rule
  - assignment of confidences at leaves
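When h_t is real-valued there is generally no closed form for α_t, but Z_t is convex in α, so a one-dimensional search suffices. A sketch using bisection on the derivative; the search bracket [-10, 10] is an arbitrary assumption:

```python
import math

def Z(alpha, margins, D):
    # Z_t as a function of alpha; margins[i] = y_i * h_t(x_i), real-valued
    return sum(d * math.exp(-alpha * u) for d, u in zip(D, margins))

def best_alpha(margins, D, lo=-10.0, hi=10.0, iters=200):
    # Z is convex in alpha, so bisect on its derivative dZ/dalpha
    def dZ(a):
        return sum(-u * d * math.exp(-a * u) for d, u in zip(D, margins))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dZ(mid) < 0:
            lo = mid          # still decreasing: minimum lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

# binary +/-1 special case: recovers the closed form (1/2) ln((1-eps)/eps)
a = best_alpha([-1.0, 1.0], [0.3, 0.7])
assert abs(a - 0.5 * math.log(0.7 / 0.3)) < 1e-9
```

With genuinely real-valued margins the same routine returns the numeric minimizer, which is the "(in general) numerically" case on the slide.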
Minimizing Exponential Loss

AdaBoost attempts to minimize:

    Π_{t=1}^T Z_t = (1/m) Σ_i exp(−y_i f(x_i))
                  = (1/m) Σ_i exp(−y_i Σ_t α_t h_t(x_i))    (*)

really a steepest descent procedure: each round, add term α_t h_t to sum to minimize (*)

why this loss function?
- upper bound on training (classification) error
- easy to work with
- connection to logistic regression [Friedman, Hastie & Tibshirani]
Multiclass Problems

say y ∈ Y = {1, ..., k}

direct approach (AdaBoost.M1):

    h_t : X → Y

    D_{t+1}(i) = (D_t(i) / Z_t) × { e^{−α_t}  if y_i = h_t(x_i)
                                  { e^{α_t}   if y_i ≠ h_t(x_i)

    H_final(x) = arg max_{y ∈ Y} Σ_{t : h_t(x) = y} α_t

can prove same bound on error if ∀t: ε_t ≤ 1/2
- in practice, not usually a problem for "strong" weak learners (e.g., C4.5)
- significant problem for "weak" weak learners (e.g., decision stumps)
Reducing to Binary Problems [Schapire & Singer]
e.g.: say possible labels are {a, b, c, d, e}

each training example replaced by five {−1, +1}-labeled examples:

    x, label = c  →  (x, a), −1
                     (x, b), −1
                     (x, c), +1
                     (x, d), −1
                     (x, e), −1
AdaBoost.MH

formally:

    h_t : X × Y → {−1, +1} (or R)

    D_{t+1}(i, y) = (D_t(i, y) / Z_t) exp(−α_t v_i(y) h_t(x_i, y))

    where v_i(y) = { +1 if y_i = y
                   { −1 if y_i ≠ y

    H_final(x) = arg max_{y ∈ Y} Σ_t α_t h_t(x, y)

can prove:

    training error(H_final) ≤ (k/2) Π_t Z_t
Using Output Codes [Schapire & Singer]
alternative: reduce to "random" binary problems

choose "code word" for each label:

         1   2   3   4
    a    −   +   −   −
    b    +   +   +   −
    c    +   −   −   +
    d    −   +   −   +
    e    +   −   +   −

each training example mapped to one example per column:

    x, label = c  →  (x, 1), +1
                     (x, 2), −1
                     (x, 3), −1
                     (x, 4), +1

to classify new example x:
- evaluate hypothesis on (x, 1), ..., (x, 4)
- choose label "most consistent" with results

- training error bounds independent of # of classes
- may be more efficient for very large # of classes
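The "most consistent" decoding step can be sketched as follows. The code matrix here is illustrative (the row for c is pinned down by the worked example; the other rows are assumptions), and `decode` stands in for running the four learned binary hypotheses:

```python
# Each label gets a +/-1 code word, one entry per binary problem (column).
codes = {
    "a": [-1, +1, -1, -1],
    "b": [+1, +1, +1, -1],
    "c": [+1, -1, -1, +1],
    "d": [-1, +1, -1, +1],
    "e": [+1, -1, +1, -1],
}

def decode(predictions):
    # choose the label whose code word agrees with the most binary predictions
    return max(codes, key=lambda y: sum(p == c for p, c in zip(predictions, codes[y])))

# perfect predictions recover the label...
assert decode([+1, -1, -1, +1]) == "c"
# ...and here one flipped bit of b's code word is still decoded correctly
assert decode([+1, +1, +1, +1]) == "b"
```

Spreading each label across several binary problems is what makes the training error bound independent of the number of classes: a decoding mistake requires several binary hypotheses to err at once.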
Example: Boosting for Text Categorization

[Schapire & Singer]

weak hypotheses: very simple weak hypotheses that test on simple patterns, namely, (sparse) n-grams
- find parameter α_t and rule h_t of given form which minimize Z_t
- use efficiently implemented exhaustive search

"How may I help you" data:
- 7844 training examples (hand-transcribed)
- 1000 test examples (both hand-transcribed and from speech recognizer)
- categories: AreaCode, AttService, BillingCredit, CallingCard, Collect, Competitor, DialForMe, Directory, HowToDial, PersonToPerson, Rate, ThirdNumber, Time, TimeCharge, Other.
Weak Hypotheses

[table: the first weak hypotheses found, one per round; each term's confidence-rated prediction for every category (AC, AS, BC, CC, CO, CM, DM, DI, HO, PP, RA, 3N, TI, TC, OT) appears in the original figure]

    rnd  term
     1   collect
     2   card
     3   my home
     4   person ? person
     5   code
     6   I
     7   time
     8   wrong number
     9   how
More Weak Hypotheses

    rnd  term
    10   call
    11   seven
    12   trying to
    13   and
    14   third
    15   to
    16   for
    17   charges
    18   dial
    19   just
Learning Curves

[figure: error (%) vs. # rounds of boosting (1 to 10,000, log scale); four curves: conf (test), conf (train), noconf (test), noconf (train); y-axis from 0 to 50%]
test error reaches 20% for the first time on round...
- 1,932 without confidence ratings
- 191 with confidence ratings

test error reaches 18% for the first time on round...
- 10,909 without confidence ratings
- 303 with confidence ratings
Finding Outliers

examples with most weight are often outliers (mislabeled and/or ambiguous):

- "I'm trying to make a credit card call" (Collect)
- "hello" (Rate)
- "yes I'd like to make a long distance collect call please" (CallingCard)
- "calling card please" (Collect)
- "yeah I'd like to use my calling card number" (Collect)
- "can I get a collect call" (CallingCard)
- "yes I would like to make a long distant telephone call and have the charges billed to another number" (CallingCard DialForMe)
- "yeah I can not stand it this morning I did oversea call is so bad" (BillingCredit)
- "yeah special offers going on for long distance" (AttService Rate)
- "mister allen please william allen" (PersonToPerson)
- "yes ma'am I I'm trying to make a long distance call to a non dialable point in san miguel philippines" (AttService Other)
- "yes I like to make a long distance call and charge it to my home phone that's where I'm calling at my home" (DialForMe)
- "I like to make a call and charge it to my ameritech" (Competitor)