Ensemble Models and Gradient Boosting
Leonardo Auslender, Independent Statistical Consultant
Leonardo.Auslender ‘at’ Gmail ‘dot’ com
Copyright 2014
Topics to cover:
1) Why more techniques? Bias-variance tradeoff.
2) Ensembles:
   1) Bagging – stacking
   2) Random Forests
   3) Gradient Boosting (GB)
   4) Gradient-descent optimization method
   5) Innards of GB
   6) Overall Ensembles
   7) Partial Dependency Plots (PDP)
   8) Case Study
   9) XGBoost
   10) On the practice of Ensembles
   11) References
1) Why more techniques? Bias-variance tradeoff. (A broken clock is right twice a day: zero estimation variance, extremely high bias. A thermometer that is accurate overall but reads higher or lower at night: unbiased, higher variance. Always betting on the same horse has zero variance but is possibly extremely biased.)
Model error can be broken down mathematically into three components: bias, variance, and irreducible error. Let f be the true function being estimated and f̂ the empirically derived estimate.
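For squared-error loss, with y = f(x) + ε and Var(ε) = σ², a standard way to spell out this decomposition (added here as a sketch; the slide itself only names the three components) is:

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```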
[Figure: bias-variance quadrants illustrated with horse-betting outcomes: bet on the right horse and win (low bias, low variance); bet on many horses and win (low bias, high variance); bet on the wrong horse and lose (high bias, low variance); bet on many horses and lose (high bias, high variance).]
Credit: Scott Fortmann-Roe (web).
Let X1, X2, ..., Xn be i.i.d. random variables with mean μ and variance σ². It is well known that E(X̄) = μ and Var(X̄) = σ²/n.
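A quick numerical check of the σ²/n variance reduction (a sketch added here; the distribution, sample size, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 25, 100_000                               # average n i.i.d. draws, many repetitions
x = rng.normal(loc=5.0, scale=2.0, size=(reps, n))  # sigma^2 = 4

print(x[:, 0].var())          # variance of a single X_i: roughly 4
print(x.mean(axis=1).var())   # variance of the mean X-bar: roughly 4 / 25 = 0.16
```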
By simply averaging estimates, we lower the variance while leaving the bias essentially unchanged. The aim is therefore to find methods that lower or at least stabilize the variance while keeping bias low, and, ideally, lower the bias as well. Since both goals cannot be fully attained at once, the search for more techniques continues. Minimize a general objective function:
Obj(Θ) = L(Θ) + Ω(Θ), where L(Θ) is the loss function, minimized to reduce bias, and Ω(Θ) is a regularization term that penalizes model complexity, with Θ = {w1, ..., wp} the set of model parameters.
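A minimal illustration of this objective (a sketch added here: squared-error loss standing in for L(Θ) and an L2 penalty for Ω(Θ); the function name and parameters are hypothetical):

```python
import numpy as np

def objective(w, X, y, lam=1.0):
    """Obj(w) = L(w) + Omega(w): fit term (bias) plus complexity penalty."""
    loss = np.mean((y - X @ w) ** 2)    # L(w): squared-error loss on the data
    complexity = lam * np.sum(w ** 2)   # Omega(w): L2 penalty on the parameters w1..wp
    return loss + complexity
```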
Some terminology for model combinations:
Ensembles: the general name.
Prediction/forecast combination: focusing on just the outcomes.
Model combination for parameters: Bayesian parameter averaging.
We focus on ensembles as Prediction/forecast combinations.
Ensembles: Bagging (bootstrap aggregation, Breiman, 1996). Adding randomness improves function estimation; bagging is a variance-reduction technique that reduces MSE. Let the initial data size be n.
1) Construct a bootstrap sample by randomly drawing n observations with replacement (note: some observations are repeated, others left out).
2) Compute the sample estimator (logistic or linear regression, tree, ANN, ...; in practice, a tree).
3) Repeat B times, with B large (50–100 or more in practice, though the right value is not known a priori).
4) Form the bagged estimator by aggregating the B estimates (a sketch follows below). For classification, Breiman recommends a majority vote over each observation's B classifications; Buhlmann (2003) recommends averaging the bootstrapped probabilities instead. Note that an individual observation need not appear in all B samples.
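A minimal sketch of steps 1)–4), assuming scikit-learn's DecisionTreeClassifier as the base learner and Buhlmann's probability-averaging variant for the aggregation; X and y are NumPy arrays, and names such as bagged_predict are illustrative, not from the source:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_trees(X, y, B=100, seed=0):
    """Steps 1-3: fit B trees, each on a bootstrap sample of size n drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # n draws with replacement (some rows repeat)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Step 4: average the B predicted probabilities, then take the top class.

    (Assumes every class appears in each bootstrap sample, so trees agree on class order.)
    """
    probs = np.mean([t.predict_proba(X) for t in trees], axis=0)
    return probs.argmax(axis=1)
```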
NB: Independent sequence of trees. What if …….?
Bagging reduces prediction error by lowering the variance of the aggregated predictor while keeping bias almost constant (the variance/bias trade-off). Friedman (1998) reconsidered boosting and bagging in terms of gradient-descent algorithms, seen later on.
[Figure: bagging example, from http://dni-institute.in/blogs/bagging-algorithm-concepts-with-example/]
Ensembles, evaluation. Empirical studies: boosting (seen later) yields smaller misclassification rates than bagging, reducing both bias and variance, with results varying across boosting algorithms (Breiman's arc-x4 and arc-gv). In cases with substantial noise, bagging performs better; it is especially used in clinical studies.
Why does bagging work? Breiman: bagging succeeds because it reduces the instability of the prediction method (unstable: small perturbations in the data produce large changes in the predictor). Experimental results show variance reduction. Studies suggest that bagging performs some smoothing on the estimates. Grandvalet (2004) argues that bootstrap sampling equalizes the effects of highly influential observations. Disadvantage: the bagged model cannot be visualized easily.
Ensembles: Adaptive bagging (Breiman, 2001) mixes bias-reducing boosting with variance-reducing bagging, and uses the out-of-bag observations to decide when to halt the optimizer.
Stacking: so far, the same technique was used throughout the ensemble. Stacking (Wolpert, 1992) instead combines different algorithms on a single data set, and voting is then used for the final classification. Ting and Witten (1999) “stack” the probability distributions (PD) instead. Stacking is a “meta-classifier”: it combines methods. Pros: takes the best from many methods. Cons: un-interpretable; the mixture of methods becomes a black box of predictions. Stacking is very prevalent in WEKA.
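A minimal sketch of the stacking idea, assuming scikit-learn's StackingClassifier as a stand-in (the slide mentions WEKA; the data set and the choice of base learners here are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Different algorithms applied to the same data set ...
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]
# ... combined by a meta-classifier trained on their cross-validated predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
print(stack.predict(X[:5]))
```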
Tree World.
2.1) L. Breiman: Bagging.
2.2) L. Breiman: Random Forests.
Explanation by way of a football example for the Saints (from https://gormanalysis.com/random-forest-from-top-to-bottom/):

#  Opponent  OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin  SaintsWon
1  Falcons      28  TRUE          TRUE            TRUE            TRUE
2  Cowgirls     16  TRUE          TRUE            TRUE            TRUE
3  Eagles       30  FALSE         FALSE           TRUE            TRUE
4  Bucs          6  TRUE          FALSE           TRUE            FALSE
5  Bucs         14  TRUE          FALSE           FALSE           FALSE
6  Panthers      9  FALSE         TRUE            TRUE            FALSE
7  Panthers     18  FALSE         FALSE           FALSE           FALSE
Goal: predict when the Saints will win. Five predictors: Opponent, opponent rank (OppRk), home game (SaintsAtHome), and the expert1 and expert2 predictions. If we run a single tree, it makes just one split, on Opponent, because the Saints lost only to the Bucs and Panthers, so that split separates the data perfectly; but it is useless for future opponents. Instead, at each step randomly select a subset of 3 (or 2, or 4) features and grow multiple weak but different trees, which, when combined, should form a smart model, as sketched below. [Figure: three example trees (e.g., Tree2, Tree3), splitting on features such as OppRk.]
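A minimal sketch of this idea on the table above, assuming scikit-learn's RandomForestClassifier; the one-hot encoding of Opponent and the parameter values are illustrative choices, not from the source:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# The Saints data from the table above.
games = pd.DataFrame({
    "Opponent":       ["Falcons", "Cowgirls", "Eagles", "Bucs", "Bucs", "Panthers", "Panthers"],
    "OppRk":          [28, 16, 30, 6, 14, 9, 18],
    "SaintsAtHome":   [True, True, False, True, True, False, False],
    "Expert1PredWin": [True, True, False, False, False, True, False],
    "Expert2PredWin": [True, True, True, True, False, True, False],
    "SaintsWon":      [True, True, True, False, False, False, False],
})

X = pd.get_dummies(games.drop(columns="SaintsWon"))   # one-hot encode Opponent
y = games["SaintsWon"]

# max_features=3: each split considers a random subset of 3 features,
# so the individual trees are weak and different, as described above.
rf = RandomForestClassifier(n_estimators=100, max_features=3, random_state=0)
rf.fit(X, y)
print(rf.predict(X))
```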