
Ensemble Models and Gradient Boosting
Leonardo Auslender, Independent Statistical Consultant
Leonardo.Auslender ‘at’ Gmail ‘dot’ com
Copyright 2014.


Topics to cover:
1) Why more techniques? Bias-variance tradeoff.
2) Ensembles:
   1) Bagging – stacking
   2) Random Forests
   3) Gradient Boosting (GB)
   4) Gradient-descent optimization method
   5) Innards of GB
   6) Overall ensembles
   7) Partial Dependency Plots (PDP)
   8) Case study
   9) XGBoost
   10) On the practice of ensembles
   11) References


1) Why more techniques? Bias-variance tradeoff.

(A broken clock is right twice a day: variance of estimation = 0, bias extremely high. A thermometer may be accurate overall but report higher/lower temperatures at night: unbiased, higher variance. Betting on the same horse always has zero variance, but is possibly extremely biased.)

Model error can be decomposed mathematically into three components. Let f be the function being estimated and f̂ the empirically derived estimate.
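The slide presumably displayed the standard decomposition under squared-error loss; a minimal LaTeX rendering of it (with σ² denoting the irreducible noise variance):

```latex
\mathbb{E}\!\left[(y - \hat f(x))^2\right]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```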

[Figure: betting analogy for bias and variance — bet on the right horse and win; bet on many horses and win; bet on the wrong horse and lose; bet on many horses and lose.]

[Figure: bias-variance tradeoff illustration. Credit: Scott Fortmann-Roe (web).]

Let X₁, X₂, X₃, … be i.i.d. random variables with mean μ and variance σ². It is well known that E(X̄) = μ and Var(X̄) = σ²/n, where X̄ is the average of n such variables.
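For intuition, a quick simulation (an illustrative sketch, not part of the original slides) confirming that averaging n i.i.d. draws keeps the mean but shrinks the variance by roughly a factor of n:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, n_reps = 5.0, 2.0, 50, 10_000

# Each replication: draw n i.i.d. values and average them.
means = rng.normal(mu, sigma, size=(n_reps, n)).mean(axis=1)

print("E(X-bar)   ~", means.mean())   # close to mu
print("Var(X-bar) ~", means.var())    # close to sigma**2 / n
print("sigma^2/n  =", sigma**2 / n)
```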

By simply averaging estimates, we lower variance while keeping bias essentially unchanged. The goal is therefore to find methods that lower, or at least stabilize, variance while keeping bias low, and ideally lower the bias as well. Since both goals cannot be fully attained at once, we keep searching for more techniques. In general, we minimize an objective function of the form:

Obj(Θ)  L(Θ)  Ω(Θ), L(Θ)  Minimize loss function to reduce bias. Ω(Θ)  Regularization, minimize model complexity. where Ω  {w1,,,,,, wp }, set of model parameters. 3/23/2018


Some terminology for model combinations:
1) Ensembles: the general name.
2) Prediction/forecast combination: focuses on combining outcomes only.
3) Model combination for parameters: Bayesian parameter averaging.

We focus on ensembles as prediction/forecast combinations.


Ensembles: Bagging (bootstrap aggregation, Breiman, 1996). Adding randomness improves function estimation; bagging is a variance-reduction technique that reduces MSE. Let the initial data size be n.
1) Construct a bootstrap sample by randomly drawing n observations with replacement (note: some observations are repeated).
2) Compute the sample estimator (logistic or linear regression, tree, ANN, …; a tree in practice).
3) Repeat B times, with B large (50–100 or more in practice; the optimal B is unknown).
4) Form the bagged estimator. For classification, Breiman recommends a majority vote over the B classifications for each observation; Buhlmann (2003) recommends averaging the bootstrapped probabilities instead. Note that an individual observation may not appear in every bootstrap sample. (A code sketch of the procedure follows below.)
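A minimal sketch of steps 1)–4), assuming scikit-learn decision trees as the base learner and a binary target; it averages the bootstrapped probabilities (Buhlmann) rather than taking a majority vote:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, seed=0):
    """Fit B trees, each on a bootstrap sample of the n observations."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # draw n times with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict_proba(trees, X):
    """Average bootstrapped probabilities (assumes both classes appear in every bootstrap sample)."""
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
```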

NB: Independent sequence of trees. What if …….?

Bagging reduces prediction error by lowering the variance of the aggregated predictor while keeping bias almost constant (the variance/bias trade-off). Friedman (1998) reconsidered boosting and bagging in terms of gradient-descent algorithms, seen later on.


[Figure: bagging schematic, from http://dni-institute.in/blogs/bagging-algorithm-concepts-with-example/]


Ensemble evaluation. Empirical studies: boosting (seen later) achieves smaller misclassification rates than bagging, reducing both bias and variance; results differ across boosting algorithms (Breiman’s arc-x4 and arc-gv). In cases with substantial noise, bagging performs better; it is especially used in clinical studies.

Why does bagging work? Breiman: bagging succeeds because it reduces the instability of the prediction method (unstable: small perturbations in the data produce large changes in the predictor). Experimental results show variance reduction, and studies suggest that bagging performs some smoothing on the estimates. Grandvalet (2004) argues that bootstrap sampling equalizes the effects of highly influential observations. Disadvantage: a bagged model cannot be visualized easily.


Ensembles: Adaptive bagging (Breiman, 2001). Mixes bias-reducing boosting with variance-reducing bagging; uses the out-of-bag observations to decide when to halt the optimizer.

Stacking: so far, the same base technique was used throughout. Stacking (Wolpert, 1992) instead combines different algorithms on a single data set, with voting used for the final classification; Ting and Witten (1999) “stack” the predicted probability distributions (PD) instead. Stacking is a “meta-classifier”: it combines methods. Pros: takes the best from many methods. Cons: uninterpretable; the mixture of methods becomes a black box of predictions. Stacking is very prevalent in WEKA. (A minimal sketch follows below.)
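A minimal stacking sketch, assuming scikit-learn rather than WEKA: different base algorithms are fit on the same data set, and a meta-classifier combines their (cross-validated) predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Different algorithms on the same data set ...
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]
# ... combined by a meta-classifier trained on their predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
print(stack.score(X, y))
```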


Tree world:
1) L. Breiman: Bagging.
2) L. Breiman: Random Forests.


Explanation by way of a football example for the Saints (https://gormanalysis.com/random-forest-from-top-to-bottom/):

#  Opponent   OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin  SaintsWon
1  Falcons     28      TRUE           TRUE            TRUE           TRUE
2  Cowgirls    16      TRUE           TRUE            TRUE           TRUE
3  Eagles      30      FALSE          FALSE           TRUE           TRUE
4  Bucs         6      TRUE           FALSE           TRUE           FALSE
5  Bucs        14      TRUE           FALSE           FALSE          FALSE
6  Panthers     9      FALSE          TRUE            TRUE           FALSE
7  Panthers    18      FALSE          FALSE           FALSE          FALSE

Goal: predict when the Saints will win. Five predictors: opponent, opponent rank, home game, and the expert 1 and expert 2 predictions. A single tree makes just one split, on Opponent, because the Saints lost to the Bucs and Panthers and that split gives perfect separation; but it is useless for future opponents. Instead, at each step randomly select a subset of 3 (or 2, or 4) features and grow multiple weak but different trees, which, when combined, should form a smart model. (A code sketch follows below.)

[Figures: three example trees grown on random feature subsets (Tree 2, Tree 3, one splitting on OppRk).]
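A minimal random-forest-style sketch, assuming scikit-learn and a hypothetical 0/1 encoding of the table above; max_features controls the size of the random feature subset tried at each split:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical encoding of the Saints table above (opponent one-hot encoded, booleans as 0/1).
df = pd.DataFrame({
    "Opponent": ["Falcons", "Cowgirls", "Eagles", "Bucs", "Bucs", "Panthers", "Panthers"],
    "OppRk": [28, 16, 30, 6, 14, 9, 18],
    "SaintsAtHome": [1, 1, 0, 1, 1, 0, 0],
    "Expert1PredWin": [1, 1, 0, 0, 0, 1, 0],
    "Expert2PredWin": [1, 1, 1, 1, 0, 1, 0],
    "SaintsWon": [1, 1, 1, 0, 0, 0, 0],
})
X = pd.get_dummies(df.drop(columns="SaintsWon"))
y = df["SaintsWon"]

# Many weak, different trees: bootstrap samples plus a random subset of features per split.
rf = RandomForestClassifier(n_estimators=200, max_features=3, random_state=0)
rf.fit(X, y)
print(rf.predict_proba(X)[:, 1])   # in-sample win probabilities
```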