Ensemble Models and Gradient Boosting
Leonardo Auslender, Independent Statistical Consultant
Leonardo.Auslender ‘at’ Gmail ‘dot’ com
Copyright 2014
Topics to cover:
1) Why more techniques? Bias-variance tradeoff.
2) Ensembles:
   1) Bagging – stacking
   2) Random Forests
   3) Gradient Boosting (GB)
   4) Gradient-descent optimization method
   5) Innards of GB
   6) Overall Ensembles
   7) Partial Dependency Plots (PDP)
   8) Case Study
   9) XGBoost
   10) On the practice of Ensembles
   11) References
1) Why more techniques? Bias-variance tradeoff. (A broken clock is right twice a day: zero estimation variance, extremely high bias. A thermometer that is accurate overall but reads higher or lower at night: unbiased, higher variance. Always betting on the same horse has zero variance but is possibly extremely biased.)
Model error can be broken down mathematically into three components: bias, variance, and irreducible error. Let f be the true function being estimated and f̂ the empirically derived estimate.
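For squared-error loss, with y = f(x) + ε and Var(ε) = σ², a standard way to spell out this decomposition (added here as a sketch; the slide itself only names the three components) is:

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```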
[Figure: bias-variance quadrants illustrated with horse-betting outcomes: bet on the right horse and win (low bias, low variance); bet on many horses and win (low bias, high variance); bet on the wrong horse and lose (high bias, low variance); bet on many horses and lose (high bias, high variance).]
Credit: Scott Fortmann-Roe (web).
Let X1, X2, ..., Xn be i.i.d. random variables with mean μ and variance σ². It is well known that E(X̄) = μ and Var(X̄) = σ²/n.
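A quick numerical check of the σ²/n variance reduction (a sketch added here; the distribution, sample size, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 25, 100_000                               # average n i.i.d. draws, many repetitions
x = rng.normal(loc=5.0, scale=2.0, size=(reps, n))  # sigma^2 = 4

print(x[:, 0].var())          # variance of a single X_i: roughly 4
print(x.mean(axis=1).var())   # variance of the mean X-bar: roughly 4 / 25 = 0.16
```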
By simply averaging estimates, we lower the variance while leaving the bias essentially unchanged. The aim is therefore to find methods that lower or at least stabilize the variance while keeping bias low, and, ideally, lower the bias as well. Since both goals cannot be fully attained at once, the search for more techniques continues. Minimize a general objective function:
Obj(Θ) = L(Θ) + Ω(Θ), where L(Θ) is the loss function, minimized to reduce bias, and Ω(Θ) is a regularization term that penalizes model complexity, with Θ = {w1, ..., wp} the set of model parameters.
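A minimal illustration of this objective (a sketch added here: squared-error loss standing in for L(Θ) and an L2 penalty for Ω(Θ); the function name and parameters are hypothetical):

```python
import numpy as np

def objective(w, X, y, lam=1.0):
    """Obj(w) = L(w) + Omega(w): fit term (bias) plus complexity penalty."""
    loss = np.mean((y - X @ w) ** 2)    # L(w): squared-error loss on the data
    complexity = lam * np.sum(w ** 2)   # Omega(w): L2 penalty on the parameters w1..wp
    return loss + complexity
```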
Some terminology for model combinations:
Ensembles: the general name.
Prediction/forecast combination: focusing on just the outcomes.
Model combination for parameters: Bayesian parameter averaging.
We focus on ensembles as Prediction/forecast combinations.
Ensembles: Bagging (bootstrap aggregation, Breiman, 1996). Adding randomness improves function estimation; bagging is a variance-reduction technique that reduces MSE. Let the initial data size be n.
1) Construct a bootstrap sample by randomly drawing n observations with replacement (note: some observations are repeated, others left out).
2) Compute the sample estimator (logistic or linear regression, tree, ANN, ...; in practice, a tree).
3) Repeat B times, with B large (50–100 or more in practice, though the right value is not known a priori).
4) Form the bagged estimator by aggregating the B estimates (a sketch follows below). For classification, Breiman recommends a majority vote over each observation's B classifications; Buhlmann (2003) recommends averaging the bootstrapped probabilities instead. Note that an individual observation need not appear in all B samples.
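A minimal sketch of steps 1)–4), assuming scikit-learn's DecisionTreeClassifier as the base learner and Buhlmann's probability-averaging variant for the aggregation; X and y are NumPy arrays, and names such as bagged_predict are illustrative, not from the source:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_trees(X, y, B=100, seed=0):
    """Steps 1-3: fit B trees, each on a bootstrap sample of size n drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # n draws with replacement (some rows repeat)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Step 4: average the B predicted probabilities, then take the top class.

    (Assumes every class appears in each bootstrap sample, so trees agree on class order.)
    """
    probs = np.mean([t.predict_proba(X) for t in trees], axis=0)
    return probs.argmax(axis=1)
```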
NB: Independent sequence of trees. What if …….?
Bagging reduces prediction error by lowering the variance of the aggregated predictor while keeping bias almost constant (the variance/bias trade-off). Friedman (1998) reconsidered boosting and bagging in terms of gradient-descent algorithms, seen later on.
[Figure: bagging example, from http://dni-institute.in/blogs/bagging-algorithm-concepts-with-example/]
Ensembles, evaluation. Empirical studies: boosting (seen later) yields smaller misclassification rates than bagging, reducing both bias and variance, with results varying across boosting algorithms (Breiman's arc-x4 and arc-gv). In cases with substantial noise, bagging performs better; it is especially used in clinical studies.
Why does bagging work? Breiman: bagging succeeds because it reduces the instability of the prediction method (unstable: small perturbations in the data produce large changes in the predictor). Experimental results show variance reduction. Studies suggest that bagging performs some smoothing on the estimates. Grandvalet (2004) argues that bootstrap sampling equalizes the effects of highly influential observations. Disadvantage: the bagged model cannot be visualized easily.
Ensembles: Adaptive bagging (Breiman, 2001) mixes bias-reducing boosting with variance-reducing bagging, and uses the out-of-bag observations to decide when to halt the optimizer.
Stacking: so far, the same technique was used throughout the ensemble. Stacking (Wolpert, 1992) instead combines different algorithms on a single data set, and voting is then used for the final classification. Ting and Witten (1999) “stack” the probability distributions (PD) instead. Stacking is a “meta-classifier”: it combines methods. Pros: takes the best from many methods. Cons: un-interpretable; the mixture of methods becomes a black box of predictions. Stacking is very prevalent in WEKA.
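A minimal sketch of the stacking idea, assuming scikit-learn's StackingClassifier as a stand-in (the slide mentions WEKA; the data set and the choice of base learners here are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Different algorithms applied to the same data set ...
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]
# ... combined by a meta-classifier trained on their cross-validated predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
print(stack.predict(X[:5]))
```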
Tree World.
2.1) L. Breiman: Bagging.
2.2) L. Breiman: Random Forests.
Explanation by way of a football example for the Saints (from https://gormanalysis.com/random-forest-from-top-to-bottom/):

#  Opponent  OppRk  SaintsAtHome  Expert1PredWin  Expert2PredWin  SaintsWon
1  Falcons      28  TRUE          TRUE            TRUE            TRUE
2  Cowgirls     16  TRUE          TRUE            TRUE            TRUE
3  Eagles       30  FALSE         FALSE           TRUE            TRUE
4  Bucs          6  TRUE          FALSE           TRUE            FALSE
5  Bucs         14  TRUE          FALSE           FALSE           FALSE
6  Panthers      9  FALSE         TRUE            TRUE            FALSE
7  Panthers     18  FALSE         FALSE           FALSE           FALSE
Goal: predict when the Saints will win. Five predictors: Opponent, opponent rank (OppRk), home game (SaintsAtHome), and the expert1 and expert2 predictions. If we run a single tree, it makes just one split, on Opponent, because the Saints lost only to the Bucs and Panthers, so that split separates the data perfectly; but it is useless for future opponents. Instead, at each step randomly select a subset of 3 (or 2, or 4) features and grow multiple weak but different trees, which, when combined, should form a smart model, as sketched below. [Figure: three example trees (e.g., Tree2, Tree3), splitting on features such as OppRk.]
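A minimal sketch of this idea on the table above, assuming scikit-learn's RandomForestClassifier; the one-hot encoding of Opponent and the parameter values are illustrative choices, not from the source:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# The Saints data from the table above.
games = pd.DataFrame({
    "Opponent":       ["Falcons", "Cowgirls", "Eagles", "Bucs", "Bucs", "Panthers", "Panthers"],
    "OppRk":          [28, 16, 30, 6, 14, 9, 18],
    "SaintsAtHome":   [True, True, False, True, True, False, False],
    "Expert1PredWin": [True, True, False, False, False, True, False],
    "Expert2PredWin": [True, True, True, True, False, True, False],
    "SaintsWon":      [True, True, True, False, False, False, False],
})

X = pd.get_dummies(games.drop(columns="SaintsWon"))   # one-hot encode Opponent
y = games["SaintsWon"]

# max_features=3: each split considers a random subset of 3 features,
# so the individual trees are weak and different, as described above.
rf = RandomForestClassifier(n_estimators=100, max_features=3, random_state=0)
rf.fit(X, y)
print(rf.predict(X))
```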