Gradient Boosted Regression Trees
scikit-learn
Peter Prettenhofer (@pprett)
Gilles Louppe (@glouppe)
DataRobot
Université de Liège, Belgium
Motivation
Outline
1 Basics
2 Gradient Boosting
3 Gradient Boosting in scikit-learn
4 Case Study: California housing
About us

Peter
• @pprett
• Python & ML ∼ 6 years
• sklearn dev since 2010

Gilles
• @glouppe
• PhD student (Liège, Belgium)
• sklearn dev since 2011
• Chief tree hugger
Machine Learning 101

• Data comes as...
  • A set of examples {(x_i, y_i) | 0 ≤ i < n_samples}, with
  • Feature vector x ∈ R^n_features, and
  • Response y ∈ R (regression) or y ∈ {−1, 1} (classification)
• Goal is to...
  • Find a function ŷ = f(x)
  • Such that the error L(y, ŷ) on new (unseen) x is minimal
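As a concrete illustration of these conventions (a sketch, not code from the deck; the dataset and estimator are arbitrary choices), scikit-learn expects X of shape (n_samples, n_features) and y of shape (n_samples,), with the error measured on held-out data:

from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# A set of examples {(x_i, y_i)}: X is (n_samples, n_features), y is (n_samples,)
X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Find a function y_hat = f(x) ...
est = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)

# ... such that the error L(y, y_hat) on new (unseen) x is small
print(mean_squared_error(y_test, est.predict(X_test)))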
Classification and Regression Trees [Breiman et al., 1984]
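A minimal sketch (assumed, not from the deck) of fitting such a CART regression tree with sklearn.tree.DecisionTreeRegressor on the California housing data used in the case study, then inspecting its root split:

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor

data = fetch_california_housing()
X, y = data.data, data.target

# A shallow CART: axis-aligned splits chosen to reduce squared error
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Feature and threshold used at the root split (in practice often MedInc)
print(data.feature_names[tree.tree_.feature[0]], tree.tree_.threshold[0])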
Gradient Boosting in scikit-learn

>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(n_samples=10000)
>>> est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
>>> est.fit(X, y)
...
>>> # get predictions
>>> pred = est.predict(X)
>>> est.predict_proba(X)[0]  # class probabilities
array([ 0.67,  0.33])
Implementation • Written in pure Python/Numpy (easy to extend). • Builds on top of sklearn.tree.DecisionTreeRegressor (Cython). • Custom node splitter that uses pre-sorting (better for shallow trees).
Example

import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

est = GradientBoostingRegressor(n_estimators=2000, max_depth=1).fit(X, y)
for pred in est.staged_predict(X):
    plt.plot(X[:, 0], pred, color='r', alpha=0.1)
[Figure: ground truth vs. RT max_depth=1, RT max_depth=3, and GBRT max_depth=1 on a 1-D toy problem, annotated "high bias - low variance" and "low bias - high variance"]
Model complexity & Overfitting

test_score = np.empty(len(est.estimators_))
for i, pred in enumerate(est.staged_predict(X_test)):
    test_score[i] = est.loss_(y_test, pred)
plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')
[Figure: train and test error vs. n_estimators (1 to 1000); the lowest test error and the train-test gap are annotated]
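For completeness, a minimal setup under which the snippet above can run (assumed, not shown in the deck): the data, split, and parameters are illustrative, and est.loss_ reflects the scikit-learn API of that time.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

n_estimators = 1000
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a fixed (large) number of stages; staged_predict then replays the
# test error after 1, 2, ..., n_estimators trees
est = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=3)
est.fit(X_train, y_train)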
Regularization

GBRT provides a number of knobs to control overfitting:
• Tree structure
• Shrinkage
• Stochastic Gradient Boosting
Regularization: Tree structure

• The max_depth of the trees controls the degree of feature interactions
• Use min_samples_leaf to ensure a sufficient number of samples per leaf
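A sketch of how these two knobs appear in code; the parameter values are illustrative assumptions, not recommendations from the deck:

from sklearn.ensemble import GradientBoostingRegressor

# Shallow trees limit the order of feature interactions each stage can capture;
# min_samples_leaf keeps leaves from being fit on too few examples
est = GradientBoostingRegressor(n_estimators=500,
                                max_depth=4,
                                min_samples_leaf=20)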
Regularization: Shrinkage

• Slow learning by shrinking tree predictions with 0 < learning_rate ≤ 1
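A sketch combining shrinkage with row and column subsampling (stochastic gradient boosting); the values are illustrative, and a smaller learning_rate generally calls for a larger n_estimators:

from sklearn.ensemble import GradientBoostingRegressor

est = GradientBoostingRegressor(learning_rate=0.05,  # shrink each tree's contribution
                                n_estimators=2000,   # compensate with more stages
                                subsample=0.8,       # stochastic gradient boosting (row subsampling)
                                max_features=0.5)    # consider half of the features per split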
Model interpretation

Which features are important?

>>> est.feature_importances_
array([ 0.01,  0.38, ...])

[Figure: relative feature importance for the California housing model: MedInc, AveRooms, Longitude, AveOccup, Latitude, AveBedrms, Population, HouseAge]
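To reproduce a ranking like the one in the figure, the importances can be paired with the feature names; here names is assumed to hold the California housing feature names used elsewhere in the deck:

import numpy as np

# Sort features by relative importance, highest first
order = np.argsort(est.feature_importances_)[::-1]
for i in order:
    print('%-12s %.3f' % (names[i], est.feature_importances_[i]))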
Model interpretation

What is the effect of a feature on the response?

from sklearn.ensemble import partial_dependence as pd
features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms',
            ('AveOccup', 'HouseAge')]
fig, axs = pd.plot_partial_dependence(est, X_train, features,
                                      feature_names=names)

[Figure: partial dependence of house value on the non-location features of the California housing dataset: MedInc, AveOccup, HouseAge, AveRooms, plus a two-way AveOccup vs. HouseAge plot]
Model interpretation

Automatically detects spatial effects

[Figure: two maps of the partial dependence of median house value on latitude and longitude]
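A sketch of how such a map can be produced with the same plot_partial_dependence helper, by passing the location features as a two-way pair; est, X_train, and names are the objects assumed in the earlier snippets:

from sklearn.ensemble import partial_dependence as pd

# Two-way partial dependence over the location features yields the spatial map
fig, axs = pd.plot_partial_dependence(est, X_train,
                                      [('Latitude', 'Longitude')],
                                      feature_names=names)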
Summary
• Flexible non-parametric classification and regression technique • Applicable to a variety of problems • Solid, battle-worn implementation in scikit-learn
Thanks! Questions?
Benchmarks

[Figure: error, train time, and test time of gbm vs. sklearn-0.15 on the bioresp, YahooLTRC, Spam, Solar, Madelon, Expedia, Example 10.2, Covtype, California, Boston, and Arcene datasets]
Tips & Tricks 1

Input layout
Use dtype=np.float32 to avoid memory copies, and Fortran layout for a slight runtime benefit.

X = np.asfortranarray(X, dtype=np.float32)
Tips & Tricks 2

Feature interactions
GBRT automatically detects feature interactions, but explicit interaction features often help. Trees required to approximate X1 − X2: 10 (left), 1000 (right).
[Figure: two surface plots of the GBRT approximation of x − y]
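One way to add such an explicit interaction is to append the engineered column before fitting; the snippet below is an assumed sketch, not code from the deck:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Append x1 - x2 as an extra column so the difference can be split on directly,
# instead of being approximated through many interacting trees
X_aug = np.hstack([X, (X[:, 0] - X[:, 1]).reshape(-1, 1)])
est = GradientBoostingRegressor(n_estimators=100, max_depth=1).fit(X_aug, y)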
Tips & Tricks 3

Categorical variables
scikit-learn requires that categorical variables are encoded as numerics. Tree-based methods work well with ordinal encoding:

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'icao': ['CRJ2', 'A380', 'B737', 'B737']})
# ordinal encoding
df_enc = pd.DataFrame(data={'icao': np.unique(df.icao, return_inverse=True)[1]})
X = np.asfortranarray(df_enc.values, dtype=np.float32)