Project "Rossmann Store Sales". Kaggle Competition Script Score: 0.18613. Rank: https://www.kaggle.com/kushal1412/results. Brief Summary of the Data.
Project "Rossmann Store Sales" Kaggle Competition Script Score: 0.18613 Rank: https://www.kaggle.com/kushal1412/results
Brief Summary of the Data We have historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set.
Files train.csv - historical data including Sales test.csv - historical data excluding Sales store.csv - supplemental information about the stores
Data fields Id - an Id that represents a (Store, Date) duple within the test set Store - a unique Id for each store Sales - the turnover for any given day (this is what you are predicting) Customers - the number of customers on a given day Open - an indicator for whether the store was open: 0 = closed, 1 = open StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools StoreType - differentiates between 4 different store models: a, b, c, d Assortment - describes an assortment level: a = basic, b = extra, c = extended CompetitionDistance - distance in meters to the nearest competitor store CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened Promo - indicates whether a store is running a promo on that day Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2 PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store.
Data Cleaning and Tranforming Importing the basic libraries and the data set into pandas. In [1]: import pandas as pd import statsmodels.api as sm import matplotlib as plt import numpy as np from sklearn import cross_validation as cv test = pd.read_csv('test.csv', low_memory=False, parse_dates = ['Date']) train = pd.read_csv('train.csv', low_memory=False, parse_dates = ['Date']) store = pd.read_csv('store.csv', low_memory=False)
Checking the statistics of test set. In [2]: test.describe() Out[2]: Id
Store
DayOfWeek
Open
Promo
SchoolHoliday
count 41088.000000 41088.000000 41088.000000 41077.000000 41088.000000 41088.000000 mean 20544.500000 555.899533
3.979167
0.854322
0.395833
0.443487
std
11861.228267 320.274496
2.015481
0.352787
0.489035
0.496802
min
1.000000
1.000000
0.000000
0.000000
0.000000
25%
10272.750000 279.750000
2.000000
1.000000
0.000000
0.000000
50%
20544.500000 553.500000
4.000000
1.000000
0.000000
0.000000
75%
30816.250000 832.250000
6.000000
1.000000
1.000000
1.000000
max
41088.000000 1115.000000
7.000000
1.000000
1.000000
1.000000
1.000000
Removing NaNs values from Open and replacing it with 1 assuming those 11 stores are open. In [3]: test.loc[test.Open.isnull(),'Open'] = 1 test.describe() Out[3]: Id
Store
DayOfWeek
Open
Promo
SchoolHoliday
count 41088.000000 41088.000000 41088.000000 41088.000000 41088.000000 41088.000000 mean 20544.500000 555.899533
3.979167
0.854361
0.395833
0.443487
std
11861.228267 320.274496
2.015481
0.352748
0.489035
0.496802
min
1.000000
1.000000
0.000000
0.000000
0.000000
25%
10272.750000 279.750000
2.000000
1.000000
0.000000
0.000000
50%
20544.500000 553.500000
4.000000
1.000000
0.000000
0.000000
75%
30816.250000 832.250000
6.000000
1.000000
1.000000
1.000000
max
41088.000000 1115.000000
7.000000
1.000000
1.000000
1.000000
1.000000
Adding Year, Month, and Week Cols and checking the statistics of train set. In [4]: train['Year'] = train['Date'].dt.year train['Month'] = train['Date'].dt.month train['Week'] = train['Date'].dt.week train.describe() Out[4]: Store
DayOfWeek
Sales
Customers
Open
Promo
SchoolHoliday
Year
count 1017209.000000 1017209.000000 1017209.000000 1017209.000000 1017209.000000 1017209.000000 1017209.000000 1017209.000 mean 558.429727
3.998341
5773.818972
633.145946
0.830107
0.381515
0.178647
2013.832292
std
321.908651
1.997391
3849.926175
464.411734
0.375539
0.485759
0.383056
0.777396
min
1.000000
1.000000
0.000000
0.000000
0.000000
0.000000
0.000000
2013.000000
25%
280.000000
2.000000
3727.000000
405.000000
1.000000
0.000000
0.000000
2013.000000
50%
558.000000
4.000000
5744.000000
609.000000
1.000000
0.000000
0.000000
2014.000000
75%
838.000000
6.000000
7856.000000
837.000000
1.000000
1.000000
0.000000
2014.000000
max
1115.000000
7.000000
41551.000000
7388.000000
1.000000
1.000000
1.000000
2015.000000
Removing rows where Sales is zero and checking the training set. In [5]: train = train.loc[train.Sales > 0] train.describe() Out[5]: Store
DayOfWeek
Sales
Customers
Open
Promo
SchoolHoliday Year
M onth
count 844338.000000 844338.000000 844338.000000 844338.000000 844338 844338.000000 844338.000000 844338.000000 844338.0000 mean 558.421374
3.520350
6955.959134
762.777166
1
0.446356
0.193578
2013.831945
5.845774
std
321.730861
1.723712
3103.815515
401.194153
0
0.497114
0.395102
0.777271
3.323959
min
1.000000
1.000000
46.000000
8.000000
1
0.000000
0.000000
2013.000000
1.000000
25%
280.000000
2.000000
4859.000000
519.000000
1
0.000000
0.000000
2013.000000
3.000000
50%
558.000000
3.000000
6369.000000
676.000000
1
0.000000
0.000000
2014.000000
6.000000
75%
837.000000
5.000000
8360.000000
893.000000
1
1.000000
0.000000
2014.000000
8.000000
max
1115.000000
7.000000
41551.000000
7388.000000
1
1.000000
1.000000
2015.000000
12.000000
Checking the statistics of store set. In [6]: store.describe() Out[6]: Store
CompetitionDistance CompetitionOpenSinceM onth CompetitionOpenSinceYear Promo2
Promo2SinceWeek
count 1115.00000 1112.000000
761.000000
761.000000
1115.000000 571.000000
mean 558.00000
5404.901079
7.224704
2008.668857
0.512108
23.595447
std
322.01708
7663.174720
3.212348
6.195983
0.500078
14.141984
min
1.00000
20.000000
1.000000
1900.000000
0.000000
1.000000
25%
279.50000
717.500000
4.000000
2006.000000
0.000000
13.000000
50%
558.00000
2325.000000
8.000000
2010.000000
1.000000
22.000000
75%
836.50000
6882.500000
10.000000
2013.000000
1.000000
37.000000
max
1115.00000 75860.000000
12.000000
2015.000000
1.000000
50.000000
Removing null values from each column of store set and replacing it with their respective means and checking the store statistics again. In [7]: store.loc[store.CompetitionDistance.isnull(),'CompetitionDistance']=int(round(store.CompetitionDistance.mean ())) store.loc[store.CompetitionOpenSinceMonth.isnull(),'CompetitionOpenSinceMonth']=int(round(store.Competition OpenSinceMonth.mean())) store.loc[store.CompetitionOpenSinceYear.isnull(),'CompetitionOpenSinceYear']=int(round(store.CompetitionOp enSinceYear.mean())) store.loc[store.Promo2SinceWeek.isnull(),'Promo2SinceWeek']=int(round(store.Promo2SinceWeek.mean())) store.loc[store.Promo2SinceYear.isnull(),'Promo2SinceYear']=int(round(store.Promo2SinceYear.mean())) store.describe()
Out[7]: Store
CompetitionDistance CompetitionOpenSinceM onth CompetitionOpenSinceYear Promo2
Promo2SinceWeek
count 1115.00000 1115.000000
1115.000000
1115.000000
1115.000000 1115.000000
mean 558.00000
5404.901345
7.153363
2008.773991
0.512108
23.792825
std
322.01708
7652.849306
2.655365
5.120018
0.500078
10.117938
min
1.00000
20.000000
1.000000
1900.000000
0.000000
1.000000
25%
279.50000
720.000000
6.000000
2008.000000
0.000000
22.000000
50%
558.00000
2330.000000
7.000000
2009.000000
1.000000
24.000000
75%
836.50000
6875.000000
9.000000
2011.000000
1.000000
24.000000
max
1115.00000 75860.000000
12.000000
2015.000000
1.000000
50.000000
Merging the tables train and store and replacing string values in store columns for computational purpose and then checking the merged table statistics. In [8]: data = train.merge(store,on='Store') data = data.drop(['Open'], axis=1) data.StoreType.replace({'0':0,'a':1,'b':2,'c':3,'d':4},inplace=True) data.Assortment.replace({'0':0,'a':1,'b':2,'c':3,'d':4},inplace=True) data.StateHoliday.replace({'0':0,'a':1,'b':2,'c':3,'d':4},inplace=True) data.describe() Out[8]: Store
DayOfWeek
Sales
Customers
Promo
StateHoliday
SchoolHoliday Year
count 844338.000000 844338.000000 844338.000000 844338.000000 844338.000000 844338.000000 844338.000000 844338.000000 mean 558.421374
3.520350
6955.959134
762.777166
0.446356
0.001418
0.193578
2013.831945
std
321.730861
1.723712
3103.815515
401.194153
0.497114
0.047578
0.395102
0.777271
min
1.000000
1.000000
46.000000
8.000000
0.000000
0.000000
0.000000
2013.000000
25%
280.000000
2.000000
4859.000000
519.000000
0.000000
0.000000
0.000000
2013.000000
50%
558.000000
3.000000
6369.000000
676.000000
0.000000
0.000000
0.000000
2014.000000
75%
837.000000
5.000000
8360.000000
893.000000
1.000000
0.000000
0.000000
2014.000000
max
1115.000000
7.000000
41551.000000
7388.000000
1.000000
3.000000
1.000000
2015.000000
Initial Feature Engineering: Calculating the "competition open since" time in months. Calculating the "Promo open since" time in months. In [9]: data['CompetitionOpen'] = 12 * (data.Year - data.CompetitionOpenSinceYear) + \ (data.Month - data.CompetitionOpenSinceMonth) data['CompetitionOpen'] = data.CompetitionOpen.apply(lambda x: x if x > 0 else 0) data.drop(['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear'], axis = 1, inplace = True) data['PromoOpen'] = 12 * (data.Year - data.Promo2SinceYear) + \ (data.Week - data.Promo2SinceWeek) / float(4) data['PromoOpen'] = data.PromoOpen.apply(lambda x: x if x > 0 else 0) data.drop(['Promo2SinceYear', 'Promo2SinceWeek', 'PromoInterval'], axis = 1, inplace = True) data.describe()
Out[9]: Store
DayOfWeek
Sales
Customers
Promo
StateHoliday
SchoolHoliday Year
count 844338.000000 844338.000000 844338.000000 844338.000000 844338.000000 844338.000000 844338.000000 844338.000000 mean 558.421374
3.520350
6955.959134
762.777166
0.446356
0.001418
0.193578
2013.831945
std
321.730861
1.723712
3103.815515
401.194153
0.497114
0.047578
0.395102
0.777271
min
1.000000
1.000000
46.000000
8.000000
0.000000
0.000000
0.000000
2013.000000
25%
280.000000
2.000000
4859.000000
519.000000
0.000000
0.000000
0.000000
2013.000000
50%
558.000000
3.000000
6369.000000
676.000000
0.000000
0.000000
0.000000
2014.000000
75%
837.000000
5.000000
8360.000000
893.000000
1.000000
0.000000
0.000000
2014.000000
max
1115.000000
7.000000
41551.000000
7388.000000
1.000000
3.000000
1.000000
2015.000000
Exploratory Analysis Checking the correlation between the features in the data with respect to Sales. In [10]: data.corr()['Sales'] Out[10]: Store 0.007723 DayOfWeek -0.178753 Sales 1.000000 Customers 0.823552 Promo 0.368199 StateHoliday 0.020106 SchoolHoliday 0.038635 Year 0.036151 Month 0.073589 Week 0.074463 StoreType -0.016211 Assortment 0.109015 CompetitionDistance -0.036401 Promo2 -0.127556 CompetitionOpen -0.001543 PromoOpen 0.033863 Name: Sales, dtype: float64
Plotting histograms to see the skewness in the data. In [11]: import matplotlib.pyplot as pyplot data['Sales'].hist(bins=25, figsize=(16,5)) pyplot.title('Sales Histogram') Out[11]:
In [12]: data['Customers'].hist(bins=25, figsize=(16,5)) pyplot.title('Customers Histogram') Out[12]:
In [13]: data['Promo'].hist(bins=4, figsize=(16,5)) pyplot.title('Promo Histogram') Out[13]:
In [14]: data['Assortment'].hist(bins=12, figsize=(16,5)) pyplot.title('Assortment Histogram') Out[14]:
In [15]: data['StoreType'].hist(bins=16, figsize=(16,5)) pyplot.title('StoreType Histogram') Out[15]:
Visualizing the density distrbution of Customers above and below the its median. In [16]: med = data.Customers.median() print 'Median of Customers = %.2f'%(med) data.query('Customers = @med')['Customers'].plot(kind='kde', color='blue', figsize=(16,5)) Median of Customers = 676.00 Out[16]:
Checking the Scatter plot of Sales with respect to Customer In [17]: data.plot(kind='scatter', x='Customers', y='Sales', figsize=(12,6)) Out[17]: /usr/local/lib/python2.7/dist-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison if self._edgecolors == str('face'):
Removing the outliers (Unlikely event) In [18]: data = data.query('Customers < 6000 and Sales < 40000') data.plot(kind='scatter', x='Customers', y='Sales', figsize=(12,6)) Out[18]:
Defining the data into predictor and response set and creating a basic prediction model by using OLS linear regression. Selecting only 2 features which have highest correlation with sales into the predictor set.
In [19]: pred = data[['Customers','Promo']].copy() # Adding a constant to the predictor set. pred = sm.add_constant(pred) # Taking sales as the response set. resp = data.Sales.copy() # Fitting the OLS model ols = sm.OLS(resp,pred) ols_fit = ols.fit() # Printing the summary print "AIC: %.2f || R2: %.2f"%(ols_fit.aic, ols_fit.rsquared) ols_fit.summary() AIC: 14876321.76 || R2: 0.73 Out[19]: OLS Regression Results Dep. Variable:
Sales
R-squared:
0.727
M odel:
OLS
Adj. R-squared:
0.727
M ethod:
Least Squares
F-statistic:
1.126e+06
Date:
Tue, 01 Dec 2015 Prob (F-statistic): 0.00
Time:
13:15:57
Log-Likelihood:
-7.4382e+06
No. Observations: 844336
AIC:
1.488e+07
Df Residuals:
844333
BIC:
1.488e+07
Df M odel:
2
Covariance Type: nonrobust coef const
std err t
1710.9245 3.916
Customers 6.0537 Promo Omnibus:
0.004
1405.6589 3.609
436.954
P>|t|
[95.0% Conf. Int.]
0.000 1703.250 1718.599
1353.640 0.000 6.045 6.062 389.527
0.000 1398.586 1412.732
132520.272 Durbin-Watson:
0.322
Prob(Omnibus): 0.000
Jarque-Bera (JB): 1827090.441
Skew:
0.310
Prob(JB):
0.00
Kurtosis:
10.180
Cond. No.
2.07e+03
Feature Engineering:- Adding new features to the defined predictor set. Sales per Customer (Amount of Sales on that Day per Customer on that day) Sales per Day of the Week Sales per Store Type (Amount of Sales in particular Store Type on that day)
In [20]: # New features Sales per Customer is improving the Linear Model by approximately 18 percent pred['SalesPerCust'] = data.Sales/data.Customers # New features Sales per Assortment is improving the Linear Model by approximately 1.1 percent pred['SalesPerDayOfWeek'] = data.Sales/data.DayOfWeek # New features Sales per StoreType is improving the Linear Model by approximately 1.1 percent pred['SalesPerType'] = data.Sales/data.StoreType # Fitting the OLS Model ols = sm.OLS(resp,pred) ols_fit = ols.fit() #Checking the results after Feature Engineering print "AIC: %.2f || R2: %.2f"%(ols_fit.aic, ols_fit.rsquared) ols_fit.summary() AIC: 13718133.13 || R2: 0.93 Out[20]: OLS Regression Results Dep. Variable:
Sales
R-squared:
0.931
M odel:
OLS
Adj. R-squared:
0.931
M ethod:
Least Squares
F-statistic:
2.273e+06
Date:
Tue, 01 Dec 2015 Prob (F-statistic): 0.00
Time:
13:15:57
Log-Likelihood:
-6.8591e+06
No. Observations: 844336
AIC:
1.372e+07
Df Residuals:
844330
BIC:
1.372e+07
Df M odel:
5
Covariance Type: nonrobust coef
std err t
P>|t|
[95.0% Conf. Int.]
const
-4974.2880 5.412
-919.178 0.000 -4984.895 -4963.681
Customers
6.4351
0.003
1967.010 0.000 6.429 6.442
Promo
161.5708
1.986
81.372
SalesPerCust
639.0728
0.484
1320.911 0.000 638.125 640.021
SalesPerDayOfWeek 0.1072
0.000
290.974
0.000 0.106 0.108
SalesPerType
0.000
356.952
0.000 0.119 0.121
0.1201
Omnibus:
419331.860 Durbin-Watson:
0.000 157.679 165.462
0.342
Prob(Omnibus): 0.000
Jarque-Bera (JB): 19045818.178
Skew:
-1.682
Prob(JB):
0.00
Kurtosis:
26.023
Cond. No.
4.19e+04
Trying to further improve the model by taking Polynomial Features. In [21]: for p in range (1,6): pred['Custmers^%d'%p] = pred.Customers.pow(p) ols = sm.OLS(resp,pred) ols_fit = ols.fit() print "Polynomial %d || AIC: %.2f || R2: %.4f"%(p, ols_fit.aic, ols_fit.rsquared) Polynomial Polynomial Polynomial Polynomial Polynomial
1 2 3 4 5
|| || || || ||
AIC: AIC: AIC: AIC: AIC:
13718135.32 13376105.34 13333519.58 13325533.80 14683939.64
|| || || || ||
R2: R2: R2: R2: R2:
0.9308 0.9539 0.9561 0.9566 0.7829
It can be observed here that after the 2nd Degree Polynomial there is no significant improvement in the Model by introducing the Polynomial Features. Splitting the predictor and response sets into training and test sets In [21]: from sklearn import cross_validation as cv pred_train, pred_test, resp_train, resp_test = cv.train_test_split( pred, resp, test_size=0.5, random_state=0)
Using the LinearRegression as the prediction model and verifying the predictor. In [22]: from sklearn.linear_model import LinearRegression lr = LinearRegression() lr.fit(pred_train, resp_train) lr_scr_1 = lr.score(pred_test, resp_test) print "Linear Model Accuracy: %.2f percent"%(100*lr_scr_1) Linear Model Accuracy: 93.10 percent
Checking how other Models perform using the same predictor set. In [23]: # Using the Decision Tree as the prediction model from sklearn.tree import DecisionTreeRegressor dtr = DecisionTreeRegressor(max_depth=10) dtr.fit(pred_train, resp_train) dtr_scr_1 = dtr.score(pred_test, resp_test) print "Decision Tree Model Accuracy: %.2f percent"%(100*dtr_scr_1) Decision Tree Model Accuracy: 99.45 percent In [24]: # Using the Tree Bagging as the prediction model from sklearn.ensemble import BaggingRegressor bgr = BaggingRegressor(n_estimators=10) bgr.fit(pred_train, resp_train) bgr_scr_1 = bgr.score(pred_test, resp_test) "Bagging Model Accuracy: %.2f percent"%(100*bgr_scr_1) Out[24]: 'Bagging Model Accuracy: 99.96 percent' In [25]: # Using the Random Forest as the prediction model from sklearn.ensemble import RandomForestRegressor rfr = RandomForestRegressor(n_estimators=10) rfr.fit(pred_train, resp_train) rfr_scr_1 = rfr.score(pred_test, resp_test) "Random Forest Model Accuracy: %.2f percent"%(100*rfr_scr_1) Out[25]: 'Random Forest Model Accuracy: 99.95 percent' In [26]: # Using the KNN as the prediction model from sklearn.neighbors import KNeighborsRegressor knn = KNeighborsRegressor(n_neighbors=10, weights='uniform') knn.fit(pred_train, resp_train) knn_scr_1 = knn.score(pred_test, resp_test) print "KNN Model Accuracy: %.2f percent"%(100*knn_scr_1) KNN Model Accuracy: 98.34 percent
In [27]: # Using the Ada Boost as the prediction model from sklearn.ensemble import AdaBoostRegressor abr = AdaBoostRegressor(n_estimators=10) abr.fit(pred_train, resp_train) abr_scr_1 = abr.score(pred_test, resp_test) "Ada Boost Model Accuracy: %.2f percent"%(100*abr_scr_1) Out[27]: 'Ada Boost Model Accuracy: 78.03 percent' In [28]: # Using the Gradient Boost Regressor as the prediction model from sklearn.ensemble import GradientBoostingRegressor gbr = GradientBoostingRegressor(n_estimators=10) gbr.fit(pred_train, resp_train) gbr_scr_1 = gbr.score(pred_test, resp_test) "Gradient Boost Model Accuracy: %.2f percent"%(100*gbr_scr_1) Out[28]: 'Gradient Boost Model Accuracy: 75.28 percent'
Defining the predictor set again for sophisticated methods, as they can handle more features. In [29]: pred = data[['Customers','Promo','DayOfWeek','Assortment','StoreType','StateHoliday','SchoolHoliday', 'Promo2','CompetitionDistance','CompetitionOpen','PromoOpen']].copy() resp = data.Sales.copy() In [30]: from sklearn import cross_validation as cv pred_train, pred_test, resp_train, resp_test = cv.train_test_split( pred, resp, test_size=0.5, random_state=0)
Comparing and discussing different ensemble prediction approaches. Ref: http://scikit-learn.org/stable/modules/ensemble.html
The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.
Checking Linear Regression In [31]: from sklearn.linear_model import LinearRegression lr = LinearRegression() lr.fit(pred_train, resp_train) lr_scr_2 = lr.score(pred_test, resp_test) print "Linear Model Accuracy: %.2f percent"%(100*lr_scr_2) Linear Model Accuracy: 76.27 percent
Decision Tree In [32]: # Using the Decision Tree as the prediction model from sklearn.tree import DecisionTreeRegressor dtr = DecisionTreeRegressor(max_depth=10) dtr.fit(pred_train, resp_train) dtr_scr_2 = dtr.score(pred_test, resp_test) print "Decision Tree Model Accuracy: %.2f percent"%(100*dtr_scr_2) Decision Tree Model Accuracy: 88.15 percent
In [33]: # Using decision tree as regressor and cross validating to find an optimal depth. from sklearn.tree import DecisionTreeRegressor from sklearn.cross_validation import KFold max_score = 0 best_d = 0 for d in range(1,25): dtr_scr_2 = 0. dtr = DecisionTreeRegressor(max_depth=d) for train, test in KFold(len(pred), n_folds=5, shuffle=True): dtr.fit(pred_train, resp_train) dtr_scr_2 += dtr.score(pred_test,resp_test)/5 if dtr_scr_2 > max_score: max_score = dtr_scr_2 best_d = d print "Optimal depth of nodes for Decision Tree = %d || Model Accuracy = %.4f"%(best_d,max_score) Optimal depth of nodes for Decision Tree = 20 || Model Accuracy = 0.9346
1. Averaging Methods In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.
(a) Tree Bagging In [34]: # Using the Tree Bagging as the prediction model from sklearn.ensemble import BaggingRegressor bgr = BaggingRegressor(n_estimators=10) bgr.fit(pred_train, resp_train) bgr_scr_2 = bgr.score(pred_test, resp_test) "Bagging Model Accuracy: %.2f percent"%(100*bgr_scr_2) Out[34]: 'Bagging Model Accuracy: 96.10 percent' In [35]: # Using tree bagging as regressor and cross validating to find an optimal depth. #from sklearn.ensemble import BaggingRegressor #from sklearn.cross_validation import KFold #max_score = 0 #best_n = 0 #for n in range(1,25): # bgr_scr_2 = 0. # bgg = BaggingRegressor(n_estimators=n) # for train, test in KFold(len(pred), n_folds=5, shuffle=True): # bgg.fit(pred_train, resp_train) # bgr_scr_2 += bgg.score(pred_test,resp_test)/5 # if bgr_scr_2 > max_score: # max_score = bgr_scr_2 # best_n = n #print "Optimal depth of estimators for Tree Bagging = %d || Model Accuracy = %.2f"%(best_n,max_score)
The FOR loop could not be executed due limitations of the WU Serve, however the code should give an accurate answer. The default n_estimators seems to be performing quite well as an increase reduces the M odel accuracy.
(b) Random Forest In [36]: # Using the Random Forest as the prediction model from sklearn.ensemble import RandomForestRegressor rfr = RandomForestRegressor(n_estimators=10) rfr.fit(pred_train, resp_train) rfr_scr_2 = rfr.score(pred_test, resp_test) "Random Forest Model Accuracy: %.2f percent"%(100*rfr_scr_2)
Out[36]: 'Random Forest Model Accuracy: 96.09 percent' In [37]: # Using tree bagging as regressor and cross validating to find an optimal depth. #from sklearn.ensemble import RandomForestRegressor #from sklearn.cross_validation import KFold #max_score = 0 #best_n = 0 #for n in range(1,25): # rfr_scr_2 = 0. # rfr = RandomForestRegressor(n_estimators=n) # for train, test in KFold(len(pred), n_folds=5, shuffle=True): # rfr.fit(pred_train, resp_train) # rfr_scr_2 += rfr.score(pred_test,resp_test)/5 # if rfr_scr_2 > max_score: # max_score = rfr_scr_2 # best_n = n #print "Optimal depth of estimators for Tree Bagging = %d || Model Accuracy = %.2f"%(best_n,max_score)
The FOR loop could not be executed due limitations of the WU Serve, however the code should give an accurate answer. The default n_estimators seems to be performing quite well as an increase reduces the M odel accuracy.
Visually plotting the oob accuracy for tree sizes 1 to 20 In [43]: errors = [] num_trees = [3,4,5,6,7,8,9,10,15,20] for i in num_trees: bgg = BaggingRegressor(n_estimators=i, oob_score=True) bgg.fit(pred_train, resp_train) rfr = RandomForestRegressor(n_estimators=i, oob_score=True) rfr.fit(pred_train, resp_train) errors.append([bgg.oob_score_,rfr.oob_score_]) errors_df = pd.DataFrame(errors, columns=['Tree Bagging','Random Forest'], index=num_trees) errors_df.plot(ylim=[0,1], figsize=(16,5)) Out[43]:
When using ensemble methods base upon bagging, i.e. generating new training sets using sampling with replacement, part of the training set remains unused. For each classifier in the ensemble, a different part of the training set is left out. This left out portion can be used to estimate the generalization error without having to rely on a separate validation set. This estimate comes “for free” as no additional data is needed and can be used for model selection. Ref: http://scikit-learn.org/stable/modules/grid_search.html#out-of-bag-estimates
OOB Accuracy acts like the cross validation in case of ensemble methods. We can see here after 10 n_estimators there is very less M odel improvement and after 15 it becomes almost irrelevant. There maximum n_estimators for these 2 methods is 15 after which tuning the parameters will actually not pay off.
2. Boosting Methods In boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.
(a) Ada Boosting Model In [44]: # Using the Ada Boost Regressor as the prediction model from sklearn.ensemble import AdaBoostRegressor abr = AdaBoostRegressor(n_estimators=10) abr.fit(pred_train, resp_train) abr_scr_2 = abr.score(pred_test, resp_test) "Ada Boost Model Accuracy: %.2f percent"%(100*abr_scr_2) Out[44]: 'Ada Boost Model Accuracy: 73.02 percent'
(b) Gradient Boosting Model In [45]: # Using the Gradient Boost Regressor as the prediction model from sklearn.ensemble import GradientBoostingRegressor gbr = GradientBoostingRegressor(n_estimators=10) gbr.fit(pred_train, resp_train) gbr_scr_2 = gbr.score(pred_test, resp_test) "Gradient Boost Model Accuracy: %.2f percent"%(100*gbr_scr_2) Out[45]: 'Gradient Boost Model Accuracy: 65.18 percent'
Summary of Methods In [46]: names = ['Linear Model |','Decision Tree |','Bagging Tree |','Random Forests |','Ada Boosting |','Gradient B oosting'] score1 = [lr_scr_1, dtr_scr_1, bgr_scr_1, rfr_scr_1, abr_scr_1, gbr_scr_1] score2 = [lr_scr_2, dtr_scr_2, bgr_scr_2, rfr_scr_2, abr_scr_2, gbr_scr_2] pd.DataFrame([score1,score2], columns=names, index=['Predictor Set 1','Predictor Set 2']) Out[46]: Linear M odel | Decision Tree | Bagging Tree | Random Forests | Ada Boosting | Gradient Boosting Predictor Set 1 0.931023
0.994541
0.999599
0.999518
0.780348
0.752789
Predictor Set 2 0.762721
0.932484
0.960985
0.960883
0.730174
0.651788
Author: Kushal Agrawal Thank you :)