Project "Rossmann Store Sales"

Kaggle Competition Script Score: 0.18613
Rank: https://www.kaggle.com/kushal1412/results

Brief Summary of the Data

We have historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set.

Files
train.csv - historical data including Sales
test.csv - historical data excluding Sales
store.csv - supplemental information about the stores

Data fields
Id - an Id that represents a (Store, Date) duple within the test set
Store - a unique Id for each store
Sales - the turnover for any given day (this is what you are predicting)
Customers - the number of customers on a given day
Open - an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
StoreType - differentiates between 4 different store models: a, b, c, d
Assortment - describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
Promo - indicates whether a store is running a promo on that day
Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store.

Data Cleaning and Transforming

Importing the basic libraries and the data set into pandas.

In [1]:
import pandas as pd
import statsmodels.api as sm
import matplotlib as plt
import numpy as np
from sklearn import cross_validation as cv

test = pd.read_csv('test.csv', low_memory=False, parse_dates=['Date'])
train = pd.read_csv('train.csv', low_memory=False, parse_dates=['Date'])
store = pd.read_csv('store.csv', low_memory=False)

Checking the statistics of the test set.

In [2]:
test.describe()

Out[2]:
                 Id         Store     DayOfWeek          Open         Promo  SchoolHoliday
count  41088.000000  41088.000000  41088.000000  41077.000000  41088.000000   41088.000000
mean   20544.500000    555.899533      3.979167      0.854322      0.395833       0.443487
std    11861.228267    320.274496      2.015481      0.352787      0.489035       0.496802
min        1.000000      1.000000      1.000000      0.000000      0.000000       0.000000
25%    10272.750000    279.750000      2.000000      1.000000      0.000000       0.000000
50%    20544.500000    553.500000      4.000000      1.000000      0.000000       0.000000
75%    30816.250000    832.250000      6.000000      1.000000      1.000000       1.000000
max    41088.000000   1115.000000      7.000000      1.000000      1.000000       1.000000

Replacing the 11 missing (NaN) values in Open with 1, assuming the stores were open on those days.

In [3]:
test.loc[test.Open.isnull(), 'Open'] = 1
test.describe()

Out[3]:
                 Id         Store     DayOfWeek          Open         Promo  SchoolHoliday
count  41088.000000  41088.000000  41088.000000  41088.000000  41088.000000   41088.000000
mean   20544.500000    555.899533      3.979167      0.854361      0.395833       0.443487
std    11861.228267    320.274496      2.015481      0.352748      0.489035       0.496802
min        1.000000      1.000000      1.000000      0.000000      0.000000       0.000000
25%    10272.750000    279.750000      2.000000      1.000000      0.000000       0.000000
50%    20544.500000    553.500000      4.000000      1.000000      0.000000       0.000000
75%    30816.250000    832.250000      6.000000      1.000000      1.000000       1.000000
max    41088.000000   1115.000000      7.000000      1.000000      1.000000       1.000000
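As a small illustration of the imputation step above, the following sketch applies the same rule to a made-up four-row frame. The frame toy_test and its values are assumptions for illustration, not rows from test.csv; fillna is used here and has the same effect as the .loc assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for test.csv: two rows have no Open flag.
toy_test = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Open": [1.0, np.nan, 0.0, np.nan],
})

# Count the gaps first so the size of the assumption is visible.
n_missing = int(toy_test["Open"].isnull().sum())

# Same rule as above: an unknown flag is treated as "open".
toy_test["Open"] = toy_test["Open"].fillna(1).astype(int)
open_flags = toy_test["Open"].tolist()
print(n_missing, open_flags)
```

Counting the missing rows before filling them makes it easy to judge how much the "assume open" rule can affect the result.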

Adding Year, Month, and Week columns and checking the statistics of the training set.

In [4]:
train['Year'] = train['Date'].dt.year
train['Month'] = train['Date'].dt.month
train['Week'] = train['Date'].dt.week
train.describe()

Out[4]:
                 Store        DayOfWeek            Sales        Customers             Open            Promo    SchoolHoliday         Year
count   1017209.000000   1017209.000000   1017209.000000   1017209.000000   1017209.000000   1017209.000000   1017209.000000  1017209.000
mean        558.429727         3.998341      5773.818972       633.145946         0.830107         0.381515         0.178647  2013.832292
std         321.908651         1.997391      3849.926175       464.411734         0.375539         0.485759         0.383056     0.777396
min           1.000000         1.000000         0.000000         0.000000         0.000000         0.000000         0.000000  2013.000000
25%         280.000000         2.000000      3727.000000       405.000000         1.000000         0.000000         0.000000  2013.000000
50%         558.000000         4.000000      5744.000000       609.000000         1.000000         0.000000         0.000000  2014.000000
75%         838.000000         6.000000      7856.000000       837.000000         1.000000         1.000000         0.000000  2014.000000
max        1115.000000         7.000000     41551.000000      7388.000000         1.000000         1.000000         1.000000  2015.000000
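The .dt.week accessor used above is not available in current pandas releases. A minimal sketch of the same date features with .dt.isocalendar(), assuming a recent pandas and a made-up three-row frame (toy_train is hypothetical):

```python
import pandas as pd

# Hypothetical dates standing in for the Date column of train.csv.
toy_train = pd.DataFrame(
    {"Date": pd.to_datetime(["2015-07-31", "2013-01-01", "2014-12-29"])}
)

toy_train["Year"] = toy_train["Date"].dt.year
toy_train["Month"] = toy_train["Date"].dt.month
# isocalendar() returns a frame with year, week and day columns (ISO 8601).
toy_train["Week"] = toy_train["Date"].dt.isocalendar().week.astype(int)

weeks = toy_train["Week"].tolist()
print(toy_train)
```

Note that 2014-12-29 falls in ISO week 1 of 2015 while its calendar year is 2014, so Year and Week can disagree around the turn of the year.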

Removing rows where Sales is zero and checking the training set.

In [5]:
train = train.loc[train.Sales > 0]
train.describe()

Out[5]:
               Store      DayOfWeek          Sales      Customers    Open          Promo  SchoolHoliday           Year        Month
count  844338.000000  844338.000000  844338.000000  844338.000000  844338  844338.000000  844338.000000  844338.000000  844338.0000
mean      558.421374       3.520350    6955.959134     762.777166       1       0.446356       0.193578    2013.831945     5.845774
std       321.730861       1.723712    3103.815515     401.194153       0       0.497114       0.395102       0.777271     3.323959
min         1.000000       1.000000      46.000000       8.000000       1       0.000000       0.000000    2013.000000     1.000000
25%       280.000000       2.000000    4859.000000     519.000000       1       0.000000       0.000000    2013.000000     3.000000
50%       558.000000       3.000000    6369.000000     676.000000       1       0.000000       0.000000    2014.000000     6.000000
75%       837.000000       5.000000    8360.000000     893.000000       1       1.000000       0.000000    2014.000000     8.000000
max      1115.000000       7.000000   41551.000000    7388.000000       1       1.000000       1.000000    2015.000000    12.000000

Checking the statistics of the store set.

In [6]:
store.describe()

Out[6]:
            Store  CompetitionDistance  CompetitionOpenSinceMonth  CompetitionOpenSinceYear       Promo2  Promo2SinceWeek
count  1115.00000          1112.000000                 761.000000                761.000000  1115.000000       571.000000
mean    558.00000          5404.901079                   7.224704               2008.668857     0.512108        23.595447
std     322.01708          7663.174720                   3.212348                  6.195983     0.500078        14.141984
min       1.00000            20.000000                   1.000000               1900.000000     0.000000         1.000000
25%     279.50000           717.500000                   4.000000               2006.000000     0.000000        13.000000
50%     558.00000          2325.000000                   8.000000               2010.000000     1.000000        22.000000
75%     836.50000          6882.500000                  10.000000               2013.000000     1.000000        37.000000
max    1115.00000         75860.000000                  12.000000               2015.000000     1.000000        50.000000

Replacing the null values in each column of the store set with that column's mean (rounded to the nearest integer) and checking the store statistics again.

In [7]:
store.loc[store.CompetitionDistance.isnull(), 'CompetitionDistance'] = int(round(store.CompetitionDistance.mean()))
store.loc[store.CompetitionOpenSinceMonth.isnull(), 'CompetitionOpenSinceMonth'] = int(round(store.CompetitionOpenSinceMonth.mean()))
store.loc[store.CompetitionOpenSinceYear.isnull(), 'CompetitionOpenSinceYear'] = int(round(store.CompetitionOpenSinceYear.mean()))
store.loc[store.Promo2SinceWeek.isnull(), 'Promo2SinceWeek'] = int(round(store.Promo2SinceWeek.mean()))
store.loc[store.Promo2SinceYear.isnull(), 'Promo2SinceYear'] = int(round(store.Promo2SinceYear.mean()))
store.describe()

Out[7]:
            Store  CompetitionDistance  CompetitionOpenSinceMonth  CompetitionOpenSinceYear       Promo2  Promo2SinceWeek
count  1115.00000          1115.000000                1115.000000               1115.000000  1115.000000      1115.000000
mean    558.00000          5404.901345                   7.153363               2008.773991     0.512108        23.792825
std     322.01708          7652.849306                   2.655365                  5.120018     0.500078        10.117938
min       1.00000            20.000000                   1.000000               1900.000000     0.000000         1.000000
25%     279.50000           720.000000                   6.000000               2008.000000     0.000000        22.000000
50%     558.00000          2330.000000                   7.000000               2009.000000     1.000000        24.000000
75%     836.50000          6875.000000                   9.000000               2011.000000     1.000000        24.000000
max    1115.00000         75860.000000                  12.000000               2015.000000     1.000000        50.000000
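The five near-identical assignments above can be written as one loop. This is a sketch on a made-up frame (toy_store and its values are assumptions), using fillna with the rounded column mean:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for store.csv with one gap per column.
toy_store = pd.DataFrame({
    "CompetitionDistance": [100.0, np.nan, 304.0],
    "Promo2SinceWeek": [10.0, np.nan, 16.0],
})

for col in ["CompetitionDistance", "Promo2SinceWeek"]:
    # mean() skips NaN by default, so the fill value uses known rows only.
    fill_value = int(round(toy_store[col].mean()))
    toy_store[col] = toy_store[col].fillna(fill_value)

remaining_nulls = int(toy_store.isnull().sum().sum())
print(toy_store)
```

Whether the mean is a sensible fill value depends on the column. For fields such as Promo2SinceWeek, a missing value may simply mean that the store does not take part in Promo2, in which case a separate indicator could be worth considering.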

Merging the train and store tables, replacing the string codes in the categorical columns with integers so they can be used in computations, and then checking the statistics of the merged table.

In [8]:
data = train.merge(store, on='Store')
data = data.drop(['Open'], axis=1)
data.StoreType.replace({'0':0,'a':1,'b':2,'c':3,'d':4}, inplace=True)
data.Assortment.replace({'0':0,'a':1,'b':2,'c':3,'d':4}, inplace=True)
data.StateHoliday.replace({'0':0,'a':1,'b':2,'c':3,'d':4}, inplace=True)
data.describe()

Out[8]:
               Store      DayOfWeek          Sales      Customers          Promo   StateHoliday  SchoolHoliday           Year
count  844338.000000  844338.000000  844338.000000  844338.000000  844338.000000  844338.000000  844338.000000  844338.000000
mean      558.421374       3.520350    6955.959134     762.777166       0.446356       0.001418       0.193578    2013.831945
std       321.730861       1.723712    3103.815515     401.194153       0.497114       0.047578       0.395102       0.777271
min         1.000000       1.000000      46.000000       8.000000       0.000000       0.000000       0.000000    2013.000000
25%       280.000000       2.000000    4859.000000     519.000000       0.000000       0.000000       0.000000    2013.000000
50%       558.000000       3.000000    6369.000000     676.000000       0.000000       0.000000       0.000000    2014.000000
75%       837.000000       5.000000    8360.000000     893.000000       1.000000       0.000000       0.000000    2014.000000
max      1115.000000       7.000000   41551.000000    7388.000000       1.000000       3.000000       1.000000    2015.000000
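To make the merge-and-encode step concrete, here is a minimal sketch on two made-up frames (toy_train, toy_store and their values are assumptions). It uses map instead of an in-place replace, which is a substitution on my part rather than the notebook's method:

```python
import pandas as pd

# Hypothetical miniature versions of train.csv and store.csv.
toy_train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Sales": [5000, 5200, 7000],
    "StateHoliday": ["0", "a", "0"],
})
toy_store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["a", "d"],
    "Assortment": ["a", "c"],
})

# Each sales row picks up the attributes of its store.
toy_data = toy_train.merge(toy_store, on="Store")

# Same letter-to-integer coding as in the cell above.
codes = {"0": 0, "a": 1, "b": 2, "c": 3, "d": 4}
for col in ["StoreType", "Assortment", "StateHoliday"]:
    toy_data[col] = toy_data[col].map(codes)

print(toy_data)
```

Unlike replace, map yields NaN for any value that is missing from the dictionary, so unexpected codes show up as gaps instead of passing through unchanged.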

Initial Feature Engineering:
Calculating the "competition open since" time in months.
Calculating the "Promo open since" time in months.

In [9]:
data['CompetitionOpen'] = 12 * (data.Year - data.CompetitionOpenSinceYear) + \
    (data.Month - data.CompetitionOpenSinceMonth)
data['CompetitionOpen'] = data.CompetitionOpen.apply(lambda x: x if x > 0 else 0)
data.drop(['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear'], axis=1, inplace=True)

data['PromoOpen'] = 12 * (data.Year - data.Promo2SinceYear) + \
    (data.Week - data.Promo2SinceWeek) / float(4)
data['PromoOpen'] = data.PromoOpen.apply(lambda x: x if x > 0 else 0)
data.drop(['Promo2SinceYear', 'Promo2SinceWeek', 'PromoInterval'], axis=1, inplace=True)
data.describe()

Out[9]:
               Store      DayOfWeek          Sales      Customers          Promo   StateHoliday  SchoolHoliday           Year
count  844338.000000  844338.000000  844338.000000  844338.000000  844338.000000  844338.000000  844338.000000  844338.000000
mean      558.421374       3.520350    6955.959134     762.777166       0.446356       0.001418       0.193578    2013.831945
std       321.730861       1.723712    3103.815515     401.194153       0.497114       0.047578       0.395102       0.777271
min         1.000000       1.000000      46.000000       8.000000       0.000000       0.000000       0.000000    2013.000000
25%       280.000000       2.000000    4859.000000     519.000000       0.000000       0.000000       0.000000    2013.000000
50%       558.000000       3.000000    6369.000000     676.000000       0.000000       0.000000       0.000000    2014.000000
75%       837.000000       5.000000    8360.000000     893.000000       1.000000       0.000000       0.000000    2014.000000
max      1115.000000       7.000000   41551.000000    7388.000000       1.000000       3.000000       1.000000    2015.000000
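The month arithmetic above can be checked by hand on a few rows. The sketch below uses made-up values (the frame toy is an assumption) and replaces the apply/lambda with clip(lower=0), which floors negative durations at zero in the same way:

```python
import pandas as pd

# Hypothetical rows: sale date (Year, Month) and competitor opening date.
toy = pd.DataFrame({
    "Year": [2015, 2015, 2013],
    "Month": [7, 7, 1],
    "CompetitionOpenSinceYear": [2014, 2015, 2014],
    "CompetitionOpenSinceMonth": [1, 9, 6],
})

# Whole years converted to months, plus the month difference.
months = 12 * (toy["Year"] - toy["CompetitionOpenSinceYear"]) + (
    toy["Month"] - toy["CompetitionOpenSinceMonth"]
)

# A competitor that opens after the sale date has been open for 0 months.
toy["CompetitionOpen"] = months.clip(lower=0)
competition_open = toy["CompetitionOpen"].tolist()
print(competition_open)
```

The first row gives 12 * 1 + 6 = 18 months; the other two rows are negative before clipping (-2 and -17) and therefore end up as 0.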

Exploratory Analysis

Checking the correlation between the features in the data with respect to Sales.

In [10]:
data.corr()['Sales']

Out[10]:
Store                  0.007723
DayOfWeek             -0.178753
Sales                  1.000000
Customers              0.823552
Promo                  0.368199
StateHoliday           0.020106
SchoolHoliday          0.038635
Year                   0.036151
Month                  0.073589
Week                   0.074463
StoreType             -0.016211
Assortment             0.109015
CompetitionDistance   -0.036401
Promo2                -0.127556
CompetitionOpen       -0.001543
PromoOpen              0.033863
Name: Sales, dtype: float64

Plotting histograms to see the skewness in the data.

In [11]:
import matplotlib.pyplot as pyplot
data['Sales'].hist(bins=25, figsize=(16,5))
pyplot.title('Sales Histogram')

Out[11]: [Figure: Sales Histogram]

In [12]:
data['Customers'].hist(bins=25, figsize=(16,5))
pyplot.title('Customers Histogram')

Out[12]: [Figure: Customers Histogram]

In [13]:
data['Promo'].hist(bins=4, figsize=(16,5))
pyplot.title('Promo Histogram')

Out[13]: [Figure: Promo Histogram]

In [14]:
data['Assortment'].hist(bins=12, figsize=(16,5))
pyplot.title('Assortment Histogram')

Out[14]: [Figure: Assortment Histogram]

In [15]:
data['StoreType'].hist(bins=16, figsize=(16,5))
pyplot.title('StoreType Histogram')

Out[15]: [Figure: StoreType Histogram]

Visualizing the density distribution of Customers above and below its median.

In [16]:
med = data.Customers.median()
print('Median of Customers = %.2f' % (med))
data.query('Customers = @med')['Customers'].plot(kind='kde', color='blue', figsize=(16,5))

Median of Customers = 676.00

Out[16]: [Figure: density plot of Customers]

Checking the scatter plot of Sales against Customers.

In [17]:
data.plot(kind='scatter', x='Customers', y='Sales', figsize=(12,6))

Out[17]: [Figure: scatter plot of Sales against Customers]

/usr/local/lib/python2.7/dist-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

Removing the outliers (unlikely events).

In [18]:
data = data.query('Customers < 6000 and Sales < 40000')
data.plot(kind='scatter', x='Customers', y='Sales', figsize=(12,6))

Out[18]: [Figure: scatter plot of Sales against Customers after removing outliers]

Splitting the data into a predictor set and a response set and creating a basic prediction model using OLS linear regression. Only the 2 features with the highest correlation with Sales (Customers and Promo) go into the predictor set.

In [19]:
pred = data[['Customers','Promo']].copy()
# Adding a constant to the predictor set.
pred = sm.add_constant(pred)
# Taking sales as the response set.
resp = data.Sales.copy()
# Fitting the OLS model
ols = sm.OLS(resp, pred)
ols_fit = ols.fit()
# Printing the summary
print("AIC: %.2f || R2: %.2f" % (ols_fit.aic, ols_fit.rsquared))
ols_fit.summary()

AIC: 14876321.76 || R2: 0.73

Out[19]:
OLS Regression Results
Dep. Variable:     Sales             R-squared:           0.727
Model:             OLS               Adj. R-squared:      0.727
Method:            Least Squares     F-statistic:         1.126e+06
Date:              Tue, 01 Dec 2015  Prob (F-statistic):  0.00
Time:              13:15:57          Log-Likelihood:      -7.4382e+06
No. Observations:  844336            AIC:                 1.488e+07
Df Residuals:      844333            BIC:                 1.488e+07
Df Model:          2
Covariance Type:   nonrobust

                coef  std err         t  P>|t|  [95.0% Conf. Int.]
const      1710.9245    3.916   436.954  0.000  1703.250  1718.599
Customers     6.0537    0.004  1353.640  0.000     6.045     6.062
Promo      1405.6589    3.609   389.527  0.000  1398.586  1412.732

Omnibus:         132520.272  Durbin-Watson:     0.322
Prob(Omnibus):   0.000       Jarque-Bera (JB):  1827090.441
Skew:            0.310       Prob(JB):          0.00
Kurtosis:        10.180      Cond. No.          2.07e+03
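For readers who want to see what the OLS fit computes, the following sketch solves the same kind of least-squares problem with NumPy on synthetic data. The data-generating coefficients (1700, 6, 1400) and the noise level are assumptions chosen loosely after the table above, not values taken from the data set.

```python
import numpy as np

# Synthetic stand-ins for Customers, Promo and Sales.
rng = np.random.default_rng(0)
customers = rng.integers(100, 1500, size=500).astype(float)
promo = rng.integers(0, 2, size=500).astype(float)
sales = 1700.0 + 6.0 * customers + 1400.0 * promo + rng.normal(0.0, 300.0, size=500)

# Design matrix with an explicit constant column, as sm.add_constant does.
X = np.column_stack([np.ones_like(customers), customers, promo])

# Least-squares coefficients: minimise ||X @ coef - sales||^2.
coef, _, _, _ = np.linalg.lstsq(X, sales, rcond=None)

# R-squared = 1 - residual sum of squares / total sum of squares.
fitted = X @ coef
ss_res = float(np.sum((sales - fitted) ** 2))
ss_tot = float(np.sum((sales - sales.mean()) ** 2))
r_squared = 1.0 - ss_res / ss_tot
print(coef, r_squared)
```

The coefficient on customers reads as the expected change in Sales per additional customer with Promo held fixed, which is how the 6.0537 in the summary above can be interpreted.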

Feature Engineering: Adding new features to the defined predictor set.
Sales per Customer (amount of Sales on that day per Customer on that day)
Sales per Day of the Week
Sales per Store Type (amount of Sales in a particular Store Type on that day)

In [20]:
# New feature Sales per Customer improves the linear model by approximately 18 percent
pred['SalesPerCust'] = data.Sales/data.Customers
# New feature Sales per Day of Week improves the linear model by approximately 1.1 percent
pred['SalesPerDayOfWeek'] = data.Sales/data.DayOfWeek
# New feature Sales per StoreType improves the linear model by approximately 1.1 percent
pred['SalesPerType'] = data.Sales/data.StoreType
# Fitting the OLS model
ols = sm.OLS(resp, pred)
ols_fit = ols.fit()
# Checking the results after feature engineering
print("AIC: %.2f || R2: %.2f" % (ols_fit.aic, ols_fit.rsquared))
ols_fit.summary()

AIC: 13718133.13 || R2: 0.93

Out[20]:
OLS Regression Results
Dep. Variable:     Sales             R-squared:           0.931
Model:             OLS               Adj. R-squared:      0.931
Method:            Least Squares     F-statistic:         2.273e+06
Date:              Tue, 01 Dec 2015  Prob (F-statistic):  0.00
Time:              13:15:57          Log-Likelihood:      -6.8591e+06
No. Observations:  844336            AIC:                 1.372e+07
Df Residuals:      844330            BIC:                 1.372e+07
Df Model:          5
Covariance Type:   nonrobust

                         coef  std err         t  P>|t|  [95.0% Conf. Int.]
const              -4974.2880    5.412  -919.178  0.000  -4984.895  -4963.681
Customers              6.4351    0.003  1967.010  0.000      6.429      6.442
Promo                161.5708    1.986    81.372  0.000    157.679    165.462
SalesPerCust         639.0728    0.484  1320.911  0.000    638.125    640.021
SalesPerDayOfWeek      0.1072    0.000   290.974  0.000      0.106      0.108
SalesPerType           0.1201    0.000   356.952  0.000      0.119      0.121

Omnibus:         419331.860  Durbin-Watson:     0.342
Prob(Omnibus):   0.000       Jarque-Bera (JB):  19045818.178
Skew:            -1.682      Prob(JB):          0.00
Kurtosis:        26.023      Cond. No.          4.19e+04
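One point worth checking with these features is that all three are computed from Sales, the very column being predicted. The sketch below uses synthetic numbers (customers and sales are made up) to show that SalesPerCust together with Customers reproduces Sales exactly, which may explain part of the jump in R-squared. Since test.csv excludes Sales, such features cannot be computed for the rows that have to be forecast.

```python
import numpy as np

# Synthetic data: sales is the target and is unknown at prediction time.
rng = np.random.default_rng(1)
customers = rng.integers(50, 1000, size=200).astype(float)
sales = customers * rng.uniform(5.0, 12.0, size=200)

# A feature built from the target ...
sales_per_cust = sales / customers

# ... combined with an ordinary feature gives the target back.
reconstructed = sales_per_cust * customers
leak_is_exact = bool(np.allclose(reconstructed, sales))
print(leak_is_exact)
```

A common alternative, offered here only as a suggestion, is to aggregate such ratios per store over the training period (for example a store's historical average sales per customer) so that the same value can be looked up for test rows.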

Trying to further improve the model by adding polynomial features of Customers.

In [21]:
for p in range(1,6):
    pred['Customers^%d' % p] = pred.Customers.pow(p)
    ols = sm.OLS(resp, pred)
    ols_fit = ols.fit()
    print("Polynomial %d || AIC: %.2f || R2: %.4f" % (p, ols_fit.aic, ols_fit.rsquared))

Polynomial 1 || AIC: 13718135.32 || R2: 0.9308
Polynomial 2 || AIC: 13376105.34 || R2: 0.9539
Polynomial 3 || AIC: 13333519.58 || R2: 0.9561
Polynomial 4 || AIC: 13325533.80 || R2: 0.9566
Polynomial 5 || AIC: 14683939.64 || R2: 0.7829

It can be observed here that beyond the 2nd-degree polynomial the polynomial features bring no significant improvement: R2 rises from 0.9308 to 0.9539 at degree 2, reaches only 0.9566 at degree 4, and drops to 0.7829 at degree 5.

Splitting the predictor and response sets into training and test sets.

In [21]:
from sklearn import cross_validation as cv
pred_train, pred_test, resp_train, resp_test = cv.train_test_split(
    pred, resp, test_size=0.5, random_state=0)
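The sklearn.cross_validation module used above was removed from later scikit-learn releases; train_test_split now lives in sklearn.model_selection. A minimal sketch of the same 50/50 split on made-up arrays (X and y are placeholders, not the notebook's pred and resp):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder predictors (10 rows, 2 columns) and response.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same arguments as in the cell above: half the rows go to each side,
# and random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
print(X_train.shape, X_test.shape)
```

The split is random across rows. For time-ordered data such as daily sales, holding out the most recent weeks instead is a common alternative, since it mirrors forecasting future dates.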

Using LinearRegression as the prediction model and verifying the predictor.

In [22]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(pred_train, resp_train)
lr_scr_1 = lr.score(pred_test, resp_test)
print("Linear Model Accuracy: %.2f percent" % (100*lr_scr_1))

Linear Model Accuracy: 93.10 percent
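The value printed as "accuracy" comes from the regressor's score method, which for scikit-learn regressors returns the coefficient of determination R2, not a share of correct predictions. A plain-Python sketch of that quantity, with made-up numbers:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(actual, actual)                # every prediction exact
mean_only = r_squared(actual, [6.0] * 4)           # always predict the mean
worse = r_squared(actual, [9.0, 7.0, 5.0, 3.0])    # predictions reversed
print(perfect, mean_only, worse)
```

A score of 1 means a perfect fit, 0 means no better than always predicting the mean, and negative values are possible, so "93.10 percent" is best read as "93.10 percent of the variance explained".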

Checking how other models perform using the same predictor set.

In [23]:
# Using the Decision Tree as the prediction model
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(max_depth=10)
dtr.fit(pred_train, resp_train)
dtr_scr_1 = dtr.score(pred_test, resp_test)
print("Decision Tree Model Accuracy: %.2f percent" % (100*dtr_scr_1))

Decision Tree Model Accuracy: 99.45 percent

In [24]:
# Using the Tree Bagging as the prediction model
from sklearn.ensemble import BaggingRegressor
bgr = BaggingRegressor(n_estimators=10)
bgr.fit(pred_train, resp_train)
bgr_scr_1 = bgr.score(pred_test, resp_test)
"Bagging Model Accuracy: %.2f percent" % (100*bgr_scr_1)

Out[24]: 'Bagging Model Accuracy: 99.96 percent'

In [25]:
# Using the Random Forest as the prediction model
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=10)
rfr.fit(pred_train, resp_train)
rfr_scr_1 = rfr.score(pred_test, resp_test)
"Random Forest Model Accuracy: %.2f percent" % (100*rfr_scr_1)

Out[25]: 'Random Forest Model Accuracy: 99.95 percent'

In [26]:
# Using the KNN as the prediction model
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=10, weights='uniform')
knn.fit(pred_train, resp_train)
knn_scr_1 = knn.score(pred_test, resp_test)
print("KNN Model Accuracy: %.2f percent" % (100*knn_scr_1))

KNN Model Accuracy: 98.34 percent

In [27]:
# Using the Ada Boost as the prediction model
from sklearn.ensemble import AdaBoostRegressor
abr = AdaBoostRegressor(n_estimators=10)
abr.fit(pred_train, resp_train)
abr_scr_1 = abr.score(pred_test, resp_test)
"Ada Boost Model Accuracy: %.2f percent" % (100*abr_scr_1)

Out[27]: 'Ada Boost Model Accuracy: 78.03 percent'

In [28]:
# Using the Gradient Boost Regressor as the prediction model
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=10)
gbr.fit(pred_train, resp_train)
gbr_scr_1 = gbr.score(pred_test, resp_test)
"Gradient Boost Model Accuracy: %.2f percent" % (100*gbr_scr_1)

Out[28]: 'Gradient Boost Model Accuracy: 75.28 percent'

Defining the predictor set again for the more sophisticated methods, as they can handle more features.

In [29]:
pred = data[['Customers','Promo','DayOfWeek','Assortment','StoreType','StateHoliday','SchoolHoliday',
             'Promo2','CompetitionDistance','CompetitionOpen','PromoOpen']].copy()
resp = data.Sales.copy()

In [30]:
from sklearn import cross_validation as cv
pred_train, pred_test, resp_train, resp_test = cv.train_test_split(
    pred, resp, test_size=0.5, random_state=0)

Comparing and discussing different ensemble prediction approaches. Ref: http://scikit-learn.org/stable/modules/ensemble.html

The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

Checking Linear Regression

In [31]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(pred_train, resp_train)
lr_scr_2 = lr.score(pred_test, resp_test)
print("Linear Model Accuracy: %.2f percent" % (100*lr_scr_2))

Linear Model Accuracy: 76.27 percent

Decision Tree

In [32]:
# Using the Decision Tree as the prediction model
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(max_depth=10)
dtr.fit(pred_train, resp_train)
dtr_scr_2 = dtr.score(pred_test, resp_test)
print("Decision Tree Model Accuracy: %.2f percent" % (100*dtr_scr_2))

Decision Tree Model Accuracy: 88.15 percent

In [33]:
# Using decision tree as regressor and cross validating to find an optimal depth.
from sklearn.tree import DecisionTreeRegressor
from sklearn.cross_validation import KFold
max_score = 0
best_d = 0
for d in range(1,25):
    dtr_scr_2 = 0.
    dtr = DecisionTreeRegressor(max_depth=d)
    # Note: the fold indices (train, test) are not used below, so each of the
    # five passes refits on the same pred_train/pred_test split. The loop
    # variables also overwrite the train and test DataFrames loaded earlier.
    for train, test in KFold(len(pred), n_folds=5, shuffle=True):
        dtr.fit(pred_train, resp_train)
        dtr_scr_2 += dtr.score(pred_test, resp_test)/5
    if dtr_scr_2 > max_score:
        max_score = dtr_scr_2
        best_d = d
print("Optimal depth of nodes for Decision Tree = %d || Model Accuracy = %.4f" % (best_d, max_score))

Optimal depth of nodes for Decision Tree = 20 || Model Accuracy = 0.9346
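The loop above iterates over KFold but fits and scores on the fixed pred_train/pred_test split each time, so it averages five identical runs instead of cross validating. A minimal sketch of a depth search that does use the folds, written against a current scikit-learn (sklearn.model_selection) and synthetic data; X, y, the depth range and the seeds are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for pred / resp.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(300, 2))
y = np.sin(X[:, 0]) * 5.0 + X[:, 1] + rng.normal(0.0, 0.3, size=300)

# One fixed set of 5 shuffled folds, reused for every depth.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for depth in range(1, 11):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # cross_val_score refits the model on each training fold and
    # scores it (R2 for regressors) on the matching held-out fold.
    scores = cross_val_score(model, X, y, cv=folds)
    mean_scores[depth] = float(scores.mean())

best_depth = max(mean_scores, key=mean_scores.get)
print(best_depth, mean_scores[best_depth])
```

Reusing the same folds for every depth keeps the comparison between depths fair, because each candidate is scored on identical held-out rows.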

1. Averaging Methods

In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimators because its variance is reduced.

(a) Tree Bagging

In [34]:
# Using the Tree Bagging as the prediction model
from sklearn.ensemble import BaggingRegressor
bgr = BaggingRegressor(n_estimators=10)
bgr.fit(pred_train, resp_train)
bgr_scr_2 = bgr.score(pred_test, resp_test)
"Bagging Model Accuracy: %.2f percent" % (100*bgr_scr_2)

Out[34]: 'Bagging Model Accuracy: 96.10 percent'

In [35]:
# Using tree bagging as regressor and cross validating to find an optimal number of estimators.
#from sklearn.ensemble import BaggingRegressor
#from sklearn.cross_validation import KFold
#max_score = 0
#best_n = 0
#for n in range(1,25):
#    bgr_scr_2 = 0.
#    bgg = BaggingRegressor(n_estimators=n)
#    for train, test in KFold(len(pred), n_folds=5, shuffle=True):
#        bgg.fit(pred_train, resp_train)
#        bgr_scr_2 += bgg.score(pred_test,resp_test)/5
#    if bgr_scr_2 > max_score:
#        max_score = bgr_scr_2
#        best_n = n
#print("Optimal number of estimators for Tree Bagging = %d || Model Accuracy = %.2f" % (best_n, max_score))

The for loop could not be executed due to limitations of the WU server; however, the code should give an accurate answer. The default n_estimators seems to perform quite well, as increasing it reduces the model accuracy.

(b) Random Forest

In [36]:
# Using the Random Forest as the prediction model
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=10)
rfr.fit(pred_train, resp_train)
rfr_scr_2 = rfr.score(pred_test, resp_test)
"Random Forest Model Accuracy: %.2f percent" % (100*rfr_scr_2)

Out[36]: 'Random Forest Model Accuracy: 96.09 percent'

In [37]:
# Using random forest as regressor and cross validating to find an optimal number of estimators.
#from sklearn.ensemble import RandomForestRegressor
#from sklearn.cross_validation import KFold
#max_score = 0
#best_n = 0
#for n in range(1,25):
#    rfr_scr_2 = 0.
#    rfr = RandomForestRegressor(n_estimators=n)
#    for train, test in KFold(len(pred), n_folds=5, shuffle=True):
#        rfr.fit(pred_train, resp_train)
#        rfr_scr_2 += rfr.score(pred_test,resp_test)/5
#    if rfr_scr_2 > max_score:
#        max_score = rfr_scr_2
#        best_n = n
#print("Optimal number of estimators for Random Forest = %d || Model Accuracy = %.2f" % (best_n, max_score))

The for loop could not be executed due to limitations of the WU server; however, the code should give an accurate answer. The default n_estimators seems to perform quite well, as increasing it reduces the model accuracy.

Plotting the OOB accuracy for ensembles of 3 to 20 trees.

In [43]:
errors = []
num_trees = [3,4,5,6,7,8,9,10,15,20]
for i in num_trees:
    bgg = BaggingRegressor(n_estimators=i, oob_score=True)
    bgg.fit(pred_train, resp_train)
    rfr = RandomForestRegressor(n_estimators=i, oob_score=True)
    rfr.fit(pred_train, resp_train)
    errors.append([bgg.oob_score_, rfr.oob_score_])
errors_df = pd.DataFrame(errors, columns=['Tree Bagging','Random Forest'], index=num_trees)
errors_df.plot(ylim=[0,1], figsize=(16,5))

Out[43]: [Figure: OOB score of Tree Bagging and Random Forest by number of trees]

When using ensemble methods based upon bagging, i.e. generating new training sets using sampling with replacement, part of the training set remains unused. For each classifier in the ensemble, a different part of the training set is left out. This left out portion can be used to estimate the generalization error without having to rely on a separate validation set. This estimate comes "for free" as no additional data is needed and can be used for model selection.
Ref: http://scikit-learn.org/stable/modules/grid_search.html#out-of-bag-estimates

OOB accuracy plays the role of cross validation for ensemble methods. We can see here that beyond n_estimators = 10 the model improves very little, and beyond 15 the improvement becomes almost irrelevant. The practical maximum of n_estimators for these 2 methods is therefore 15, after which further tuning of this parameter does not pay off.
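To see why an out-of-bag estimate is possible at all, it helps to check how much of the training set a single bootstrap sample leaves out. The sketch below is a plain-Python simulation with an arbitrary row count and seed (both assumptions); for n rows the left-out share is (1 - 1/n)^n, which approaches 1/e, about 36.8 percent.

```python
import math
import random

random.seed(0)
n_rows = 10000

# Draw n_rows indices with replacement, as bagging does for one estimator.
drawn = {random.randrange(n_rows) for _ in range(n_rows)}

# Rows that were never drawn are "out of bag" for that estimator.
oob_fraction = 1.0 - len(drawn) / n_rows
expected_limit = math.exp(-1)
print(round(oob_fraction, 3), round(expected_limit, 3))
```

Each estimator therefore has roughly a third of the rows available as unseen data, and the OOB score aggregates predictions made only by estimators that did not train on the row in question.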

2. Boosting Methods

In boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

(a) Ada Boosting Model

In [44]:
# Using the Ada Boost Regressor as the prediction model
from sklearn.ensemble import AdaBoostRegressor
abr = AdaBoostRegressor(n_estimators=10)
abr.fit(pred_train, resp_train)
abr_scr_2 = abr.score(pred_test, resp_test)
"Ada Boost Model Accuracy: %.2f percent" % (100*abr_scr_2)

Out[44]: 'Ada Boost Model Accuracy: 73.02 percent'

(b) Gradient Boosting Model

In [45]:
# Using the Gradient Boost Regressor as the prediction model
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=10)
gbr.fit(pred_train, resp_train)
gbr_scr_2 = gbr.score(pred_test, resp_test)
"Gradient Boost Model Accuracy: %.2f percent" % (100*gbr_scr_2)

Out[45]: 'Gradient Boost Model Accuracy: 65.18 percent'

Summary of Methods

In [46]:
names = ['Linear Model |','Decision Tree |','Bagging Tree |','Random Forests |','Ada Boosting |','Gradient Boosting']
score1 = [lr_scr_1, dtr_scr_1, bgr_scr_1, rfr_scr_1, abr_scr_1, gbr_scr_1]
score2 = [lr_scr_2, dtr_scr_2, bgr_scr_2, rfr_scr_2, abr_scr_2, gbr_scr_2]
pd.DataFrame([score1,score2], columns=names, index=['Predictor Set 1','Predictor Set 2'])

Out[46]:
                 Linear Model |  Decision Tree |  Bagging Tree |  Random Forests |  Ada Boosting |  Gradient Boosting
Predictor Set 1        0.931023         0.994541        0.999599          0.999518        0.780348           0.752789
Predictor Set 2        0.762721         0.932484        0.960985          0.960883        0.730174           0.651788

Author: Kushal Agrawal Thank you :)
