Imposing Domain Knowledge on Algorithmic Learning An Effective Approach to Construct Deployable Predictive Models

Dr. Gerald Fahner FICO August 28, 2013

Confidential. This presentation is provided for the recipient only and cannot be reproduced or shared without Fair Isaac Corporation's express consent.


© 2013 Fair Isaac Corporation.

Objective
» What lenders seek from models:
  1. Accurate predictions of customer behavior
  2. Insights into fitted relationships
  3. Ability to impose domain knowledge before deploying models
» Algorithmic learning procedures address 1. and 2., but pay scant attention to 3.
» As a consequence, algorithmic models may not be deployable for some credit scoring applications
➢ We propose an effective approach to impose domain knowledge on algorithmic learning that paves the way for deployment


Agenda » Algorithmic Learning Illustration » Case Study - Modeling US Home Equity Data » Diagnosing Complex Models

» Algorithmic Learning-Informed Construction of Palatable Scorecards » Summary


Sketch of Tree Ensemble Models
[Figure: data (x1, x2, y) is fed to trees 1 through M; an aggregation mechanism combines their outputs into an ensemble score that estimates the prediction function (x1, x2) -> E[y].]
Examples: Random Forest [1], Stochastic Gradient Boosting [2]
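The aggregation mechanism sketched above can be illustrated with a toy gradient boosting loop built from one-feature stumps in plain NumPy. This is a minimal sketch, not FICO's implementation: the stump search, learning rate, and data are all invented for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 regression tree (stump) by scanning candidate
    split points for the best squared-error split."""
    best = None
    for s in np.quantile(x, np.linspace(0.1, 0.9, 9)):
        left, right = residual[x <= s], residual[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, lval, rval = best
    return lambda z: np.where(z <= s, lval, rval)

def boost(x, y, n_trees=50, lr=0.1):
    """Gradient boosting for squared error: each stump fits the current
    residuals; the ensemble score is the sum of shrunken stump outputs."""
    f0 = y.mean()
    pred = np.full_like(y, f0, dtype=float)
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)   # fit the residual error
        pred = pred + lr * stump(x)      # aggregation mechanism
        stumps.append(stump)
    return lambda z: f0 + lr * sum(t(z) for t in stumps)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 300)
y = np.sin(2 * x) + rng.normal(0, 0.3, 300)
model = boost(x, y)
mse = np.mean((model(x) - y) ** 2)
```

Random Forest [1] instead trains each tree independently on a bootstrap sample and averages; the boosting variant above adds trees sequentially against residuals.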

Demonstration Problem
Simulated noisy data from an underlying "true" prediction function to create a synthetic development data set.
[Figure: surface plot of the true prediction function over x1, x2 ∈ [-2, 2], score range -10 to 10, with the simulated noisy data points overlaid.]
Next fit Stochastic Gradient Boosting to approximate the data.
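A sketch of such a synthetic development data set. The actual surface behind the slide is not disclosed, so the function below is a hypothetical stand-in with the same axes and rough score range:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Sample (x1, x2) over [-2, 2]^2, matching the plot's axes.
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)

def true_f(x1, x2):
    # Hypothetical smooth, bumpy surface; any such 2-D function
    # serves the demonstration.
    return 5 * np.sin(x1 * x2) + 3 * x1

y = true_f(x1, x2) + rng.normal(0, 1.0, n)   # add observation noise
X = np.column_stack([x1, x2])
```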

Stochastic Gradient Boosting: 1 Tree
[Figure: fitted prediction function (left) and residual error (right), surfaces over x1, x2 ∈ [-2, 2].]

Stochastic Gradient Boosting: 5 Trees
[Figure: fitted prediction function (left) and residual error (right), surfaces over x1, x2 ∈ [-2, 2].]

Stochastic Gradient Boosting: 25 Trees
[Figure: fitted prediction function (left) and residual error (right), surfaces over x1, x2 ∈ [-2, 2].]

Stochastic Gradient Boosting: 100 Trees
[Figure: fitted prediction function (left) and residual error (right), surfaces over x1, x2 ∈ [-2, 2].]

Stochastic Gradient Boosting: 200 Trees
[Figure: fitted prediction function (left) and residual error, now pure noise (right), surfaces over x1, x2 ∈ [-2, 2].]
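The progression across the last five slides can be reproduced with scikit-learn's `GradientBoostingRegressor` and its `staged_predict` method, which scores the ensemble after each tree. The data-generating function is the same hypothetical stand-in used earlier; `subsample < 1.0` is what makes the boosting *stochastic* [2]:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (500, 2))
y = 5 * np.sin(X[:, 0] * X[:, 1]) + 3 * X[:, 0] + rng.normal(0, 1.0, 500)

sgb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=3, subsample=0.5, random_state=0)
sgb.fit(X, y)

# Training error after 1, 5, 25, 100, 200 trees, mirroring the slides:
# the residual error shrinks toward pure noise as trees are added.
stages = {1, 5, 25, 100, 200}
errors = {}
for i, pred in enumerate(sgb.staged_predict(X), start=1):
    if i in stages:
        errors[i] = np.mean((y - pred) ** 2)
```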

Agenda » Algorithmic Learning Illustration » Case Study - Modeling US Home Equity Data » Diagnosing Complex Models

» Algorithmic Learning-Informed Construction of Palatable Scorecards » Summary


Problem and Data
Predict likelihood of default on US home equity loans. Binary Good/Bad target.
Sample size: N(total) = 5,960; N(bad) = 1,189. 12 predictor candidates:

Predictor    Description
REASON       'Home improvement' or 'debt consolidation'
JOB          Six occupational categories
LOAN         Amount of loan request
MORTDUE      Amount due on existing mortgage
VALUE        Value of current property
DEBTINC      Debt-to-income ratio
YOJ          Years at present job
DEROG        Number of major derogatory reports
CLNO         Number of trade lines
DELINQ       Number of delinquent trade lines
CLAGE        Age of oldest trade line in months
NINQ         Number of recent credit inquiries

Data source as of 07/25/2013: http://old.cba.ua.edu/~mhardin/DATAMiningdatasets2/hmeq.xls

Stochastic Gradient Boosting (SGB) Experiments
» Use SGB to approximate log(Odds) of being "Good" [3]
» Use Area Under Curve (AUC) as the evaluation measure throughout, with 5-fold cross-validation
Two models compared:
1. Model that is additive in the predictors (trees have 2 leaves)
2. Model capable of capturing interactions between predictors (trees have 15 leaves)
➢ Interactions appear to be substantial
[Figure: 5-fold cross-validated AUC, additive vs. with interactions.]
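The additive-vs-interactions comparison can be sketched with scikit-learn, where `max_leaf_nodes` controls tree size exactly as on the slide. Synthetic data with a built-in interaction stands in for the real home equity file:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the home equity data; the target
# depends strongly on an interaction between two predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
logit = 0.5 * X[:, 0] + 0.5 * X[:, 1] - 2.0 * X[:, 0] * X[:, 1]
y = (rng.uniform(size=2000) < 1 / (1 + np.exp(-logit))).astype(int)

# 1. Additive model: 2-leaf trees (stumps) cannot represent interactions.
additive = GradientBoostingClassifier(max_leaf_nodes=2, n_estimators=100,
                                      random_state=0)
# 2. Interaction-capable model: 15-leaf trees.
interact = GradientBoostingClassifier(max_leaf_nodes=15, n_estimators=100,
                                      random_state=0)

auc_add = cross_val_score(additive, X, y, cv=5, scoring="roc_auc").mean()
auc_int = cross_val_score(interact, X, y, cv=5, scoring="roc_auc").mean()
```

When the gap between `auc_int` and `auc_add` is substantial, as on the slide, interactions matter.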

Eminently Useful Model Diagnostics (following [4])

Variable Importance (relative to most important variable):
DEBTINC 1.00, DELINQ 0.49, VALUE 0.45, CLAGE 0.40, DEROG 0.34, LOAN 0.25, CLNO 0.24, MORTDUE 0.23, NINQ 0.23, JOB 0.20, YOJ 0.20, REASON 0.07

Interaction Test Statistics (whether a variable interacts with any other variables):
DEBTINC 0.029, VALUE 0.022, CLNO 0.014, CLAGE 0.013, DELINQ 0.010, YOJ 0.008, MORTDUE 0.007, LOAN 0.007, DEROG 0.006, JOB 0.006
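A sketch of the variable importance half of these diagnostics, using scikit-learn's impurity-based `feature_importances_` rescaled to the slide's "relative to most important variable" convention. The variable names are hypothetical stand-ins mapped onto synthetic features; the interaction test statistic (Friedman's H, from [4]) is not built into scikit-learn and is omitted here:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
# Feature 0 dominates; feature 3 is pure noise.
logit = 2 * X[:, 0] + X[:, 1] - X[:, 0] * X[:, 1]
y = (rng.uniform(size=2000) < 1 / (1 + np.exp(-logit))).astype(int)

names = ["DEBTINC", "DELINQ", "VALUE", "CLAGE"]  # hypothetical mapping
model = GradientBoostingClassifier(max_leaf_nodes=15, n_estimators=200,
                                   random_state=0).fit(X, y)

# Rescale so the most important variable scores 1.00.
imp = model.feature_importances_
rel = dict(zip(names, imp / imp.max()))
```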

Black Box Busters: Diagnosing Predictive Relationships Learned by Complex Models
Score out the complex model at regular grid points across the range of one predictor at a time (here x1), while keeping all other predictors fixed. Here we fix the other predictors to their values for observation 1: sweep through the range of x1 in (x1, x12, ..., x1P) and pass each point through the trained SGB model.
[Figure: curve 1, the score (log-odds) as a function of x1 after centering, for x1 in [-2, 2], with the other predictors fixed to their values for observation 1.]

Diagnosing Predictive Relationships Learned by Complex Models
Repeat the sweep with the other predictors fixed to their values for observations 1, 2, 3: pass (x1, x12, ..., x1P), (x1, x22, ..., x2P), and (x1, x32, ..., x3P) through the trained SGB model.
[Figure: curves 1, 2, 3 of the centered score (log-odds) as a function of x1, for x1 in [-2, 2].]

Conditioning on Observations 1 to N
Sweep through the range of x1 with the other predictors fixed to their values for each observation in turn, from (x1, x12, ..., x1P) through (x1, xN2, ..., xNP), passing each point through the trained SGB model.
[Figure: curves 1 to N of the centered score (log-odds) as a function of x1, for x1 in [-2, 2].]
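Computing these N conditional curves (now usually called individual conditional expectation, or ICE, curves) is a few lines of NumPy. The model and data below are the synthetic stand-ins used earlier:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (300, 2))
y = 5 * np.sin(X[:, 0] * X[:, 1]) + 3 * X[:, 0] + rng.normal(0, 1.0, 300)
sgb = GradientBoostingRegressor(subsample=0.5, random_state=0).fit(X, y)

grid = np.linspace(-2, 2, 21)        # regular grid across the range of x1
curves = np.empty((len(X), len(grid)))
for i, row in enumerate(X):          # condition on observation i
    sweep = np.tile(row, (len(grid), 1))
    sweep[:, 0] = grid               # sweep x1, hold the others fixed
    curves[i] = sgb.predict(sweep)

curves -= curves.mean(axis=1, keepdims=True)   # center each curve
```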

Partial Dependence Plot
Defined in [4] as the sample average over the N curves.
[Figure: left, the N centered curves; right, their average, the partial dependence plot; both over the range of x1 in [-2, 2].]
» When x1 enters the model additively, this plot summarizes the influence of x1 on the score well.
» When x1 participates in significant interactions, explore the relationship further by creating two-dimensional partial dependence plots.
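Averaging the per-observation curves yields the partial dependence plot directly. A sketch on the same synthetic model as before (scikit-learn's `sklearn.inspection.partial_dependence` computes the same quantity, modulo grid choice):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (300, 2))
y = 5 * np.sin(X[:, 0] * X[:, 1]) + 3 * X[:, 0] + rng.normal(0, 1.0, 300)
sgb = GradientBoostingRegressor(subsample=0.5, random_state=0).fit(X, y)

grid = np.linspace(-2, 2, 21)
# Sample average over the N per-observation curves, as in [4].
pdp = np.zeros(len(grid))
for row in X:
    sweep = np.tile(row, (len(grid), 1))
    sweep[:, 0] = grid
    pdp += sgb.predict(sweep)
pdp /= len(X)
```

Because the simulated target contains an additive 3*x1 term, the resulting plot rises with x1, which is exactly the additive influence the slide says a one-dimensional plot summarizes well.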

One-dimensional Partial Dependence Plots: Examples from Case Study
[Figure: partial dependence (the higher, the better) and Weight of Evidence with bin counts, for Debt-to-income ratio and for Number of recent inquiries; each has a separate Missing bin. Both plots show regions where the score increases counterintuitively: concerned about the increasing score?]

Two-dimensional Partial Dependence Plot
Interaction between Debt-to-income ratio and Property value.
[Figure: two-dimensional partial dependence surface over Debt-to-income ratio and Property value.]
➢ Property value matters less for larger Debt-to-income ratios.
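The two-dimensional version fixes a pair of predictors to each grid point and averages over the remaining ones. A sketch with hypothetical stand-ins: feature 0 plays the role of DEBTINC, feature 1 of VALUE, and the simulated target makes feature 1 matter less when feature 0 is high, mimicking the slide's finding:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (400, 3))
# Effect of feature 1 shrinks as feature 0 grows: an interaction.
y = 2 * X[:, 0] + (1 - X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=400)
sgb = GradientBoostingRegressor(subsample=0.5, random_state=0).fit(X, y)

g0, g1 = np.linspace(0, 1, 11), np.linspace(0, 1, 11)
pdp2d = np.empty((11, 11))
for i, a in enumerate(g0):
    for j, b in enumerate(g1):
        sweep = X.copy()
        sweep[:, 0] = a                           # fix the pair of interest
        sweep[:, 1] = b
        pdp2d[i, j] = sgb.predict(sweep).mean()   # average over the rest

# Slope of the surface along feature 1, at low vs. high feature 0:
slope_low = pdp2d[0, -1] - pdp2d[0, 0]
slope_high = pdp2d[-1, -1] - pdp2d[-1, 0]
```

A flatter slope at high feature-0 values is the two-dimensional signature of "Property value matters less for larger Debt-to-income ratios".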

Pros and Cons of Algorithmic Learning
Pros:
» Objective, no strong assumptions
» Accurate fit to historic data
» (Close to) push-button procedures
» Informative diagnostics, considerable insights
Cons:
» Predictive relationships may not be palatable
» No obvious way to impose domain expertise
» May not be deployable

Agenda » Algorithmic Learning Illustration » Case Study - Modeling US Home Equity Data » Diagnosing Complex Models

» Algorithmic Learning-Informed Construction of Palatable Scorecards » Summary


Alternative Modeling Strategies to Fit a Scorecard
Structure of score formula:
» Score = sum of staircase functions in 12 binned predictors
» Categorical and missing values receive their own bins
» Continuous predictors are binned into approximately equal-size bins
Comparing two fitting objectives:
1. Approximate log(Odds) of being "Good" (penalized Bernoulli likelihood)
2. Approximate the ensemble score from Stochastic Gradient Boosting (penalized least squares)
Constraints on score formula:
» Apply linear inequality constraints between the score weights of neighboring bins to force monotonically decreasing patterns over these intervals: Debt ratio in [5%, inf); Number of inquiries in [0, inf)
[Figure: 5-fold cross-validated AUC for both objectives against the average number of bins per characteristic (4, 5, 8, 12, 25, 68, 117).]
➢ Constraining staircase functions is an effective approach to avoid counterintuitive score weight patterns
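A one-predictor sketch of a constrained staircase fit. Binning by quantiles gives approximately equal-size bins; enforcing the "neighboring bins may not increase" constraints on a least-squares objective is exactly weighted isotonic regression (solved by pool-adjacent-violators), so scikit-learn's `IsotonicRegression` can stand in for the constrained solver. The data and the noisy "ensemble score" target are invented for the example:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
debtinc = rng.uniform(5, 60, 1000)              # debt-to-income ratio, in %
# Stand-in for the SGB ensemble score: trending down, but noisy,
# so the unconstrained bin weights are not quite monotone.
score = -0.05 * debtinc + rng.normal(0, 0.3, 1000)

# Staircase: ~equal-size bins via quantiles, one weight per bin.
edges = np.quantile(debtinc, np.linspace(0, 1, 11))
bins = np.clip(np.searchsorted(edges, debtinc, side="right") - 1, 0, 9)
raw_w = np.array([score[bins == b].mean() for b in range(10)])

# Inequality constraints between neighboring bins: weights must not
# increase with debt ratio.  Weighted isotonic regression solves this
# constrained least-squares problem exactly.
counts = np.bincount(bins, minlength=10)
iso = IsotonicRegression(increasing=False)
mono_w = iso.fit_transform(np.arange(10), raw_w, sample_weight=counts)
```

The full scorecard repeats this per predictor and fits all staircase functions jointly; this sketch shows only the monotonicity mechanism.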

Conjecture About the Statistical Benefit of Approximating the Ensemble Score
» The ensemble score can be seen as a well-smoothed version of the original dependent variable: it carries less noise than the original target and is also approximately unbiased.
➢ We conjecture that the variance of the estimates is reduced by predicting the ensemble score instead of the binary target. As a result, score performance improves.
Note:
» Using penalized MLE to approximate the original target also reduces the variance of the estimates, but empirically we find this approach to be less effective.

Constrained Scorecards Address Palatability Issues
[Figure: for Debt-to-income ratio and for Number of recent inquiries, the SGB pattern (less palatable) shown next to the constrained scorecard pattern (more palatable).]

Capturing Interactions With Segmented Scorecards: 2-stage Approach for Growing Palatable Scorecard Trees
[Figure: data feeds an ensemble model (trees 1 through M) producing an ensemble score and model diagnostics; the score, the diagnostics, and domain knowledge drive a segmentation tree (e.g. splits on x6 > 630, x1 > 7, x2 > 3, x9 < 55) whose leaves hold palatable scorecards s/c 1 through s/c 5; the segmented scorecard tree is the deployed model.]
Stage I:
» Use algorithmic learning to generate the ensemble score
» Generate model diagnostics
Stage II:
» Recursively grow a segmented scorecard tree to approximate the ensemble score
» Guide split decisions by interaction diagnostics
» May impose domain knowledge via constraints on the score formula

CART-like Greedy Recursive Scorecard Segmentation: Informed by Algorithmic Learning and Domain Knowledge
» Starting from the parent population, the interaction test statistics provide a list of segmentation candidates {DEBTINC, VALUE, CLNO, ...}; domain knowledge can override.
» Develop candidate scorecards for tentative splits and pick the winner*, if any, yielding sub-populations 1 and 2 with their own score weights.
» Fit (constrained) scorecards as palatable approximations of the ensemble score.
» Apply recursively until no improvement, or until reaching a low count threshold.
*An objective function is needed to decide on splitting:
» Least squares is consistent with the scorecard fit objective, but other favorite metrics (e.g. AUC with respect to the original target) can be used to make split decisions.
» Best practice is to evaluate the splitting objective on a validation set (different from the test set).
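The 2-stage idea can be sketched end to end. Several simplifications are made: a shallow regression tree on the ensemble score stands in for the interaction-guided recursive split search, and unconstrained linear models stand in for the per-segment scorecards (real scorecards use constrained staircase functions). Least squares is used for the segment fits, consistent with the scorecard fit objective above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Stage I: algorithmic learning produces the ensemble score.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 4))
logit = X[:, 0] + X[:, 1] - 1.5 * X[:, 0] * X[:, 2]
y = (rng.uniform(size=3000) < 1 / (1 + np.exp(-logit))).astype(int)
sgb = GradientBoostingClassifier(max_leaf_nodes=15, random_state=0).fit(X, y)
ensemble_score = sgb.decision_function(X)          # log-odds scale

# Stage II: grow a shallow segmentation tree on the *ensemble score*,
# then fit one simple model per segment to approximate that score.
seg_tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=300,
                                 random_state=0).fit(X, ensemble_score)
segments = seg_tree.apply(X)

scorecards = {s: LinearRegression().fit(X[segments == s],
                                        ensemble_score[segments == s])
              for s in np.unique(segments)}

approx = np.empty(len(X))
for s, sc in scorecards.items():
    approx[segments == s] = sc.predict(X[segments == s])

resid = ensemble_score - approx
r2_seg = 1 - (resid ** 2).sum() / ((ensemble_score - ensemble_score.mean()) ** 2).sum()
r2_global = LinearRegression().fit(X, ensemble_score).score(X, ensemble_score)
```

In-sample, the segmented approximation can never fit the ensemble score worse than a single global model, since each segment could reuse the global coefficients; the interesting question, as the next slides show, is how much of the SGB performance survives out of sample.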

All Modeling Strategies Compared: Cross-validated Alternative Approaches on the Same Folds
[Figure: 5-fold cross-validated AUC for Stochastic Gradient Boosting (additive, 2 leaves; with interactions, 15 leaves), for scorecard and segmented scorecard fit via the 2-stage approach (approximating the SGB score), and for scorecard and segmented scorecard approximating log(Odds).]

Results With Segmented Scorecards
Case study experiences:
» Split decisions are noisy, but more stable when approximating the SGB score instead of log(Odds)
» Various tree alternatives come up during cross-validation; finding the best tree is an illusion
» The 2-stage approach is effective at producing segmented scorecards with performance close to SGB, while a gap remains
[Figure: example result from the 2-stage approach.]
Academic research on scorecard segmentation:
» Segmentation sometimes but not always helps [5], [6]
Our experiences with big data sets:
» Applied the 2-stage approach to credit scoring and fraud problems with several million observations and 100+ predictors. We often find segmentations outperforming single scorecards by a few % in AUC and KS
» Restricting segmentation candidates based on SGB interaction diagnostics reduces the hypothesis space and accelerates the tree search
» Automated segmentations tend to match intuition (clean/delinquent, young/old, ...) and shave weeks off tedious manual segmentation analysis
» Segmented scorecards at times achieve close to SGB performance; at other times a gap remains

Summary
» Construction of accurate, deployable credit scoring models faces a unique challenge: find a highly predictive model within a flexible yet palatable model family
» Constrained (segmented) scorecards can meet this challenge
» The proposed 2-stage approach combines algorithmic learning and domain knowledge to inform the search for highly predictive, deployable (segmented) scorecards
» We are testing it in projects across credit scoring, application fraud, and insurance claims fraud, and see good potential to increase the effectiveness of practical scoring models

THANK YOU


References
[1] Leo Breiman, "Random Forests," Machine Learning, Vol. 45, No. 1 (Oct. 2001), pp. 5-32.
[2] Jerome Friedman, "Stochastic Gradient Boosting," (March 1999).
[3] Jerome Friedman, "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, Vol. 29, No. 5 (2001), pp. 1189-1232.
[4] Jerome Friedman and Bogdan Popescu, "Predictive Learning via Rule Ensembles," The Annals of Applied Statistics, Vol. 2, No. 3 (2008), pp. 916-954.
[5] Katarzyna Bijak and Lyn Thomas, "Does Segmentation Always Improve Model Performance in Credit Scoring?," Expert Systems with Applications, Vol. 39, No. 3 (Feb. 2012), pp. 2433-2442.
[6] David Hand et al., "Optimal Bipartite Scorecards," Expert Systems with Applications, Vol. 29, No. 3 (Oct. 2005), pp. 684-690.