Three Practical Approaches to Effective and Transparent Modelling with “Big Data”
Dr. Gerald Fahner, Senior Director, Analytic Science, FICO
Big Data and Machine Learning create new opportunities:
• Predict consumer behavior more accurately
Unlocking the value requires transparent modelling:
• Comprehensible models and justifiable decisions
Data-driven machine learning + Human domain expertise → Valuable Credit Decisions
Outline
• Balance data-driven machine learning with domain expertise
• Combine structured and unstructured text data for scoring and insight
• Understand causal effects of business treatments on consumer behavior
Raw Machine Learning Approach to Scoring
Tree Ensemble Model (Random Forest, Gradient Boosting)
[Figure: Training data (predictors and outcomes) is used to fit Tree 1, Tree 2, ..., Tree 500. The prediction function combines the predictions from the 500 trees into a score. A new case is scored from its predictors.]
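The idea of combining predictions from many trees can be sketched in a few lines. This is a minimal illustration only, assuming bagged one-split trees (stumps) on a made-up data set; it is not the Random Forest or Gradient Boosting implementation behind the slides, and the names `fit_stump`, `fit_ensemble`, `score`, and the toy predictors are hypothetical.

```python
import random

def fit_stump(rows):
    """Fit a one-split tree: choose the (feature, threshold) with lowest squared error."""
    best = None
    for j in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[j]
            left = [y for xi, y in rows if xi[j] <= t]
            right = [y for xi, y in rows if xi[j] > t]
            if not left or not right:
                continue  # not a real split
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if best is None or err < best[0]:
                best = (err, j, t, ml, mr)
    if best is None:
        # No split possible (all sampled rows identical): predict the mean outcome.
        m = sum(y for _, y in rows) / len(rows)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, x):
    j, t, ml, mr = stump
    return ml if x[j] <= t else mr

def fit_ensemble(rows, n_trees=500, seed=0):
    """Fit each tree on a bootstrap sample (sampling rows with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def score(ensemble, x):
    """Combine the trees by averaging their predictions."""
    return sum(predict_stump(s, x) for s in ensemble) / len(ensemble)

# Hypothetical training data: ((debt_ratio, months_on_file), outcome) with 1 = bad.
training = [
    ((0.10, 120), 0), ((0.15, 90), 0), ((0.20, 60), 0), ((0.30, 48), 0),
    ((0.45, 24), 1), ((0.50, 12), 1), ((0.55, 18), 1), ((0.60, 6), 1),
]
ensemble = fit_ensemble(training, n_trees=500)
risky_score = score(ensemble, (0.58, 10))   # new case resembling the bads
safe_score = score(ensemble, (0.12, 100))   # new case resembling the goods
```

Here a higher score means higher estimated risk. Real tree ensembles grow deeper trees and, for gradient boosting, fit trees sequentially rather than independently.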
Machine Learning Model Beats Traditional Model
Home Equity Loan portfolio example
[Figure: % Bads Rejected (0 to 100%) versus % Goods Rejected (0 to 100%), with one curve for Machine Learning and one for Logistic Regression.]
At 10% Goods rejected:
• Logreg rejects 63% Bads
• Machine Learning rejects 85% Bads
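A comparison such as "at 10% Goods rejected, the model rejects X% of Bads" can be read off a list of scores. The sketch below shows one way to compute it, assuming that a higher score means higher risk and that the riskiest cases are rejected first; the function name `bads_rejected_at` and the scores are hypothetical, not data from the portfolio above.

```python
def bads_rejected_at(scores, is_bad, goods_rejected=0.10):
    """Fraction of bads rejected at the cutoff that rejects `goods_rejected` of the goods.

    Assumes higher score = higher risk. Ties at the cutoff are all rejected,
    so slightly more goods than requested may be rejected when scores tie.
    """
    goods = sorted((s for s, b in zip(scores, is_bad) if not b), reverse=True)
    bads = [s for s, b in zip(scores, is_bad) if b]
    k = int(goods_rejected * len(goods) + 1e-9)  # number of goods to reject
    if k == 0:
        # Reject only cases scoring above every good.
        return sum(s > goods[0] for s in bads) / len(bads)
    cutoff = goods[k - 1]
    return sum(s >= cutoff for s in bads) / len(bads)

# Hypothetical scores: 10 goods followed by 4 bads.
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90,
          0.50, 0.95, 0.92, 0.30]
is_bad = [0] * 10 + [1] * 4
rate = bads_rejected_at(scores, is_bad, goods_rejected=0.10)
```

Running the same function on the scores of two competing models, at the same goods-rejected level, gives the kind of side-by-side figures quoted on the slide.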
But Machine Learning Model is a Black Box
[Figure: tree diagrams from the ensemble with numbered nodes.]
PROS and CONS of Using Raw Machine Learning Tools for Credit Scoring
PROS
• Highly accurate fit to data
• Discovery of unexpected associations
• Automated, productive analysis
CONS
• Vulnerable to data limitations
• May capture non-intuitive associations
• Hard to impose domain expertise
Embrace, but interpret with caution
Examples of Non-intuitive Associations that Could Lead to Unjustifiable Decisions
Loan Application
• Association (after controlling for all else): Consumers with 10% debt ratio have higher risk than consumers with 30% debt ratio
• Could lead to decision: Applicant is rejected because her debt ratio is not high enough(!?)
Mortgage Lending
• Association (after controlling for all else): Consumers without previous mortgage have higher risk than consumers with previous mortgage
• Could lead to decision: Applicant can’t get a mortgage because he doesn’t have a mortgage(!?)
How to repair such “defects” in machine learning models?
“Scorecardizer” Approach[1]: Converts Black Box Model Into Powerful, Comprehensible Scorecards
Step I: Train Machine Learning model
• Data → Tree 1, Tree 2, ..., Tree 500 → Combine
• Pros and cons as discussed
Step II: Convert to comprehensible (segmented) Scorecard(s), using domain expertise
• [Figure: segmentation tree with segments such as Current, Thick Files, Thin Files, and Young, leading to Scorecard 1, Scorecard 2, ..., Scorecard 4.]
+ Impose expertise to warrant justifiable decisions
+ Highly automated segmented scorecard build
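One common way to impose domain expertise on a scorecard characteristic is to force its bin-level risk to be monotonic, for example non-decreasing in debt ratio, so that a lower debt ratio can never be penalized. The sketch below uses the pool-adjacent-violators algorithm (isotonic regression) for this. It is an illustrative assumption and not a description of how the Scorecardizer works; the bin bad rates and counts are made up.

```python
def pool_adjacent_violators(values, weights):
    """Weighted least-squares fit of a non-decreasing sequence to `values`."""
    blocks = []  # each block: [weighted mean, total weight, number of bins pooled]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Pool backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, n2 = blocks.pop()
            v1, w1, n1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    fitted = []
    for v, _, n in blocks:
        fitted.extend([v] * n)
    return fitted

# Hypothetical bad rates by increasing debt-ratio bin. The first bin shows the
# non-intuitive pattern: the lowest debt ratio looks riskier than the next bins.
bad_rates = [0.08, 0.04, 0.05, 0.09, 0.15]
counts = [100, 300, 300, 200, 100]
constrained = pool_adjacent_violators(bad_rates, counts)
```

With these inputs the first three bins are pooled to a common rate of about 0.05, and the remaining bins are left unchanged, which removes the reversal while staying as close to the data as the constraint allows.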
Results for Home Equity Loan Portfolio
5-fold cross-validated AUC:
• Tree Ensemble (not comprehensible): 0.96
• Scorecardizer (comprehensible): 0.94
• Traditional Scorecard (comprehensible): 0.91
Segmented Scorecard Solution by Scorecardizer
• Segmentation variables: Age of oldest credit line (months), Value of current property ($)
• Resulting in 5 segmented scorecards (S/c #1 to S/c #5)
[Figure: % Bads Rejected (0 to 100) versus % Goods Rejected (0 to 100).]
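For readers who want to reproduce this kind of comparison, the sketch below shows how a k-fold cross-validated AUC can be computed. It assumes that a higher score means higher risk and uses the pairwise (Mann-Whitney) definition of AUC; the function names, the toy data, and the placeholder `fit` function are hypothetical and unrelated to the models reported above.

```python
import random

def auc(scores, is_bad):
    """Probability that a random bad scores higher than a random good (ties count 1/2)."""
    bads = [s for s, b in zip(scores, is_bad) if b]
    goods = [s for s, b in zip(scores, is_bad) if not b]
    wins = sum(1.0 if sb > sg else 0.5 if sb == sg else 0.0
               for sb in bads for sg in goods)
    return wins / (len(bads) * len(goods))

def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_auc(rows, labels, fit, k=5):
    """Average held-out AUC; `fit(rows, labels)` must return a scoring function.

    Each held-out fold needs at least one good and one bad.
    """
    aucs = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        scores = [model(rows[i]) for i in held_out]
        aucs.append(auc(scores, [labels[i] for i in held_out]))
    return sum(aucs) / len(aucs)

# Toy data where the single predictor separates goods and bads perfectly.
rows = [(float(i % 2),) for i in range(100)]
labels = [i % 2 for i in range(100)]
# Placeholder "model" that ignores the training data and scores by the predictor.
cv_auc = cross_validated_auc(rows, labels, fit=lambda X, y: (lambda x: x[0]))
```

The pairwise AUC is quadratic in the number of cases, which is fine for a sketch but slow for large portfolios, where a rank-based computation is the usual choice.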
Balance Data-driven Machine Learning With Domain Expertise
Takeaways
• Embrace data-driven machine learning, but brace for incomprehensible aspects of these models.
• Domain experts often need to refine credit scoring models before deployment.
Text Data Sources are Ubiquitous
Call center records, claims, public records, collector notes, emails, blogs, social data, freeform comments, reviews, webpages, product descriptions, transcribed phone calls, news articles…
How to leverage the predictive value of text data for comprehensible models and justifiable decisions?
Semantic Scorecards & Topic Analysis
Origination Risk Score Development for Peer-to-Peer Lending Network
Traditional origination data (structured) → Regular variable generation
Free-form loan descriptions → Text feature and topic generation
Examples of free-form loan descriptions:
• “Hi, thanks for considering my request. I’m a student in Southern California. I have a great credit score. I will use this loan to pay my rent, books and tuition expenses. I’ve secured a part time job. My federal loans take care of all the rest. Thanks!”
• “I need this loan to pay off higher rate credit card debt - fixed rate at 15%, that’s the only card I use.”
• “Last summer we’ve extended our business for second-hand furniture into Southern Florida. We’ve been growing nice there. We need $4,000 to do enlarge showroom. Expecting to double sales thereafter. Sincerely, D.J.”
Semantic Scorecard: combines regular variables and text-based features in a comprehensible format
Topic Analysis Yielded New Insight Into Risk
Example of a low risk topic: “Credit card consolidation”
debt, free, consolidating, consolidated, card, credit, revolving, paying, payoff, sooner, quicker, clear, accumulated, accrued, completely, …
Example of a high risk topic: “Business-related”
business, equipment, sales, capital, store, marketing, experience, location, expand, owner, retail, advertising, partner, inventory, products, profit, shop, restaurant, ...
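To make the idea of a scorecard that mixes regular variables with text-based features concrete, here is a minimal sketch. It assumes a simple keyword-share feature per topic as a stand-in for a real topic model, and an additive points table in which a higher total means lower risk. The word lists are a subset of the topic words above; the thresholds, point values, and the function names `topic_features` and `semantic_score` are hypothetical.

```python
import re

# Subset of the topic words shown on the slide; a real topic model would learn these.
TOPIC_WORDS = {
    "credit_card_consolidation": {"debt", "consolidating", "consolidated", "card",
                                  "credit", "revolving", "payoff"},
    "business_related": {"business", "equipment", "sales", "capital", "store",
                         "inventory", "profit"},
}

def topic_features(text):
    """Share of words in the description that belong to each topic's word list."""
    words = re.findall(r"[a-z]+", text.lower())
    n = max(len(words), 1)
    return {topic: sum(w in vocab for w in words) / n
            for topic, vocab in TOPIC_WORDS.items()}

def semantic_score(debt_ratio, text):
    """Additive scorecard: every characteristic contributes visible points."""
    feats = topic_features(text)
    points = [
        ("debt_ratio", -20 if debt_ratio > 0.4 else 10),                      # regular variable
        ("credit_card_consolidation",
         15 if feats["credit_card_consolidation"] > 0.05 else 0),             # low risk topic
        ("business_related",
         -15 if feats["business_related"] > 0.05 else 0),                     # high risk topic
    ]
    return sum(p for _, p in points), points

total_a, breakdown_a = semantic_score(
    0.2, "I need this loan to pay off higher rate credit card debt")
total_b, breakdown_b = semantic_score(
    0.2, "We need capital to expand our business and buy inventory for the store")
```

Because the total is a sum of per-characteristic points, the breakdown shows exactly which text feature or regular variable moved the score, which is the property that keeps such a model comprehensible.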
Predictive Value of Free-form Loan Descriptions
[Figure: Fraction Bads Rejected versus Fraction Goods Rejected for three models.]
• Winner: Regular and text data
• Runner-up: Regular data only
• For comparison: Text data only
Combine Structured and Unstructured Text Data for Scoring and Insight
Takeaways
• Leverage insights and value from new text data sources.
• Models for credit risk leveraging text data should be as comprehensible and justifiable as traditional scorecards.
Correlational Versus Causal Models
• Correlational predictions: What will happen?
  - What’s the likelihood that this loan will default?
  - Business-as-usual (BAU) data is all we need (aka “Big Data exhaust”)
• Causal predictions: What will happen IF we give treatment A, B, C?
  - How will increasing the credit line for this account affect its default likelihood?
  - Perform controlled randomized experiments when possible (the gold standard).
  - Use transparent methodology to understand the limitations of BAU data and to make the best use of it.
Price-Response Curves for Installment Loan Pricing Optimization
Based on BAU Data and Method of Propensity Score Matching[2]
[Figure: price-response curves showing Take Probability, one curve per risk score band.]
• Demand is least price sensitive for lowest risk score band
• Demand is most price sensitive for highest risk score band
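The matching step behind such an analysis can be illustrated with a small sketch. It assumes the propensity score (the estimated probability of receiving the treatment, here a higher price) has already been computed for every unit, and it uses one-to-one nearest-neighbor matching with replacement and a caliper. This is a generic textbook variant, not necessarily the procedure of [2]; the function name `matched_effect` and all data values are hypothetical.

```python
def matched_effect(units, caliper=0.05):
    """Estimate the average treatment effect on the treated by propensity matching.

    units: list of (propensity, treated, outcome) tuples.
    Each treated unit is matched to the control with the closest propensity;
    treated units with no control within `caliper` are dropped.
    """
    treated = [(p, y) for p, t, y in units if t]
    controls = [(p, y) for p, t, y in units if not t]
    diffs = []
    for p, y in treated:
        pc, yc = min(controls, key=lambda c: abs(c[0] - p))
        if abs(pc - p) <= caliper:
            diffs.append(y - yc)
    if not diffs:
        raise ValueError("no treated unit has a control within the caliper")
    return sum(diffs) / len(diffs)

# Hypothetical offers: (propensity of higher price, got higher price, took the loan).
offers = [
    (0.20, 1, 1), (0.22, 0, 1),
    (0.40, 1, 0), (0.41, 0, 1),
    (0.60, 1, 0), (0.58, 0, 1),
    (0.80, 1, 0), (0.79, 0, 0),
    (0.95, 1, 1),  # no comparable control: dropped by the caliper
]
effect = matched_effect(offers, caliper=0.05)
```

In this toy data the higher price lowers the take probability by 0.5 among the matched cases. Repeating the estimate within each risk score band, and at several price levels, is one way to trace out curves like those described above. The dropped unit also illustrates a BAU data limitation: where no comparable untreated cases exist, the data cannot support a causal estimate.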
Understand Causal Effects of Business Treatments on Consumer Behavior
Takeaways
• Distinguishing between correlation and causation is very valuable for improving credit decisions.
• Use robust and transparent analytic methodology to infer causal effects from BAU data.
References
[1] G. Fahner, “Imposing Domain Knowledge on Algorithmic Learning”, Credit Scoring and Credit Control XIII conference, Edinburgh, 2013. http://www.business-school.ed.ac.uk/waf/crc_archive/2013/33.pdf
[2] G. Fahner, “Estimating Causal Effects of Credit Decisions Using Propensity Score Methodologies”, Credit Scoring and Credit Control XI conference, Edinburgh, 2009. (Awarded “Best Paper” at the conference) http://www.business-school.ed.ac.uk/crc/conferences/conference-archive?a=45876
Thank You!
Dr. Gerald Fahner, Senior Director, Analytic Science, FICO