Data Mining & Machine Learning

Machine Learning and Actuarial Science: Core Concepts
GIORGIO ALFREDO SPEDICATO, PHD FCAS FSA CSPA
UNISACT 2018

Introduction
The terms «machine learning» and «data mining» refer to the use of algorithms to extract insights from large amounts of data.

A common subdivision of the algorithms is: ▪ Supervised learning algorithms: ▪ Regression; ▪ Binary or Multinomial classification

▪ Unsupervised learning: ▪ Clustering ▪ Dimensionality reduction ▪ Association rules, Network analysis, …

«Statistical significance» in the classical sense is of little use in the ML context, because models are trained on very large amounts of data (very high «statistical power»).

ML in Actuarial Science: use cases ▪Fine tuning of frequency and severity modeling for non-life pricing, since ML models better handle interactions between variables and non-linearities. ▪Individual claim reserving («claim level analytics») ▪As above, for retention and conversion modeling

▪Fraud risk assessment ▪Marketing analytics ▪Recommender systems

ML Projects workflow
▪Business scope definition:
▪ business context;
▪ type of approach (supervised / unsupervised), available predictions, potential deployment issues.
▪Data preparation:
▪ ETL: extraction, transformation and load;
▪ initial descriptive analysis (univariate and bivariate plots and statistics, possible variable transformations).
▪Modeling and deployment:
▪ selection of candidate models;
▪ models' fit;
▪ performance assessment;
▪ deployment.

Model validation ▪ML focuses on «predictive» performance instead of «explicative» power. ▪Performance metrics depend on the nature of the outcome: ▪ Regression: RMSE, R², MAE, … ▪ Classification: AUC/GINI, LogLoss, …

▪«Predictive performance» shall be evaluated in terms of generalizability (will the fitted model work well on unseen data?).

Model validation ▪«Hold-out» approach: random (or reasoned) split of the available data into train, validation and/or test sets. Models are fit on the train set, possibly selected on the validation set, and predictive performance is evaluated on the test set. ▪«Cross-validation» approach: ▪ An integer k (e.g. 5, 10, …) is chosen and the original sample is split into k random folds (k-fold CV); ▪ k model «runs» are fit, each time holding out one fold on which predictive performance is calculated; ▪ The estimated predictive performance is the average of the k estimates.

▪k-fold CV is more precise, but more computationally demanding.
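The two validation strategies above can be sketched with scikit-learn; the data set, model and split sizes below are illustrative, not from the slides.

```python
# Hold-out vs. k-fold validation: a minimal sketch using scikit-learn.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Toy regression data (illustrative).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Hold-out: fit on the train split, evaluate on the unseen test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
holdout_r2 = model.score(X_te, y_te)

# k-fold CV (k = 5): k fits, each scored on its held-out fold; average the k scores.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
cv_r2 = cv_scores.mean()
```

Note that `cross_val_score` refits the model k times from scratch, which is why k-fold CV costs roughly k times a single hold-out fit.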

Performance assessment: continuous outcomes

Performance assessment: binary outcomes
▪Confusion matrix (rows: predicted class, columns: actual class):

            Actual P   Actual N
Pred P      TP         FP
Pred N      FN         TN

with TP + FN = P and FP + TN = N.

▪Metrics:
▪ Accuracy = (TP + TN) / (P + N)
▪ Sensitivity / TPR / Recall = TP / (TP + FN)
▪ Specificity / 1 − FPR = TN / (TN + FP)
▪ Precision / PPV = TP / (TP + FP)
▪ Negative predictive value / NPV = TN / (TN + FN)
▪ F1-score = 2 · PPV · TPR / (PPV + TPR)
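The metrics above translate directly into code; the four cell counts below are made up for illustration.

```python
# Confusion-matrix metrics computed from toy counts (illustrative values).
TP, FP, FN, TN = 40, 10, 5, 45
P, N = TP + FN, FP + TN          # column totals of the confusion matrix

accuracy    = (TP + TN) / (P + N)
recall      = TP / (TP + FN)     # sensitivity / TPR
specificity = TN / (TN + FP)     # 1 - FPR
precision   = TP / (TP + FP)     # PPV
npv         = TN / (TN + FN)
f1          = 2 * precision * recall / (precision + recall)
```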

Performance assessment: ROC, AUC and GINI

Performance assessment: loss metrics
Continuous outcomes:
◦ RMSE = √( Σ_{i=1..n} (ŷ_i − y_i)² / n )
◦ R² = 1 − Σ_{i=1..n} (ŷ_i − y_i)² / Σ_{i=1..n} (y_i − ȳ)²
◦ MAE = Σ_{i=1..n} |ŷ_i − y_i| / n

Binary and multinomial outcomes:
◦ Gini = 2·AUC − 1
◦ logLoss = −(1/N) Σ_{i=1..N} Σ_{j=1..M} y_ij · log p_ij, where p_ij is the predicted probability of class j for observation i.
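As a check, the continuous-outcome formulas above can be computed directly with numpy on small illustrative arrays.

```python
# Direct translation of the loss-metric formulas above (toy arrays).
import numpy as np

y    = np.array([3.0, 1.0, 4.0, 2.0])   # observed values (illustrative)
yhat = np.array([2.5, 1.5, 3.5, 2.5])   # predictions (illustrative)

rmse = np.sqrt(np.mean((yhat - y) ** 2))
mae  = np.mean(np.abs(yhat - y))
r2   = 1 - np.sum((yhat - y) ** 2) / np.sum((y - y.mean()) ** 2)

# Binary case: logLoss on predicted probabilities of the positive class.
y_bin = np.array([1, 0, 1, 0])
p     = np.array([0.9, 0.2, 0.8, 0.4])
logloss = -np.mean(y_bin * np.log(p) + (1 - y_bin) * np.log(1 - p))
```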

Suggestions: ◦ Use «hold-out» data when possible. ◦ Try to evaluate the performance of the current approach as a baseline.

Deployment & Life-cycle ▪Implement all data preparation and scoring «on-line» on the production IT infrastructure. ▪Necessary checks: ▪ Reasonableness of results; ▪ Numerical checks in IT testing environments.

▪Models' life cycle: ▪ How often should models be refit on fresher data? ▪ How often should the modeling approach be deeply reviewed due to a changed business environment?

Supervised learning: regression ▪Linear models: ▪ Normal multivariate regression; ▪ Generalized Linear Models (GLM)

▪ Linear Support Vector Machines (SVM).
▪Non-linear models:
▪ Generalized Non-Linear Models;
▪ MARS Splines;
▪ Radial and polynomial SVM;
▪ K Nearest Neighbors (KNN);
▪ (Deep) Neural Networks.
▪Trees:
▪ Single trees: e.g. C5.0;
▪ Bagging: Bagged Trees, Random Forest;
▪ Boosted Trees: gbm, xgboost, lightgbm.

Supervised learning: classification ▪Linear models: ▪ GLM (logistic, multinomial) possibly using non-linear or additive (splines) terms; ▪ Linear/Quadratic Discriminant Analysis

▪Non-linear models:
▪ MARS Splines
▪ KNN
▪ SVM
▪ Naive Bayes
▪ (Deep) Neural Networks

▪Tree based approaches: ▪ Single trees: CHAID, C50 ▪ Bagging (Random Forest) ▪ Boosting (GBM, XGBoost, LightGBM)
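Fitting and comparing a linear and a tree-based classifier from the lists above can be sketched with scikit-learn; the data set and hyperparameters are illustrative.

```python
# Comparing a GLM (logistic regression) and a Random Forest on toy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit each candidate model and score it on hold-out AUC.
aucs = {}
for name, clf in [("glm", LogisticRegression(max_iter=1000)),
                  ("rf", RandomForestClassifier(n_estimators=100, random_state=0))]:
    clf.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

The same pattern (one loop, one shared metric) is what wrapper libraries such as caret, mlr and scikit-learn pipelines automate.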

Unsupervised modeling Clustering: ◦ Hierarchical clustering ◦ KMeans ◦ DBSCAN, OPTICS, …

Dimension reduction: ◦ PCA, Factor Analysis,… ◦ GLRM

Hybrid models: ◦ Arules ◦ Word2vec

Linear Discriminant Analysis

Support Vector Machines ▪A mathematical function (possibly non-linear) creating separating regions in the variables' space. ▪Can be used both in classification and regression problems. ▪Issues: ▪ Computational complexity (O(n³)) ▪ No automatic feature selection ▪ Limited interpretability
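The effect of a non-linear kernel can be shown on a toy data set that no linear boundary separates (concentric rings); the data and kernel choices below are illustrative.

```python
# Linear vs. RBF (radial) kernel on a non-linearly separable toy set.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc    = SVC(kernel="rbf").fit(X, y).score(X, y)  # radial kernel bends the boundary
```

The radial kernel implicitly maps the points into a space where the rings become separable, which is why its training accuracy is far higher here.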

SVM: Kernels

MARS Splines ▪Multivariate Adaptive Regression Splines are based on hinge functions, e.g. k₁·max(0, x − c) + k₂·max(0, c − x), to model the relation between predictors and the outcome. ▪Pros: ▪ Handling both numeric and categorical data, interpretable handling of non-linearity ▪ Allow for feature selection

▪Cons: ▪ More performant ML models exist ▪ Computational complexity reduces their ability to handle large data sets.
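The hinge pair from the definition above is simple enough to write out directly; the knot location and slopes below are arbitrary illustrative values.

```python
# The MARS building block: k1*max(0, x - c) + k2*max(0, c - x).
import numpy as np

def hinge_pair(x, c, k1, k2):
    """Piecewise-linear basis with a knot at c, as used by MARS."""
    return k1 * np.maximum(0, x - c) + k2 * np.maximum(0, c - x)

x = np.array([-2.0, 0.0, 1.0, 3.0])
y = hinge_pair(x, c=1.0, k1=2.0, k2=0.5)   # knot at x = 1, slopes 2 and 0.5
```

A MARS model is a sum of such terms (and their products), with knots and slopes chosen during fitting, which is why its non-linearities stay interpretable.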

MARS Splines

KNN ▪KNN uses the average value of the k nearest neighbors to predict new samples; the best k depends on the data set. ▪Can be used both for regression and classification ▪Pros: ▪ Easy and intuitive

▪Cons: ▪ Usually the fit is inferior to that of other models.

▪ Computational complexity: O(n(d + k)) per prediction.
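Choosing k by cross-validation, as the slide suggests it "depends on the data set", can be sketched as follows; the candidate grid is an illustrative assumption.

```python
# k-NN regression with k chosen by 5-fold cross-validated R2 (toy data).
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=400, n_features=3, noise=5.0, random_state=0)

# Score a small illustrative grid of k values and keep the best one.
scores = {k: cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 5, 15)}
best_k = max(scores, key=scores.get)
```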

KNN

Hierarchical Clustering
General algorithm:
1. A distance metric is defined;
2. All n(n−1)/2 pairwise distances are computed;
3. The closest pair is merged and the algorithm starts again.

▪Pros: ▪ Different distance metrics can be used ▪ Visual output (dendrogram) ▪Cons: ▪ O(n³·d) vs the O(n·k·d) of KMeans ▪ Subjective choice of the distance threshold that defines the clusters.
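The three algorithm steps map directly onto scipy's hierarchical-clustering API; the two-blob data set and the linkage method are illustrative choices.

```python
# Hierarchical clustering with scipy: distances, then iterative merging.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two well-separated blobs (illustrative data).
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])

d = pdist(X, metric="euclidean")        # steps 1-2: all pairwise distances
Z = linkage(d, method="average")        # step 3: repeatedly merge closest pairs
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
```

The `fcluster` cut is where the subjective threshold choice from the cons list enters: here the number of clusters is fixed instead of a distance cutoff.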

Hierarchical Clustering: dendrogram

ARULES
Market basket analysis. Typically used to suggest the most probable element that completes a set (e.g. different insurance covers for personal business).

It infers probabilistic if-then rules from a binary transaction set, in the form «if you own A and B, then you may be interested in C».
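The support and confidence behind such if-then rules can be computed by hand on a toy transaction set; the cover names below are invented for illustration and no arules library is used.

```python
# Minimal support/confidence scoring for the rule «if motor and home, then life».
transactions = [
    {"motor", "home"}, {"motor", "home", "life"}, {"motor"},
    {"home", "life"}, {"motor", "home", "life"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent): strength of the rule 'if A then C'."""
    return support(antecedent | consequent) / support(antecedent)

conf = confidence({"motor", "home"}, {"life"})
```

Association-rule miners such as arules enumerate candidate itemsets and keep only the rules whose support and confidence clear user-chosen thresholds.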

ARULES

TOOLS: H2O ▪Java-based ML library that efficiently implements optimized versions of broadly used ML algorithms: ▪ Supervised: GLM, Random Forest, GBM, XGBoost, Deep Learning, Naive Bayes and Stacked Ensembles; ▪ Unsupervised: PCA, KMeans, GLRM. ▪Features: ▪ Open source tool; ▪ Interfaces with R and Python through dedicated libraries; ▪ Docs: http://docs.h2o.ai/ ▪It can be used: ▪ On a desktop workstation (using a multicore approach); ▪ On PC clusters; ▪ A dedicated version implements GPU calculations.

TOOLS: ML model wrappers
• R and Python ML libraries that allow a tidy implementation of ETL, tuning and model performance assessment.
• Different ML models can be fit and compared using a unified interface.
• Available libraries:
• R: caret and mlr;
• Python: scikit-learn.

ML Interpretability ▪The opacity of ML models has negatively affected their diffusion and popularity in many contexts, despite the fact that they often offer significantly superior performance compared to traditional methods.

▪Thus, recent research has focused on algorithms that ease model assessment and interpretability (global and local). Statistical libraries implementing such algorithms are LIME and DALEX.
▪Tools to ease interpretability are:
▪ Residuals distribution;
▪ Variable importance analysis;
▪ Partial dependency plots;
▪ Predictions breakdown.
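One model-agnostic way to obtain the variable importance mentioned above is permutation importance; the sketch below uses scikit-learn's implementation on illustrative data (DALEX and LIME offer analogous tooling).

```python
# Permutation-based variable importance: shuffle each column on held-out data
# and measure how much the score drops.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# 4 features, only 2 of which carry signal (illustrative).
X, y = make_regression(n_samples=500, n_features=4, n_informative=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
importances = result.importances_mean   # mean score drop per shuffled feature
```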

ML Interpretability: residuals analysis

ML interpretability: variable importance analysis

ML interpretability: marginal effects plot

Partial dependency for groups of categorical predictors
