Machine Learning and Actuarial Science: Core Concepts
GIORGIO ALFREDO SPEDICATO, PHD FCAS FSA CSPA
UNISACT 2018
Introduction The terms «machine learning» and «data mining» refer to the use of algorithms to extract insights from large amounts of data.
A common subdivision of the algorithms is: ▪ Supervised learning algorithms: ▪ Regression; ▪ Binary or multinomial classification
▪ Unsupervised learning: ▪ Clustering ▪ Dimensionality reduction ▪ Association rules, network analysis, …
«Statistical significance» in the classical statistical sense is of limited use in the ML context, because models are trained on very large amounts of data (high «statistical power»).
ML in Actuarial Science: use cases ▪Fine tuning of frequency and severity modeling for non-life pricing, since ML models better handle interactions between variables and non-linearities. ▪Individual claim reserving («claim level analytics») ▪As above, for retention and conversion modeling
▪Fraud risk assessment ▪Marketing analytics ▪Recommender systems
ML Projects workflow ▪Business scope definition: ▪ Business context ▪ Type of approach (supervised/unsupervised), available predictions, potential deployment issues
▪Data preparation: ▪ ETL: extraction, transformation and load; ▪ Initial descriptive analysis (univariate and bivariate plots and statistics, possible variable transformations)
▪Modeling and deployment: ▪ Selection of candidate models ▪ Models' fit ▪ Performance assessment ▪ Deployment
Models validation ▪ML focuses on «predictive» performance instead of «explicative» power. ▪Performance metrics depend on the nature of the outcome: ▪ Regression: RMSE, R², MAE, … ▪ Classification: AUC/Gini, LogLoss, …
▪«Predictive performance» shall be evaluated in terms of generalizability (will the fitted model work well on unseen data?).
Models validation ▪«Hold-out» approach: random (or reasoned) split of the available data into train, validation and/or test sets. Models are fit on the train set, possibly chosen on the validation one, and predictive performance is evaluated on the test set ▪«Cross-validation» approach: ▪ An integer k (e.g. 10, 5, …) is chosen and the original sample is split into k random folds (k-fold CV); ▪ k model «runs» are fit, each time leaving out a «hold-out» fold on which predictive performance is calculated; ▪ The estimated predictive performance is the average of the k estimates.
▪«k-fold» CV is more precise, but more computationally demanding.
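A minimal pure-Python sketch of the k-fold procedure described above; the mean-only «model» and the function name are placeholders for illustration — any real learner would be fit inside the loop:

```python
import random

def kfold_rmse(ys, k=5, seed=42):
    """k-fold CV estimate of out-of-sample RMSE for a mean-only 'model'."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)           # random split into k folds
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        pred = sum(ys[i] for i in train) / len(train)   # "fit" on the train part
        mse = sum((ys[i] - pred) ** 2 for i in fold) / len(fold)
        scores.append(mse ** 0.5)              # hold-out RMSE of this run
    return sum(scores) / k                     # average of the k estimates

print(round(kfold_rmse([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]), 3))
```

The same loop structure applies to any metric: only the fit and scoring lines change.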
Performance assessment: continuous outcomes
Performance assessment: binary outcomes
Confusion matrix (rows = predicted class, columns = actual class):
TP FP
FN TN
with column totals TP+FN = P (actual positives) and FP+TN = N (actual negatives).
▪Metrics: ▪ Accuracy = (TP+TN)/(P+N) ▪ Sensitivity / TPR / Recall = TP/(TP+FN) ▪ Specificity (= 1−FPR) = TN/(TN+FP) ▪ Precision / PPV = TP/(TP+FP) ▪ Negative predictive value (NPV) = TN/(TN+FN) ▪ F1-score = 2·PPV·TPR/(PPV+TPR)
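The formulas above can be checked with a small pure-Python helper (the function name is illustrative):

```python
def binary_metrics(tp, fp, fn, tn):
    """Classification metrics from the four confusion-matrix cells."""
    p, n = tp + fn, fp + tn              # actual positives / negatives
    tpr = tp / (tp + fn)                 # sensitivity / recall
    ppv = tp / (tp + fp)                 # precision
    return {
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": tpr,
        "specificity": tn / (tn + fp),
        "precision": ppv,
        "npv": tn / (tn + fn),
        "f1": 2 * ppv * tpr / (ppv + tpr),
    }

m = binary_metrics(tp=40, fp=10, fn=20, tn=30)
print(m["accuracy"])          # 0.7
print(round(m["f1"], 3))      # 0.727
```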
Performance assessment: ROC, AUC and Gini
Performance assessment: loss metrics
Continuous outcomes:
◦ RMSE = √( Σᵢ (ŷᵢ − yᵢ)² / n )
◦ R² = 1 − Σᵢ (ŷᵢ − yᵢ)² / Σᵢ (yᵢ − ȳ)²
◦ MAE = Σᵢ |ŷᵢ − yᵢ| / n
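A pure-Python sketch of the three continuous-outcome metrics (function name illustrative):

```python
def regression_metrics(y, yhat):
    """RMSE, MAE and R² for a continuous outcome."""
    n = len(y)
    ybar = sum(y) / n
    ss_res = sum((p - t) ** 2 for p, t in zip(yhat, y))
    rmse = (ss_res / n) ** 0.5
    mae = sum(abs(p - t) for p, t in zip(yhat, y)) / n
    r2 = 1 - ss_res / sum((t - ybar) ** 2 for t in y)   # 1 - SS_res / SS_tot
    return rmse, mae, r2

rmse, mae, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(round(mae, 3), round(r2, 3))   # 0.15 0.98
```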
Binary and multinomial outcomes:
◦ Gini = 2·AUC − 1
◦ logLoss = −(1/N) Σᵢ Σⱼ yᵢⱼ log pᵢⱼ (i over the N samples, j over the M classes)
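Both metrics can be sketched in a few lines of pure Python for the binary case; here AUC is computed as the Mann-Whitney rank statistic (an equivalent formulation), and the function names are illustrative:

```python
import math

def log_loss(y, p, eps=1e-15):
    """Binary log-loss; p holds the predicted probabilities of class 1."""
    s = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)   # clip to avoid log(0)
        s += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -s / len(y)

def gini(y, p):
    """Gini = 2*AUC - 1, with AUC as the Mann-Whitney rank statistic."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    auc = wins / (len(pos) * len(neg))
    return 2 * auc - 1

print(gini([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # 1.0 — perfect ranking
print(round(log_loss([1, 0], [0.9, 0.2]), 4))
```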
Suggestions: ◦ Use «hold-out» data when possible ◦ Try to evaluate the performance of the current approach for comparison
Deployment & Life-cycle ▪Implement «on-line» all data preparation and scoring on the production IT infrastructure ▪Necessary checks: ▪ Reasonableness of results ▪ Numerical checks on IT testing environments
▪Models' life cycle: ▪ How often should models be refit on fresher data? ▪ How often should the modeling approach be deeply reviewed, given a changing business environment?
Supervised learning: regression ▪Linear models: ▪ Normal multivariate regression; ▪ Generalized Linear Models (GLM);
▪ Linear Support Vector Machines (SVM). ▪Non-linear models: ▪ Generalized non-linear models; ▪ MARS splines; ▪ Radial and polynomial SVM;
▪ K Nearest Neighbors (KNN) ▪ (Deep) Neural Networks ▪Trees: ▪ Single trees: e.g. C5.0 ▪ Bagging: Bagged Trees, Random Forest ▪ Boosted Trees: gbm, xgboost, lightgbm
Supervised learning: classification ▪Linear models: ▪ GLM (logistic, multinomial) possibly using non-linear or additive (splines) terms; ▪ Linear/Quadratic Discriminant Analysis
▪Non-linear models: ▪ MARS Splines ▪ KNN ▪ SVM ▪ Naive Bayes ▪ (Deep) Neural Networks
▪Tree based approaches: ▪ Single trees: CHAID, C50 ▪ Bagging (Random Forest) ▪ Boosting (GBM, XGBoost, LightGBM)
Unsupervised modeling Clustering: ◦ Hierarchical clustering ◦ KMeans ◦ DBSCAN, OPTICS, …
Dimension reduction: ◦ PCA, Factor Analysis,… ◦ GLRM
Hybrid models: ◦ Arules ◦ Word2vec
Linear Discriminant Analysis
Support Vector Machines ▪A mathematical function (possibly non-linear) creating separating regions in the variables' space. ▪Can be used both in classification and regression problems. ▪Issues: ▪ Computational complexity (O(n³)) ▪ No automatic feature selection ▪ Hard interpretability
SVM: Kernels
MARS Splines ▪Multivariate Adaptive Regression Splines are based on hinge functions, e.g. k₁·max(0, x − c) + k₂·max(0, c − x), to model the relation between predictors and the outcome. ▪Pros: ▪ Handling both numeric and categorical data, interpretable non-linearity handling ▪ Allow for feature selection
▪Cons: ▪ More performant ML models exist ▪ Computational complexity reduces their ability to handle large data sets.
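The hinge pair above can be written directly. In this toy illustration the knot c and the slopes k₁, k₂ are fixed by hand, whereas a real MARS fit selects knots and slopes from the data:

```python
def hinge_pair(x, c, k1, k2):
    """MARS basis pair k1*max(0, x - c) + k2*max(0, c - x): a broken line at knot c."""
    return k1 * max(0.0, x - c) + k2 * max(0.0, c - x)

print(hinge_pair(5.0, c=3.0, k1=2.0, k2=1.0))  # 4.0: slope k1 right of the knot
print(hinge_pair(1.0, c=3.0, k1=2.0, k2=1.0))  # 2.0: slope -k2 left of the knot
```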
MARS Splines
KNN ▪KNN uses the average value of the k nearest neighbors to predict new samples; k is chosen depending on the data set. ▪Can be used both for regression and classification ▪Pros: ▪ Easy and intuitive
▪Cons: ▪ Usually the fit is inferior to that of other models.
▪ Computational complexity: O(n(d + k))
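A toy 1-D KNN regression sketch (function name illustrative; real implementations use multi-dimensional distances and spatial indexes):

```python
def knn_predict(train_x, train_y, x, k=3):
    """Predict as the average outcome of the k nearest training points (1-D toy)."""
    by_dist = sorted(zip(train_x, train_y), key=lambda t: abs(t[0] - x))
    return sum(y for _, y in by_dist[:k]) / k   # average over the k neighbors

xs = [1.0, 2.0, 3.0, 10.0, 11.0]
ys = [1.0, 2.0, 3.0, 10.0, 11.0]
print(knn_predict(xs, ys, x=2.1, k=3))  # 2.0 — average of the three closest ys
```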
KNN
Hierarchical Clustering General algorithm: 1. A distance metric is defined; 2. All n(n−1)/2 pairwise distances are computed; 3. The closest pair is combined and the algorithm starts again.
▪Pros: ▪ Different distance metrics can be used ▪ Visual output (dendrogram) ▪Cons: ▪ O(n³·d) vs the O(n·k·d) of KMeans ▪ Subjective choice of the distance threshold that defines the clusters.
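The three steps above can be sketched as a naive single-linkage agglomeration on 1-D points (illustrative only; the nested loops make visible where the O(n³) cost comes from):

```python
def single_linkage(points, n_clusters=2):
    """Naive agglomerative clustering: repeatedly merge the two closest clusters
    (single linkage = distance between the nearest members)."""
    clusters = [[p] for p in points]                 # start from singletons
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)               # merge the closest pair
    return clusters

print(single_linkage([1.0, 1.2, 5.0, 5.1, 0.9], n_clusters=2))
```

Stopping at a cluster count instead of a distance threshold sidesteps the subjective-threshold issue noted above, at the cost of fixing k in advance.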
Hierarchical Clustering: dendrogram
ARULES Market basket analysis. Typically used to suggest the most probable element that completes a set (e.g. different insurance covers for personal business).
It infers rules from a binary transaction set based on probabilistic measures, producing if-then rules of the form «If you own A and B, then you may be interested in C».
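The support and confidence of a single if-then rule can be computed directly; the toy portfolio and cover names below are invented for illustration:

```python
def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    has_a = [t for t in transactions if antecedent <= t]       # subset test
    has_both = [t for t in has_a if consequent <= t]
    support = len(has_both) / len(transactions)
    confidence = len(has_both) / len(has_a)
    return support, confidence

# Toy policy portfolio: which insurance covers are held together.
portfolio = [
    {"motor", "home"},
    {"motor", "home", "liability"},
    {"motor", "liability"},
    {"home"},
]
s, c = rule_stats(portfolio, antecedent={"motor", "home"}, consequent={"liability"})
print(s, c)  # 0.25 0.5 — the rule holds in 1 of 4 policies, 1 of 2 matching ones
```

Libraries such as arules automate the search over all candidate rules; this sketch only scores one given rule.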
ARULES
TOOLS: H2O ▪Java-based ML library that implements efficiently optimized versions of broadly used ML algorithms: ▪ Supervised: GLM, Random Forest, GBM, XGBoost, Deep Learning, Naive Bayes and Stacked Ensemble. ▪ Unsupervised: PCA, KMEANS, GLRM ▪Features: ▪ Open source tool; ▪ It interfaces with R and Python using dedicated libraries; ▪ Docs: http://docs.h2o.ai/ ▪It can be used: ▪ On a desktop workstation (using a multicore approach); ▪ On PC clusters; ▪ A dedicated version implements GPU calculations.
TOOLS: ML model wrappers • ML libraries that allow a tidy implementation of ETL, tuning and model performance assessment. • Different ML models can be fit and compared using a unified approach.
• Available libraries are: • R: caret and mlr • Python: scikit-learn
ML Interpretability ▪The opacity of ML models has negatively affected their diffusion and popularity in many contexts, despite the significantly superior performance they often offer compared to traditional methods.
▪Thus, recent research has focused on implementing algorithms that ease model assessment and interpretability (global and local). Statistical libraries that implement such algorithms are LIME and DALEX. ▪Tools that ease interpretability are: ▪ Residuals distribution; ▪ Variable importance analysis; ▪ Partial dependency plots; ▪ Predictions breakdown
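Variable importance, the second tool in the list, can be illustrated with a permutation-style check: permute one feature column and measure how much the loss grows. This is a sketch under simplifying assumptions — a deterministic reversal stands in for the usual random shuffle, for reproducibility, and the toy «model» is invented:

```python
def permutation_importance(predict, X, y, col):
    """RMSE increase after permuting one feature column
    (a reversal stands in for a random shuffle, for reproducibility)."""
    def rmse(rows):
        preds = [predict(r) for r in rows]
        return (sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)) ** 0.5
    permuted = [row[:] for row in X]             # copy, leave X untouched
    vals = [row[col] for row in permuted][::-1]  # the "permutation"
    for row, v in zip(permuted, vals):
        row[col] = v
    return rmse(permuted) - rmse(X)

def model(row):                                  # toy model using only feature 0
    return 2.0 * row[0]

X = [[1.0, 9.0], [2.0, 8.0], [3.0, 7.0], [4.0, 6.0]]
y = [2.0, 4.0, 6.0, 8.0]
print(permutation_importance(model, X, y, col=1))             # 0.0 — unused feature
print(round(permutation_importance(model, X, y, col=0), 3))   # 4.472
```

Features the model ignores get zero importance; the score is model-agnostic, which is why LIME- and DALEX-style tools rely on variants of this idea.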
ML Interpretability: residuals analysis
ML interpretability: variable importance analysis
ML interpretability: marginal effects plot
Partial dependency for groups of categorical predictors