TELKOMNIKA Indonesian Journal of Electrical Engineering, Vol. 10, No. 6, October 2012, pp. 1519~1525
Loan Default Prediction on Large Imbalanced Data Using Random Forests

Lifeng Zhou, Hong Wang*
School of Mathematics and Statistics, Central South University, No. 22, Shaoshan South, Changsha, Hunan 410075, China
*corresponding author, e-mail: [email protected]
Abstract
In this paper, we propose an improved random forest algorithm that allocates a weight to each decision tree in the forest when their votes are aggregated for prediction; the weights are easily calculated from the out-of-bag errors obtained during training. Experimental results show that the proposed algorithm beats the original random forest and other popular classification algorithms, such as SVM, KNN and C4.5, in terms of both balanced and overall accuracy metrics. Experiments also show that parallelizing random forests can greatly improve their efficiency during the learning process.

Keywords: loan default prediction, random forests, imbalanced data, parallel computing

Copyright © 2012 Universitas Ahmad Dahlan. All rights reserved.

Received July 5, 2012; Revised September 30, 2012; Accepted October 7, 2012
1. Introduction
Predicting whether a borrower will default on his or her loan is of vital importance to bankers, as default prediction accuracy has a great impact on their profitability [1]. From the machine learning perspective, loan default prediction (also called credit scoring [2]) can be viewed as a binary classification problem. Previous approaches focus on prediction using ensemble methods [3] or fuzzy systems [4]. Neural networks, successfully applied in various fields [5, 6], have also found application in the default prediction problem [7]. Most of these approaches assume that the "good" and "bad" cases in the data sets are equally distributed [3] or simply ignore the class distribution [4, 7]. However, most available loan default data sets are highly skewed, i.e., there are many cases in one class and few in the other. For example, the data drawn from the "Give Me Some Credit" competition on Kaggle [8] and used in this paper are also highly imbalanced. This imbalance causes many standard learning algorithms to perform poorly on the minority class [9].

Moreover, the data sets on Kaggle.com contain hundreds of thousands of records, and such large data sets are no rarity in today's data-intensive age. Large amounts of data can make learning and prediction take hours or even days to complete, while wasting the computing power provided by the multi-core CPUs of desktop computers or the multi-CPU systems of servers and clusters. To speed up the computing process, some form of parallelism is needed, such as the one used in [10].

To date, there is no specialized algorithm that copes with both the imbalance problem and the large-data problem in loan default prediction. In this paper, we make loan default predictions on imbalanced data sets with an improved random forests approach that employs weighted majority voting in tree aggregation. The weight assigned to each tree in the forest is based on its OOB (out-of-bag) error, which is easy to obtain during forest construction. Also, since random forests are parallelizable by nature, we employ the foreach package of the statistical software R [11] to build the forest in parallel and greatly reduce the learning time; a minimal sketch of this idea is given below.

The rest of the paper is organized as follows. Section 2 gives a short introduction to random forests, discusses two kinds of random forests that deal with the problem in question, and then proposes an improved random forest. Section 3 presents the experimental results, and Section 4 concludes the paper.
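As an illustration of the parallel construction mentioned above, the following is a minimal R sketch, not the authors' exact script. It assumes the randomForest, foreach and doParallel packages, and a training data frame train with a binary factor response default (hypothetical names); several sub-forests are grown in parallel and merged with randomForest's combine() function.

library(randomForest)
library(foreach)
library(doParallel)

cl <- makeCluster(4)          # illustrative choice: use 4 CPU cores
registerDoParallel(cl)

# Grow 500 trees as 4 sub-forests of 125 trees each, one per worker,
# then merge them into a single forest with randomForest's combine().
rf <- foreach(ntree = rep(125, 4), .combine = randomForest::combine,
              .packages = "randomForest") %dopar%
  randomForest(default ~ ., data = train, ntree = ntree)

stopCluster(cl)

Note that combine() does not recompute OOB error estimates for the merged forest, so any OOB-based quantities have to be taken from the individual sub-forests.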
2. Research Method
Random forests, proposed in [12], are a learning ensemble consisting of a bagging [13] of unpruned decision tree learners with a randomized selection of features at each split. The random forest algorithm outperforms many state-of-the-art classifiers such as SVM and ANN, and does not tend to overfit [12]. The original random forest algorithm (for classification and regression) is as follows:
Algorithm 1: Original Random Forest
Input: T = the given training set; Ntree = the number of trees to be built; Mtry = the number of predictors chosen for splitting at each node; Sampsize = the size of the bootstrapped sample.
Process:
for (itree = 1; itree <= Ntree; itree++) {
    draw a bootstrapped sample T_itree of size Sampsize from T;
    grow an unpruned decision tree on T_itree, at each node choosing the best split among Mtry randomly selected predictors;
}
Output: an ensemble of Ntree trees. For classification, a new case is predicted by the majority vote of the trees; for regression, by the average of their outputs.
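To make the proposed weighted aggregation concrete, the following is a minimal R sketch under the assumption that each tree's weight is simply its OOB accuracy (1 minus its OOB error); the paper's exact weighting formula may differ, and train_x, train_y and test_x are hypothetical objects.

library(randomForest)

# Fit the forest, keeping the in-bag matrix so that per-tree OOB errors
# can be computed afterwards.
rf <- randomForest(x = train_x, y = train_y, ntree = 500, keep.inbag = TRUE)

# Per-tree class votes on the training set (an n x ntree character matrix).
train_votes <- predict(rf, train_x, predict.all = TRUE)$individual

# OOB error of tree i: its misclassification rate on the rows left out of
# its bootstrap sample (rf$inbag[, i] == 0).
oob_err <- sapply(seq_len(rf$ntree), function(i) {
  oob <- rf$inbag[, i] == 0
  mean(train_votes[oob, i] != as.character(train_y[oob]))
})
w <- 1 - oob_err    # assumed weight: the tree's OOB accuracy

# Weighted majority vote over the trees' predictions on new data: each test
# case is assigned the class with the largest total weight among the trees
# voting for it.
test_votes <- predict(rf, test_x, predict.all = TRUE)$individual
classes <- levels(train_y)
pred <- apply(test_votes, 1, function(v)
  classes[which.max(tapply(w, factor(v, levels = classes), sum))])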