A Comparison of Data Mining Methods in Microfinance

Jia Wu, Sunil Vadera, Karl Dayson

Diane Burridge, Ian Clough

University of Salford, 43 Crescent, Salford, United Kingdom, [email protected]

East Lancashire Moneyline (IPS) Ltd, Lancashire, United Kingdom

Sub-prime lenders include credit unions, local authorities and moneylines. They play an important role in filling the gap between mainstream lenders and loan sharks by providing affordable loans to low income borrowers. Because of the higher risk, their interest rates are somewhat higher than those of mainstream lenders, but a great deal lower than those charged by loan sharks.

In order to assess the risk of loans, mainstream lenders have developed their own loan risk assessment systems, more commonly known as credit scoring systems, which aim to assess the probability that a customer will meet their financial commitments [1]. However, these systems are normally 'in house'. There are also credit reference companies, such as Experian, that provide personal credit information for a fee, but Experian and similar services are not well suited to sub-prime lenders. The marginal group of clients that sub-prime lenders serve are normally unable to obtain credit card services from mainstream lenders, so only a partial or non-existent credit payment history is available for credit scoring, which directly affects the accuracy of the resulting scores. In order to reduce lending risk, sub-prime lenders therefore need to develop their own loan risk assessment systems, yet no such product is available on the market for them to purchase.

East Lancashire Moneyline (ELM) is a not-for-profit sub-prime lender in the UK. ELM is currently the company partner in a KTP (Knowledge Transfer Partnership) project, with the University of Salford as the knowledge base. The KTP project focuses on applying data mining technology to develop a loan risk assessment system for ELM.

The remainder of this paper is organised as follows. Section II presents the problem and the background. Section III describes the potential solutions to the problems. Section IV reports the results of applying different data mining methods to the collected sub-prime loan data as well as to the German Credit dataset, and Section V presents the conclusions.

Abstract— Microfinance provides financial services to clients on low incomes or with poor credit records. The 'credit crunch' has led to mainstream lenders tightening their lending policies, resulting in increased financial exclusion. Loan sharks then become an alternative and easy way of borrowing money, but their extremely high interest rates push low income people into deeper poverty. Sub-prime lenders play an important role in providing affordable loans to fill the gap between loan sharks and mainstream lenders. All the mainstream lenders have their own loan risk assessment systems, but these systems are either 'in house' or not applicable to this marginal group of clients. Due to the varying characteristics of this marginal group of clients, sub-prime lenders need to develop their own loan risk assessment systems. Although data mining methods have the potential for developing such a risk assessment system, the relative performance of the different data mining methods on such data is not known. Hence, this paper focuses on comparing different data mining methods when applied to loan data for sub-prime lenders.

Keywords-Microfinance; Data mining; Exemplar based model; Bayesian network; Decision tree; Clustering

I. INTRODUCTION

Since 2008, the global economy has been hit by the "credit crunch". Almost all mainstream lenders have tightened their lending policies and axed riskier lending. It is now much harder for people, especially low income borrowers or those with poor credit ratings, to access low interest credit. Because of the potential risk, mainstream lenders do not provide personal loans to people with no job or with a poor credit history. Due to the recession, unemployment is high and people need personal loans all the more to help them get through it. When they have been turned down by mainstream lenders, loan sharks become their "easy" but dangerous option. Compared with the low interest rates offered by mainstream lenders, the rates charged by loan sharks are extremely high; the interest rate can be up to 2,500,000%. Borrowers struggle even to pay back the interest they owe, long before they start repaying the capital. Loan sharks also sometimes use threats and violence against people who cannot pay back their loans. In order to reduce financial exclusion and prevent people from borrowing from loan sharks, sub-prime lenders have been set up, with help from the UK government, to provide personal loans for low income borrowers.

II. THE PROBLEM AND THE BACKGROUND

Developing and applying a loan risk assessment system for a loan company involves not only technical issues but also management issues.


The technical issues include the general problems that exist for any loan risk assessment model. The management issues concern embedding and integrating the developed system into the business processes and working practices of sub-prime lenders. The following sections discuss the main challenges of this project from these two aspects.

A. Technical problems
This research focuses on applying data mining technology to develop a loan risk assessment system. Data mining is the process of extracting or discovering hidden patterns or rules from data. It transforms data into information or knowledge that can be easily understood, especially when the amount of data is large. Data is the foundation of data mining: its quality and quantity directly affect the result, and data quality is the key factor in most data mining projects. During knowledge discovery, data pre-processing is the essential preparation stage before the data mining process itself; it includes data cleaning, data integration, data selection and data transformation. ELM has a back-end database holding partial loan data. However, some of the data in the loan database is missing, noisy or inconsistent, and some of the data required for loan risk assessment is not recorded electronically but held on paper. One question related to data quality is therefore:
C1: How should data pre-processing technologies be applied to the raw data in this project?
ELM was established in 2002 and, as the company expands rapidly, its volume of loan data is growing fast. However, compared with mainstream lenders, ELM is a young, not-for-profit loan company and its historical loan dataset is not large. Therefore one of the challenges of this project is:
C2: Which data mining technology should be chosen when only a small amount of data is available?
Many methods have been adopted in loan risk assessment research, and every method has its strengths and weaknesses. The main purpose of developing a loan risk assessment tool is to improve the accuracy of predicting a customer's payment performance. Hence, the third challenge is:
C3: Select a loan risk assessment technology that can provide reasonable accuracy.
Many existing loan risk assessment systems do not provide a reason for the assessment they produce, and the primary measure of their performance is accuracy. For some companies accuracy is enough: the structure of the generated model and the reason why it works remain unknown or uninterpretable. However, other companies, such as ELM, are interested not only in the accuracy but also in the reasons behind the system's decisions. Instead of a 'black box' assessment, a transparent assessment is required. Not all loan risk assessment technologies are able to produce a model whose explanation can be easily understood by a human. Therefore, another requirement for this system is:
C4: Choose a loan risk assessment technology that is able to provide an explanation of the generated model.
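As a concrete illustration of the pre-processing referred to in C1, the sketch below shows how a raw loan extract might be cleaned with standard filters from the WEKA toolkit, the open source workbench used for the experiments reported later in this paper. It is a minimal sketch only: the file name, the index of the dropped identifier attribute and the choice of filters are hypothetical placeholders rather than the project's actual pipeline.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PreprocessSketch {
    public static void main(String[] args) throws Exception {
        // Load the raw extract (ARFF or CSV); "loans_raw.arff" is a placeholder name.
        Instances raw = DataSource.read("loans_raw.arff");

        // Data selection: drop a client identifier column (assumed here to be
        // attribute 1) that carries no predictive information.
        Remove remove = new Remove();
        remove.setAttributeIndices("1");
        remove.setInputFormat(raw);
        Instances selected = Filter.useFilter(raw, remove);

        // Data cleaning: fill missing values with the means and modes of the data.
        ReplaceMissingValues fillMissing = new ReplaceMissingValues();
        fillMissing.setInputFormat(selected);
        Instances cleaned = Filter.useFilter(selected, fillMissing);

        // Treat the last attribute as the risk outcome (an assumption of this sketch).
        cleaned.setClassIndex(cleaned.numAttributes() - 1);
        System.out.println("Instances after pre-processing: " + cleaned.numInstances());
    }
}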

B. Management problems
The objective of this project is not only to develop a loan risk assessment system but also to integrate the system into the company's daily management processes. Management integration of the system is very important for the successful completion of the whole project. Technical problems can normally be solved by the project team members, whereas management integration involves not only the project team but also company staff, customers and even company partners. Management integration can be divided into many sub-tasks, such as system integration, staff training, system administration, user feedback collection and system maintenance. The challenge in management integration is therefore:
C5: How can the developed loan risk assessment system be integrated well into the management process?
The above raises five main challenges for the loan risk assessment project. Three of them (C2, C3, C4) concern selecting the main technology for loan risk assessment, C1 relates to data quality and C5 concerns the management issues. The following section introduces the potential data mining methods for the loan risk assessment project.

III. THE POTENTIAL SOLUTIONS FOR THE PROBLEMS

There are a number of loan risk assessment technologies available from data mining and statistics. In statistics, credit scoring is a widely known, pragmatic approach to the credit granting problem: 'If it works, use it!' [8]. That is, the main concern is the result, not necessarily an explanation or reason for the result. Because of challenge C4, described in Section II, this project requires results accompanied by an explanation. Hence, this research focuses on data mining methods, some of which are claimed to be able to provide explanations or patterns as well as results. Different data mining technologies generate different types of patterns. For example, FReBE (Family Resemblance Exemplar Based Model), an exemplar based model, uses exemplars to describe data; Naïve Bayes represents data by a simple Bayesian network; decision tree algorithms produce trees; and clustering algorithms generate clusters to represent the data. The following subsections give brief introductions to decision trees, clustering, Naïve Bayes and FReBE, which are then compared in Section IV.

A. Decision tree
Decision tree induction [5] is a supervised learning method which uses training examples that include the outcome, or class, and produces a tree as its model. In the tree structure, leaves represent decisions and branches represent the feature tests that lead to those decisions. A decision tree is an arrangement of tests that prescribes an appropriate test at every step of an analysis. It classifies an instance by starting at the root node of the tree, testing the feature specified by that node, and then moving down the branch corresponding to the feature's value. After a sequence of such tests, the process ends when a leaf node is reached.


One of the most popular decision tree algorithms is ID3 (Iterative Dichotomiser 3) [5], which selects attributes based on information entropy. C4.5 [6][7] is an improvement on ID3 that can handle both continuous and discrete attribute values. It can also handle missing attribute values and supports pruning, a process of removing sub-trees that overfit the data.
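In standard notation, ID3 chooses the attribute with the highest information gain, computed from the entropy of the class distribution, while C4.5 normalises this gain by the split information (the gain ratio):

H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad
Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v), \qquad
GainRatio(S, A) = \frac{Gain(S, A)}{-\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}

where p_i is the proportion of examples in S belonging to class i and S_v is the subset of S for which attribute A takes the value v.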

In order to produce the results for the decision tree induction algorithm, the clustering methods and Naïve Bayes, the data mining tool WEKA is used. WEKA is a collection of machine learning algorithms for solving real-world data mining problems, and it allows the different data mining methods to be applied to the data to extract patterns. Random sub-sampling is adopted as the experimental methodology for assessing the accuracy of the data mining methods. The experiments involve training the models on 70% of a dataset and testing on the remaining 30%, with twenty-five random trials carried out to obtain the average accuracy. The algorithms used from WEKA are: (i) J48 as the decision tree learning algorithm; (ii) EM and K-means [4] as the clustering methods; and (iii) the Naïve Bayes implementation. Experiments are carried out on two different datasets: (i) a small loan dataset from ELM, and (ii) the benchmark German Credit dataset from the UCI machine learning repository. The following two sections present the results and evaluation on these two datasets.
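The following sketch illustrates the random sub-sampling protocol just described using WEKA's Java API. It is a sketch under stated assumptions rather than the scripts actually used for the experiments: the dataset file name is a placeholder, the class is assumed to be the last attribute, and only J48 is shown (NaiveBayes or any other WEKA classifier could be substituted on the marked line).

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class SubSamplingSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("elm_loans.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);          // assume outcome is last

        int trials = 25;                                       // as in the text
        double sum = 0.0;
        for (int t = 0; t < trials; t++) {
            Instances shuffled = new Instances(data);
            shuffled.randomize(new Random(t));                 // fresh random split per trial
            int trainSize = (int) Math.round(shuffled.numInstances() * 0.7);
            Instances train = new Instances(shuffled, 0, trainSize);
            Instances test  = new Instances(shuffled, trainSize,
                                            shuffled.numInstances() - trainSize);

            Classifier model = new J48();                      // swap in NaiveBayes, etc.
            model.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(model, test);
            sum += eval.pctCorrect();
        }
        System.out.printf("Mean accuracy over %d trials: %.2f%%%n", trials, sum / trials);
    }
}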

B. Clustering
Clustering is an unsupervised learning approach: no category or class is predefined in the dataset. It partitions a collection of data into natural groups (clusters), with the objective that objects within a cluster are mutually similar but very dissimilar to objects in other clusters. There are many different clustering algorithms. COBWEB [2] is a hierarchical, model based clustering method; it learns cases by considering the effect that creating, merging and partitioning clusters has on an evaluation function, and aims to optimise its value. Simple k-means [4] is another clustering method. It partitions the dataset into k clusters so as to achieve high intra-cluster similarity and low inter-cluster similarity, repeatedly assigning each object to the nearest cluster and recomputing each cluster's mean.
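In standard notation, k-means seeks cluster assignments C_1, ..., C_k and centroids that minimise the within-cluster sum of squared distances:

\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2, \qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x

which the algorithm approximates by alternately assigning each object to its nearest centroid and recomputing the cluster means.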

A. ELM Loan Dataset
The ELM data consists of 109 loan cases, each with 19 attributes and an outcome indicating the risk level (high risk, medium risk or low risk). The data collected includes 45 low risk, 30 medium risk and 34 high risk loans. The accuracy obtained with the five data mining methods is shown in Table I below:

TABLE I.  RESULTS OF ELM DATASET

Data Mining Method      Accuracy
J48                     35.71% ±7.63%
K-means                 38.76% ±9.62%
EM                      37.39% ±9.37%
Naïve Bayes             36.17% ±10.06%
FReBE                   38.0% ±9%

C. Naïve Bayes Classifier
The Naïve Bayes classifier is a simple but popular Bayesian classification model. It is a probabilistic classifier that applies Bayes' theorem under the assumption that the variables are independent given the class, thereby reducing a multidimensional density estimation task to a number of one dimensional ones.
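In symbols, the conditional independence assumption lets the class posterior factorise, and a new instance is assigned the class with the highest posterior:

P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C), \qquad
\hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(x_i \mid c)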


The results in Table I show that the accuracies are not high, and no method is significantly better than the others, especially given the high standard deviations. This is probably because the ELM dataset is small, but it does confirm that learning from a small dataset in this domain is challenging. The following section shows the results on a larger dataset.

D. FReBE
FReBE [10] is an exemplar based model with foundations in Bayesian networks. FReBE learns knowledge in a way similar to human learning: instead of remembering every case, a human retains knowledge in the form of typical exemplars. FReBE identifies exemplars for a concept using a measure of family resemblance, a notion originally proposed by the philosopher Wittgenstein [9] and utilised in the PEBM model [11].

All four data mining methods have their advantages. However, considering the application area, accuracy becomes an important criterion for method selection. The experiments apply all of the data mining technologies to the real loan cases and to a benchmark credit dataset. The following section presents the experimental methodology and results.

IV. THE EXPERIMENTAL EVALUATION

B. German Credit Dataset
The German Credit dataset contains 1000 examples, each described by 20 attributes and labelled with one of two classes, 'Good' or 'Bad'. There are 700 good cases and 300 bad cases. The accuracy obtained with the five data mining methods is shown in Table II below:

TABLE II.  RESULTS OF GERMAN CREDIT DATASET

Data Mining Method      Accuracy
J48                     71.44% ±1.77%
K-means                 57.47% ±4.31%
EM                      38.77% ±9.26%
Naïve Bayes             75.09% ±2.19%
FReBE                   74.10% ±3.00%

The results in Table II show that J48, Naïve Bayes and FReBE perform better than the clustering methods, and that the accuracy of FReBE and Naïve Bayes is slightly higher than that of J48 when the standard deviations are taken into account. Comparing the five methods, the main differences lie in the knowledge representation and in whether the categories are defined in advance.

Decision tree induction requires predefined categories, and defining them is not as straightforward as it might appear. For this initial trial, the categories high risk, medium risk and low risk were used, based on the number of missed payments, following discussion with the loan managers at ELM. As more data is collected and accuracy improves, more categories may become relevant and further refinement may be necessary. Clustering, in contrast, divides the data into subsets without knowing the categories; it lets the computer analyse and discover the links among the data and is a more natural way to group it. FReBE learns its structure, in the form of a Bayesian network that represents exemplars, from the data; within each category, however, FReBE may be considered to perform unsupervised learning, since exemplars are produced by clustering according to the family resemblance principle. FReBE and Naïve Bayes produce their results in the form of Bayesian networks, whereas J48 generates decision trees which can be transformed into sets of rules or visualised directly. Compared with the patterns produced by the other data mining methods, a set of rules or a decision tree is easier to understand for staff who do not have a technical background. For the purposes of management and risk control, a company such as ELM can bring its own expertise into the decision making process by monitoring, learning from or improving the rules. This makes the whole loan risk assessment process more flexible and transparent, which is key to developing confidence in any loan decisions that are taken.
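As an illustration of this kind of transparency, the short sketch below shows how a J48 model trained with WEKA can be printed as a readable tree that loan staff could inspect and discuss. It is a hypothetical sketch, not the system deployed at ELM: the dataset file name is a placeholder and the J48 options shown are simply WEKA's defaults.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectTreeSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("elm_loans.arff");    // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setOptions(new String[] { "-C", "0.25", "-M", "2" });  // WEKA's default J48 settings
        tree.buildClassifier(data);

        // toString() renders the induced tree as nested if-then tests,
        // which can be reviewed and refined with the loan managers.
        System.out.println(tree);
    }
}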


V. CONCLUSIONS
This paper introduces a case study of applying different data mining technologies to the development of a loan risk assessment system for a sub-prime lender. The challenges and the application background are stated, the potential approaches to the problems are examined, and experiments on two datasets are carried out. The results from the different algorithms are analysed and, based on the discussion, decision trees appear to be the most appropriate data mining technology for developing a loan risk assessment system for sub-prime lenders. Work on collecting and cleaning the data is continuing, and future work will include using cost-sensitive data mining methods for this challenging task.


REFERENCES
[1] M. Berlin and L. J. Mester, "On the Profitability and Cost of Relationship Lending", Working Paper 97-3, Federal Reserve Bank of Philadelphia, 1997.
[2] D. H. Fisher, "Knowledge Acquisition via Incremental Conceptual Clustering", Machine Learning, vol. 2, no. 2, 1987, pp. 139-172.
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.
[4] J. Hartigan and M. Wong, "A K-means Clustering Algorithm", Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, 1979, pp. 100-108.
[5] J. R. Quinlan, "Induction of Decision Trees", Machine Learning, vol. 1, no. 1, 1986, pp. 81-106.
[6] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, 1993.
[7] J. R. Quinlan, "Bagging, Boosting, and C4.5", AAAI/IAAI, vol. 1, 1996, pp. 725-730.
[8] L. C. Thomas, "A Survey of Credit and Behavioural Scoring: Forecasting Financial Risk of Lending to Consumers", International Journal of Forecasting, vol. 16, 2000, pp. 149-172.
[9] L. Wittgenstein, Philosophical Investigations, Blackwell Publishers, 1973.
[10] J. Wu, A Study and Development of Bayesian Exemplar Based Models, PhD thesis, University of Salford, 2008.
[11] S. Vadera, A. Rodriguez, E. Sucar and J. Wu, "Using Wittgenstein's Family Resemblance Principle to Learn Exemplars", Foundations of Science, vol. 13, no. 1, 2008, pp. 67-74.


