A Hybrid Credit Scoring Model Based on Genetic Programming and Support Vector Machines

Defu Zhang, Mhand Hifi, Qingshan Chen, Weiguo Ye

Department of Computer Science, Xiamen University, Xiamen, 361005, China
Université de Picardie Jules Verne, Amiens, 80039, France
Department of Management Science, Xiamen University, Xiamen, 361005, China
[email protected],
[email protected] second is how to adjust the credit restrictions or the marketing effort directed at a current customer (behavioral scoring) [1]. This paper is trying to deal with the first decision making problem. During this process, data mining techniques are adopted to evaluate the credit risk of a potential applicant on the basis of bank's historical data. The objective of credit scoring model is to classify credit applicants to either a “good credit” group that is likely to repay financial obligation or a “bad credit” group who has high possibility of defaulting on the financial obligation. Practitioners and researchers have proposed a variety of traditional statistical methods and artificial intelligence methods for credit scoring. Roughly, we can divide these methods into the induction-based (also called IF-THEN rules) methods [2, 3, 4] (e.g. rough sets, classification and regression tree (CART), decision tree (C4.5/5.0), genetic programming (GP)) and function-based methods [5, 6, 7] (e.g. logistic regression (LR), support vector machines (SVM), artificial neural network (ANN)). For induction-based methods, the main advantage is that they can provide the intelligence classification rules for decisionmakers. These intelligence classification rules can help decision-makers to understand the contents of the data sets and make the correct decision. However, the main problem of induction-based methods is the ability of forecasting. It is clear that if a newly entered instance does not match any rule or match more than one rule, it cannot be determined to which class it belongs. For function-based methods, it has very strong capability of forecasting. However, the main problem of these methods is commonly described as opaque structures because they generate complex mathematical internal works which are hard for human to interpret. Recently, some researchers obtained some promising results by considering the above problems [8, 9, 10, 11]. In this paper, we developed a hybrid credit scoring model (HCSM) to integrate the advantage of the
Abstract

Credit scoring has attracted more and more attention, as the credit industry can benefit from reducing potential risks. Hence, many useful techniques, known as credit scoring models, have been developed by banks and researchers to solve the problems that arise during the evaluation process. In this paper, a hybrid credit scoring model (HCSM) is developed to deal with the credit scoring problem by incorporating the advantages of genetic programming and support vector machines. Two credit data sets from the UCI database are selected as the experimental data to demonstrate the classification accuracy of HCSM. Compared with support vector machines, genetic programming, decision tree classifiers, logistic regression, and a back-propagation neural network, HCSM obtains better classification accuracy.
1. Introduction

The credit card industry has been growing rapidly. However, this development also comes with problems. The frequency of credit fraud is rising, and the subsequent losses incurred by banks are increasing; worse, the discovery of this type of fraud is often delayed by several weeks. Therefore, it is important for banks to accurately evaluate the credit default risk of applicants in advance. A good credit scoring model helps banks make the right decisions and avoid potential risk. Because of these great benefits, improving the accuracy of credit scoring models has become increasingly significant for banks and society as a whole.

Generally speaking, there are two types of models that assist with the decision-making process. The first decision that creditors face is whether to grant credit to a new applicant (credit scoring), and the second is how to adjust the credit restrictions or the marketing effort directed at a current customer (behavioral scoring) [1]. This paper deals with the first decision-making problem. During this process, data mining techniques are adopted to evaluate the credit risk of a potential applicant on the basis of the bank's historical data. The objective of a credit scoring model is to classify credit applicants into either a "good credit" group that is likely to repay its financial obligations or a "bad credit" group that has a high probability of defaulting on them.

Practitioners and researchers have proposed a variety of traditional statistical methods and artificial intelligence methods for credit scoring. Roughly, these methods can be divided into induction-based (also called IF-THEN rule) methods [2, 3, 4] (e.g. rough sets, classification and regression trees (CART), decision trees (C4.5/5.0), genetic programming (GP)) and function-based methods [5, 6, 7] (e.g. logistic regression (LR), support vector machines (SVM), artificial neural networks (ANN)). The main advantage of induction-based methods is that they provide intelligible classification rules for decision-makers. These rules help decision-makers understand the contents of the data sets and make correct decisions. However, the main problem of induction-based methods is their limited forecasting ability: if a newly entered instance matches no rule, or matches more than one rule, the class to which it belongs cannot be determined. Function-based methods, in contrast, have very strong forecasting capability. However, they are commonly described as opaque structures, because they rely on complex mathematical internal workings that are hard for humans to interpret. Recently, some researchers have obtained promising results by addressing the above problems [8, 9, 10, 11]. In this paper, we develop a hybrid credit scoring model (HCSM) that integrates the advantages of the
function-based and the induction-based methods. First, we use GP to derive IF-THEN rules. Then, the training data that match none of those IF-THEN rules, or match more than one rule, are used to train an SVM and build the discriminant function. For conventional statistical classification techniques, an underlying probability model must be assumed in order to calculate the posterior probability upon which the classification decision is made. GP and SVM can perform the classification task without this limitation. In this paper, two real credit scoring data sets are used to demonstrate HCSM and compare it with other conventional models. The experimental results show that HCSM is very efficient.

2. Basic concepts of GP and SVM

GP was first proposed by Koza [12]. The basic idea of GP is the evolution of a population of "programs", i.e., candidate solutions to the specific problem at hand. A program (an individual of the population) is usually represented as a tree, where the internal nodes are drawn from a function set and the leaf nodes from a terminal set. Both the function set and the terminal set must contain symbols appropriate for the target problem. The function set can contain arithmetic operators, logic operators, mathematical functions, etc., whereas the terminal set can contain variables as well as constants. For example, the expression ((x > 2) AND (y = 1)) can be represented as the GP tree shown in Fig.1.

Fig.1. The representation of a GP tree.
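To make the tree representation concrete, the following minimal Python sketch (our illustration, not part of the original system) encodes the expression of Fig.1, ((x > 2) AND (y = 1)), as a tree of function and terminal nodes and evaluates it against an instance:

```python
# Minimal sketch of a GP expression tree (illustrative only).
# Internal nodes hold function-set symbols (here AND, OR, NOT, >, =);
# leaves hold terminals (attribute names or constants).

class Node:
    def __init__(self, symbol, children=()):
        self.symbol = symbol            # function symbol or terminal
        self.children = list(children)  # empty for terminal nodes

    def evaluate(self, instance):
        # Terminal: attribute lookup, or the constant itself.
        if not self.children:
            return instance.get(self.symbol, self.symbol)
        args = [c.evaluate(instance) for c in self.children]
        if self.symbol == "AND":
            return args[0] and args[1]
        if self.symbol == "OR":
            return args[0] or args[1]
        if self.symbol == "NOT":
            return not args[0]
        if self.symbol == ">":
            return args[0] > args[1]
        if self.symbol == "=":
            return args[0] == args[1]
        raise ValueError("unknown symbol: %s" % self.symbol)

# The tree of Fig.1: ((x > 2) AND (y = 1)).
tree = Node("AND", [
    Node(">", [Node("x"), Node(2)]),
    Node("=", [Node("y"), Node(1)]),
])
print(tree.evaluate({"x": 5, "y": 1}))  # True
```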
The major steps of genetic programming can be formalized as follows: generate at random an initial population of individuals representing potential solutions to the classification problem for the class at hand; evaluate each individual on the training set by means of a fitness function; and, on the basis of Darwin's evolutionary theory, apply genetic operators (e.g. crossover, copy, mutation) to produce new individuals, until an acceptable classification individual is found or the specified maximum number of generations has been reached.

Next, we introduce the three main operators: crossover, mutation, and copy. In GP, crossover operates on two individuals and produces two child individuals. A random node is selected within each individual, and then the resultant sub-trees are swapped, generating two new individuals. These new individuals become part of the next generation of individuals to be evaluated. GP uses the mutation operator to avoid falling into local optima. The mutation operator can be applied to either a function node or a terminal node: a node in a sub-tree is selected at random and replaced with a new, randomly created sub-tree. Finally, the copy operator chooses an individual in the current population and copies it without changes into the new population. For the basic concepts of SVM for typical two-class classification problems, the reader can refer to References [13, 14, 15].
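As an illustration of how crossover and mutation act on tree individuals, the following sketch (our own, built on the hypothetical Node class above; the same-type swap constraint described in Section 3 is omitted for brevity) swaps or replaces randomly chosen sub-trees:

```python
import copy
import random

def all_nodes(tree):
    """Collect every node in the tree (pre-order)."""
    nodes = [tree]
    for child in tree.children:
        nodes.extend(all_nodes(child))
    return nodes

def crossover(parent_a, parent_b):
    """Subtree crossover: pick a random node in each parent and
    swap the sub-trees rooted there, yielding two children."""
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    node_a = random.choice(all_nodes(child_a))
    node_b = random.choice(all_nodes(child_b))
    # Swapping node contents swaps the sub-trees in place.
    node_a.symbol, node_b.symbol = node_b.symbol, node_a.symbol
    node_a.children, node_b.children = node_b.children, node_a.children
    return child_a, child_b

def mutate(parent, random_subtree):
    """Subtree mutation: replace a randomly selected node with a
    newly created random sub-tree (generator supplied by caller)."""
    child = copy.deepcopy(parent)
    target = random.choice(all_nodes(child))
    new_sub = random_subtree()
    target.symbol, target.children = new_sub.symbol, new_sub.children
    return child
```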
3. Hybrid credit scoring model (HCSM)

In this section, we describe the procedures of HCSM. In the first stage of HCSM, GP is employed to derive intelligible classification rules for the decision-maker. Reference [16] proposed a framework for discovering comprehensible classification rules, and the GP in this paper is based on that framework. To fit our model, we modify the GP as follows. First, discretization of continuous attributes is applied before GP so that classification rules can be obtained efficiently. Many studies show that induction tasks benefit from discretization: rules with discrete values are normally shorter and more understandable, and discretization can lead to improved predictive accuracy. In this paper, we select the Boolean reasoning algorithm [17]. Second, a maximum GP-tree depth of six is enforced to ensure that simple rules are obtained. In addition, the function set consists only of the logical connectives {AND, OR, NOT}, the relational operators {≤, =, ≥}, and the conditional operator IF…THEN…; the terminal set simply comprises the predicting attributes Ai and values in the domain of Ai. Finally, the measurement of fitness is a rather nebulous subject, since it is highly problem-dependent. In this paper, we consider two factors: one is the classification accuracy, and the other is the misclassification cost. There are two types of misclassification in a credit scoring model: Type I errors, where a customer with good credit is misclassified as one with bad credit, and Type II errors, where a customer with bad credit is misclassified as one with good credit. Research shows that in commercial bank risk management the cost of Type II errors is higher than that of Type I errors [18]. It is therefore necessary to account for the different misclassification costs and make them as low as possible at the same accuracy. So, the fitness function of GP can be described as:
$$ff_i = r_a + c_1 r_{e1} + c_2 r_{e2}, \qquad -1 < c_1, c_2 < 0$$
where $ff_i$ denotes the fitness of the $i$th individual, $r_a$ denotes the predictive accuracy, $r_{e1}$ denotes the ratio of Type I errors, $r_{e2}$ denotes the ratio of Type II errors, $c_1$ denotes the cost of Type I errors, and $c_2$ denotes the cost of Type II errors.
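As a concrete reading of this fitness function, the sketch below (our illustration; the default cost constants follow the settings later reported in Table 1, $c_1 = -0.2$ and $c_2 = -0.01$, and we assume a rule evaluating to true predicts "good credit") computes $ff_i$ for a candidate rule on the training set:

```python
def fitness(rule, instances, labels, c1=-0.2, c2=-0.01):
    """Fitness ff_i = r_a + c1*r_e1 + c2*r_e2 with -1 < c1, c2 < 0 (sketch).

    labels: 1 = good credit, 0 = bad credit.
    Type I error:  good customer classified as bad.
    Type II error: bad customer classified as good.
    """
    n = len(instances)
    correct = type1 = type2 = 0
    for x, y in zip(instances, labels):
        predicted_good = rule.evaluate(x)  # Node.evaluate from the sketch above
        if predicted_good == bool(y):
            correct += 1
        elif y == 1:       # good misclassified as bad
            type1 += 1
        else:              # bad misclassified as good
            type2 += 1
    ra, re1, re2 = correct / n, type1 / n, type2 / n
    return ra + c1 * re1 + c2 * re2
```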
In addition, in order to avoid producing an invalid child, only operators of the same type can be swapped. We can then run GP to extract intelligible classification rules for each class in a database. For example, if we extract rules from a data set with two classes, the rules can be interpreted as:

Rule 1: IF (A1 ≥ 3 AND A5 = 1) THEN customer = good credit.
Rule 2: IF (A1 < 2 OR A3 = 2) THEN customer = bad credit.

If we want to derive another rule for a class, the training instances that already satisfy some rule are not used for training again. The resulting combination determines a class description, which is used to construct the classification rule.

In the second stage of HCSM, the remaining data set is employed to build the discriminant function and thus enhance the forecasting capability. The remaining data set is defined as the data that satisfy no rule or satisfy more than one rule. It is used to train the SVM model, and the discriminant function is built as follows:

$$f(x) = \operatorname{sgn}(g(x)) = \operatorname{sgn}\!\left(\sum_{i=1}^{n} a_i^{*} y_i k(x_i, x) + b^{*}\right)$$

Then, we can classify the remaining data set according to the following discriminant function:

$$f(x) = \begin{cases} \text{good credit}, & g(x) \ge 0 \\ \text{bad credit}, & g(x) < 0 \end{cases}$$

Now, we introduce the procedure for classifying a newly entered instance. When a newly entered instance is fed into the model, it is classified according to the following situations: first, if the instance satisfies exactly one of the rules, it is simply assigned that rule's class; second, if the instance satisfies no rule or satisfies more than one rule, it is assigned the class predicted by the discriminant function. Summarizing the statements above, the whole procedure of HCSM is shown in Fig.2. In the next section, we use two real-world data sets to demonstrate the performance of HCSM.

Fig.2. The procedures of HCSM.
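The two-stage decision just described can be summarized in a short sketch (our illustration; `rules` is a list of (tree, class) pairs from the GP stage, and `svm_predict` is a hypothetical stand-in for the trained SVM's decision function):

```python
def hcsm_classify(instance, rules, svm_predict):
    """Two-stage HCSM classification (illustrative sketch).

    rules:       list of (rule_tree, class_label) pairs from the GP stage.
    svm_predict: decision function of the SVM trained on the remaining
                 data; returns "good credit" or "bad credit".
    """
    matches = [label for tree, label in rules if tree.evaluate(instance)]
    if len(matches) == 1:
        # Exactly one rule fires: assign its class directly.
        return matches[0]
    # No rule fires, or more than one rule fires:
    # fall back to the SVM discriminant function.
    return svm_predict(instance)
```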
4. The experimental results

Two real-world data sets, the Australian and German credit data sets, are available from the UCI Repository of Machine Learning Databases [19] and are widely used to compare the accuracy of various classification models. The Australian credit data set consists of 307 instances of creditworthy applicants and 383 instances of non-creditworthy applicants. It contains 14 attributes: 6 continuous and 8 categorical. The German credit data set is more unbalanced: it contains 700 instances judged creditworthy and 300 judged not creditworthy, with 20 attributes, of which 7 are continuous and 13 categorical.

In this paper, HCSM is compared with GP, SVM, C4.5, LR, and a back-propagation neural network (BPN) on the two real-world data sets. The parameters of HCSM are set as follows. For the first stage, the parameters of GP are given in Table 1. For the second stage, the SVM is implemented with the C-language library LIBSVM [20], and the radial basis function (RBF) is used as the kernel function of the SVM. There are two parameters associated with RBF kernels: C and γ. Reference [21] suggested a practical guideline for SVM using grid search and cross-validation, which this study follows. After conducting the grid search on the remaining data set, we found that the optimal (C, γ) was (2^11, 2^-3). For the BPN model, several neural network configurations have been proposed [22]; 14-32-1 and 20-32-1 are selected for the Australian and German data, respectively, to obtain better results. Additionally, the learning rate and momentum are set to 0.8 and 0.2, respectively. For C4.5, we use its default settings [23]. For the GP model, the detailed design follows Reference [24]. The SAS system is used to perform the LR experiments.
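The grid search over (C, γ) can be sketched as follows. This is our illustration using scikit-learn rather than the authors' LIBSVM setup, and the exponent ranges are assumptions in the spirit of the guideline in [21]:

```python
# Illustrative grid search for the RBF-SVM parameters (C, gamma),
# following the grid-search-plus-cross-validation guideline of [21].
# scikit-learn is used here for brevity; the paper itself used LIBSVM.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C":     [2.0 ** k for k in range(-5, 16, 2)],   # assumed range
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],   # assumed range
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=5, scoring="accuracy")
# X_remaining, y_remaining: the "remaining" training instances that
# matched no rule or more than one rule (not constructed here).
# search.fit(X_remaining, y_remaining)
# print(search.best_params_)  # e.g. {"C": 2**11, "gamma": 2**-3}
```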
Table 1  GP parameter settings

Parameter                       Value
Population size                 1000
Fitness function                ff_i = r_a - 0.2 r_e1 - 0.01 r_e2
Function set                    {≤, =, ≥, AND, OR, NOT, IF…THEN…}
Terminal set                    {Attribute, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
Maximum number of generations   100
Selection                       Lexicographic parsimony pressure [25]
Crossover rate                  0.9
Mutation rate                   0.1

To provide a reliable estimate and minimize the impact of data dependency when developing scoring models, k-fold cross-validation is used to generate random partitions of the credit data sets [26]. In this procedure, a credit data set is divided into k independent groups. HCSM, SVM, GP, LR, C4.5, and BPN are trained on (k-1) groups of samples and tested on the remaining group. This procedure is repeated until each group has been used as a test set once. The overall scoring accuracy is reported as an average across all k groups. In this paper, the value of k is set to 5, forming a 5-fold cross-validation. Since the training of HCSM, GP, SVM, and BPN is a stochastic process, these models are run for 10 iterations on the same data sets, and the final result in each group is an average of the five best of the 10 iterations.

The classification accuracy rates on the Australian and German credit data sets obtained by the six methods (HCSM, SVM, GP, LR, C4.5, and BPN) are summarized in Tables 2 and 3, respectively, where AVG. denotes the average of the classification accuracy rates across the 5 groups. From Tables 2 and 3, we can observe that the classification accuracy rate of HCSM is higher than that of the other models. For the two data sets, we found that SVM, GP, BPN, and LR also perform well in this study and can be alternatives for the credit scoring model, but the C4.5 model was significantly inferior to the other five models. In addition, unlike ANN, which is only suited to large data sets [27], our proposed model can be useful for both large and small data sets. What is more, HCSM can provide interesting and useful classification rules for the decision-maker.
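The cross-validation protocol described above might be sketched as follows (our illustration; `train_and_score` is a hypothetical stand-in for fitting any of the six models on the training groups and returning its test accuracy):

```python
import random

def k_fold_accuracy(instances, labels, train_and_score, k=5):
    """Average accuracy over a k-fold cross-validation,
    following the evaluation protocol described above (sketch)."""
    indices = list(range(len(instances)))
    random.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint groups
    scores = []
    for test_fold in folds:
        test = set(test_fold)
        train_idx = [i for i in indices if i not in test]
        score = train_and_score(
            [instances[i] for i in train_idx], [labels[i] for i in train_idx],
            [instances[i] for i in test_fold], [labels[i] for i in test_fold],
        )
        scores.append(score)
    return sum(scores) / k
```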
Table 2  Classification accuracy rate (%) on Australian data

group   HCSM    SVM     GP      C4.5    LR      BPN
1       89.86   87.39   88.84   84.49   85.07   87.39
2       90.72   87.97   87.10   86.81   85.51   87.83
3       89.42   86.96   86.96   85.22   86.96   89.28
4       88.84   85.51   87.82   85.51   86.96   84.93
5       88.41   89.42   90.28   87.68   86.52   86.96
AVG.    89.45   87.45   88.20   85.94   86.20   87.28
Table 3  Classification accuracy rate (%) on German data

group   HCSM    SVM     GP      C4.5    LR      BPN
1       79.70   78.30   77.70   75.50   76.00   77.00
2       80.30   77.30   77.80   75.00   75.80   76.30
3       78.70   78.00   75.50   77.70   75.00   78.70
4       80.10   75.70   74.40   70.70   75.20   77.30
5       80.60   77.30   76.80   71.10   75.00   78.30
AVG.    79.88   77.32   76.44   74.00   75.40   77.52
5. Conclusions

Credit scoring is one of the applications that has received serious attention in financial institutions over the past decades. Modeling techniques ranging from traditional statistical analysis to artificial intelligence have been proposed to tackle credit scoring problems. In this paper, we develop HCSM for the credit scoring problem. This model incorporates the advantages of GP and SVM. We have compared our model with several conventional models, and the experimental results show that HCSM obtains better classification accuracy on credit scoring problems. In this paper, we mainly use demographic variables as independent variables; future studies may aim at collecting more important variables to improve credit scoring accuracy. Other related topics in credit data mining, such as customer behavioral scoring problems, may also be investigated in future studies.
Acknowledgments

The authors would like to thank the referees for their helpful comments and suggestions, which helped to improve this paper. This work was supported by the National Natural Science Foundation of China (Grant No. 60773126), the Natural Science Foundation of Fujian Province (Grant No. A0710023), the academician start-up fund (Grant No. X01109), and the 985 information technology fund (Grant No. 0000-X07204) of Xiamen University.
References
[1] L. C. Thomas, D. B. Edelman, J. N. Crook. Credit Scoring and its Applications. Philadelphia: SIAM, 2002.
[2] R. H. Davis, D. B. Edelman, A. J. Gammerman. Machine-learning algorithms for credit-card applications. IMA Journal of Mathematics Applied in Business and Industry 1992; 4:43-52.
[3] W. E. Henley. Statistical aspects of credit scoring. Dissertation, The Open University, Milton Keynes, UK, 1995.
[4] Tian-Shyug Lee, Chih-Chou Chiu, Yu-Chao Chou, Chi-Jie Lu. Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Computational Statistics and Data Analysis 2006; 50:1113-1130.
[5] D. West. Neural network credit scoring models. Computers & Operations Research 2000; 27:1131-1152.
[6] Z. Huang, H. Chen, C.-J. Hsu, W.-H. Chen, S. Wu. Credit rating analysis with support vector machines and neural networks: a market comparative study. Decision Support Systems 2004; 37(4):543-558.
[7] Jianping Li, Jingli Liu, Weixuan Xu, Yong Shi. Support vector machines approach to credit assessment. International Conference on Computational Science 2004; LNCS 3039:892-899.
[8] Kin Keung Lai, Lean Yu, Shouyang Wang, Ligang Zhou. Neural network metalearning for credit scoring. ICIC (1) 2006:403-408.
[9] Kin Keung Lai, Ligang Zhou, Lean Yu. A two-phase model based on SVM and conjoint analysis for credit scoring. International Conference on Computational Science (2) 2007:494-498.
[10] De-Fu Zhang, Qing-Shan Chen, Li-Jun Wei. Building behavior scoring model using genetic algorithm and support vector machines. Lecture Notes in Computer Science 2007; 4488:482-485.
[11] Qing-Shan Chen, De-Fu Zhang, Li-Jun Wei, Huo-Wang Chen. A modified genetic programming for behavior scoring problem. 2007 IEEE Symposium on Computational Intelligence and Data Mining. 2007:535-539.
[12] J. R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press, 1992.
[13] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks 2001; 12(2):181-201.
[14] Nello Cristianini, John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. New York: Cambridge University Press, 2000.
[15] Defu Zhang, Hongyi Huang, Qing-Shan Chen, Yi Jiang. A comparison study of credit scoring models. Third International Conference on Natural Computation (ICNC 2007). IEEE, 2007.
[16] I. De Falco, A. Della Cioppa, E. Tarantino. Discovering interesting classification rules with genetic programming. Applied Soft Computing 2002; 1(3):257-269.
[17] H. Son Nguyen. Discretization of real value attributes: Boolean reasoning approach. Dissertation, Warsaw University, Warsaw, 1997.
[18] E. I. Altman. Commercial bank lending: process, credit scoring, and costs of errors in lending. Journal of Financial and Quantitative Analysis 1980; 15:813-832.
[19] C. L. Blake, C. J. Merz. UCI Repository of Machine Learning Databases, 1998.
[20] C.-C. Chang, C.-J. Lin. LIBSVM: a library for support vector machines. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[21] Z. Huang, H. Chen, C.-J. Hsu, W.-H. Chen, S. Wu. Credit rating analysis with support vector machines and neural networks: a market comparative study. Decision Support Systems 2004; 37:543-558.
[22] Martin T. Hagan, Howard B. Demuth, Mark H. Beale. Neural Network Design. Beijing: China Machine Press, 2002.
[23] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[24] Chorng-Shyong Ong, Jih-Jeng Huang, Gwo-Hshiung Tzeng. Building credit scoring models using genetic programming. Expert Systems with Applications 2005; 29(1):41-47.
[25] Sean Luke, Liviu Panait. Lexicographic parsimony pressure. Proceedings of the Genetic and Evolutionary Computation Conference 2002; 829-836.
[26] Nan-Chen Hsieh. Hybrid mining approach in the design of credit scoring models. Expert Systems with Applications 2005; 28:655-665.
[27] R. Nath, B. Rajagopalan, R. Ryker. Determining the saliency of input variables in neural network classifiers. Computers & Operations Research 1997; 24(8):767-773.