European Journal of Operational Research 95 (1996) 24-37
Theory and Methodology
A comparison of neural networks and linear scoring models in the credit union environment

Vijay S. Desai a,*, Jonathan N. Crook b, George A. Overstreet, Jr. a

a McIntire School of Commerce, University of Virginia, Charlottesville, VA 22903, USA
b Department of Business Studies, University of Edinburgh, 50 George Square, Edinburgh, EH8 9JY, UK

* Corresponding author.
Received January 1995; revised August 1995
Abstract
The purpose of the present paper is to explore the ability of neural networks such as multilayer perceptrons and modular neural networks, and traditional techniques such as linear discriminant analysis and logistic regression, in building credit scoring models in the credit union environment. Also, since funding and small sample size often preclude the use of customized credit scoring models at small credit unions, we investigate the performance of generic models and compare them with customized models. Our results indicate that customized neural networks offer a very promising avenue if the measure of performance is percentage of bad loans correctly classified. However, if the measure of performance is percentage of good and bad loans correctly classified, logistic regression models are comparable to the neural networks approach. The performance of generic models was not as good as the customized models, particularly when it came to correctly classifying bad loans. Although we found significant differences in the results for the three credit unions, our modular neural network could not accommodate these differences, indicating that more innovative architectures might be necessary for building effective generic models.
Keywords: Neural networks; Banking; Credit scoring
1. Introduction
Recent issues of trade publications in the credit and banking area have published a number of articles heralding the role of artificial intelligence (AI) techniques in helping bankers make loans, develop markets, assess creditworthiness and detect fraud. For example, HNC Inc., considered a leader in neural
network technology, offers, among other things, products for credit card fraud detection (Falcon), automated mortgage underwriting (Colleague), and automated property valuation. Clients for HNC's Falcon software include AT&T Universal Card, Household Credit Services, Colonial National Bank, First USA Bank, First Data Resources, First Chicago Corp., Wells Fargo & Co., and Visa International (American Banker, 1994a,b; 1993a,b). According to Allen Jost, the director of Decision Systems for HNC Inc., "Traditional techniques cannot match the fine resolution across the entire range of account profiles
that a neural network produces. Fine resolution is essential when only one in ten thousand transactions are frauds" (Jost, 1993, p. 32). Other software companies marketing AI products in this area include Cybertek-Cogensys and Nestor Inc. Cybertek-Cogensys markets expert system software called Judgment Processor, which is used in evaluating potential borrowers for various consumer loan products, and includes customers such as Wells Fargo Bank, San Francisco, and Commonwealth Mortgage Assurance Co., Philadelphia (American Banker, 1993c,d); it plans to introduce a neural net software product for under $1000 (Brennan, 1993a, p. 52). Nestor Inc.'s customers for a neural network-based credit card fraud detection software include Mellon Bank Corp. (American Banker, 1993e). While acknowledging the success of expert systems and neural networks in mortgage lending and credit card fraud detection, reports in trade journals claim that artificial intelligence and neural networks have yet to make a breakthrough in evaluating customer credit applications (Brennan, 1993a). According to Mary A. Hopper, senior vice president of the Portfolio Products Group at Fair, Isaac and Co., a major provider of credit scoring systems, "The problem is a quality known as robustness. The model has to be valid over time and a wide range of conditions. When we tried a neural network, which looked great on paper, it collapsed - it was not predictive. We could have done better sorting the list on a single field" (Brennan, 1993b, p. 62). Typical techniques of choice for the software marketed by the big-three credit information companies (such as Gold Report, developed by Management Decision Systems (MDS) and marketed by TRW; Delphi, developed by MDS and marketed by Trans Union; Delinquency Alert System, developed by MDS and marketed by Equifax; Empirica, developed by Fair Isaac for Trans Union; and Beacon, developed by Fair Isaac for Equifax) include multivariate discriminant analysis (Gothe, 1990, p. 28) and regression (Jost, 1993, p. 27). In spite of the reports in the trade journals indicated above, papers in academic journals investigating and reporting on the claims appearing in the trade journals are not common. Perhaps this is due to the lack of data available to the academic community. Exceptions include Overstreet et al. (1992) and Overstreet and Bradley (1994) who compare custom
and generic credit scoring models for consumer loans in a credit union environment using conventional statistical methods such as regression and discriminant analysis. Unlike large US banks, credit unions' loan files were not kept in readily available computerized databases. As a result, samples were laboriously collected by analyzing individual loan files. The purpose of the present paper is to use the rich database of three credit unions in the Southeastern United States assembled by Overstreet to investigate whether the predictive power of the variables employed in the above studies can be enhanced if the statistical methods of regression and discriminant analysis are replaced by neural network models. In particular, we explore two types of neural networks, namely, feedforward neural networks with backpropagation of error, commonly referred to as multilayer perceptrons (MLP), and modular neural networks (MNN).

Neural networks can be viewed as a nonlinear regression technique. While there exist a number of nonlinear regression techniques, many of them require the nonlinear model to be specified before proceeding with the estimation of parameters; hence these techniques can be classified as model-driven approaches. In comparison, use of a neural network is a data-driven approach, i.e., a prespecification of the model is not required. The neural network 'learns' the relationships inherent in the data presented to it. This approach seems particularly attractive in solving the problem at hand because, as Allen Jost (1993, p. 30) says, "Traditional statistical model development includes time-consuming manual data review activities such as searching for non-linear relationships and detecting interactions among predictor variables".

In the credit union environment, funding and small sample size often preclude the use of credit scoring models that are custom tailored for the individual credit union. A recent Filene Research Institute survey found that only 20% of credit unions with over $25 million in assets have a credit scoring system. Among the vast number of smaller credit unions, credit scoring usage can be expected to be even lower. In light of this, generic models appear to offer a potential avenue through which credit scoring could be feasible for even the smallest institution within this industry. To this end, we train and compare the performance of customized and generic models so
that the costs and benefits of the two approaches can be evaluated. Proponents of customized models argue that credit unions enjoy very narrow fields of membership, normally employment related, and hence individual credit unions differ markedly from each other. For example, one credit union included in the present study consists of teachers whereas another one consists of telephone company employees. Thus, generic models would miss important differences among individual credit unions. If indeed there are differences in the data representing the different credit unions, one way to detect these differences and take advantage of them would be to use modular neural networks, a network architecture consisting of a multiplicity of networks competing with each other to learn various segments of the input data space. Thus, we train and compare the performance of modular neural network models with the generic and customized neural networks as well as models based upon linear discriminant analysis and logistic regression. A comparison of the neural network models with linear discriminant analysis and logistic regression suggests that, in terms of correctly classifying good and bad loans, the neural network models outperform linear discriminant analysis, but are only marginally better than logistic regression models. However, in terms of correctly classifying bad loans, the neural network models outperform both conventional techniques. Since bad loans are only a small proportion of the total loans made, this result resonates with the claim made by Allen Jost of HNC Inc. that traditional techniques cannot match the fine resolution produced by neural nets. In comparing generic and customized models we found that the customized models perform significantly better than generic models. However, the performance of modular neural networks was not significantly better than the generic neural network model, suggesting that perhaps the differences between the individual credit unions are not that important. In Section 2 and Section 3 we review conventional statistical techniques and neural network models respectively; Section 4 describes the data, the sources of data, and provides the specifics of the neural networks used; Section 5 presents the results of our experiments, and Section 6 presents the conclusions and suggestions for future research.
2. Conventional statistical techniques

Let the vector $x = (x_1, x_2, \ldots, x_p)$ represent the p predictor variables, and let y represent the binary (categorical) dependent variable. The predictor variables may be metric or nonmetric. Given that the dependent variable is binary, conventional methods typically used are as follows.
Linear Discriminant Analysis (LDA)

The objective of linear discriminant analysis is to deliver a function

$$Z = w^{\mathrm{T}}x = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p, \tag{1}$$

where the weight vector $w = (w_1, w_2, \ldots, w_p)$ is such that it maximizes the ratio

$$\frac{\left[w^{\mathrm{T}}(\mu_1 - \mu_2)\right]^2}{w^{\mathrm{T}}\Sigma\, w}. \tag{2}$$
Here $\mu_1$ and $\mu_2$ are the population mean vectors for the two categories, and $\Sigma$ is the common covariance matrix for the two populations. The intuition behind this is that, if the difference between the weighted mean vectors is maximized relative to their common covariance, the risk of misclassification would be relatively small. The linear discriminant model assumes that 1) the predictor variables are measured on an interval scale; 2) the covariance matrices of the predictor variables are equal for the two groups; and 3) the predictor variables follow a multivariate normal distribution. As will be clear from Section 5, the predictor variables used in our credit union application are not all metric, and hence the first and third assumptions are clearly violated. It is well known that when predictor variables are a mixture of discrete and continuous variables, the linear discriminant analysis function may not be optimal, and special procedures for binary variables are available (e.g., Dillon and Goldstein, 1984). However, in the case of binary variables, most evidence suggests that the linear discriminant function performs reasonably well (e.g., Gilbert, 1968; Moore, 1973).

Logistic Regression (LR)

In the case of logistic regression it is assumed that the following model holds:

$$Z = \frac{1}{1 + e^{-z}}, \tag{3}$$
where Z is the probability of the class outcome and $z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p$. The logistic regression model does not require the assumptions necessary for the linear discriminant problem. In fact, Harrell and Lee (1985) found that even when the assumptions of LDA are satisfied, LR is almost as efficient as LDA. One advantage of LDA is that ordinary least-squares procedures can be used to estimate the coefficients of the linear discriminant function, whereas maximum-likelihood methods are required for the estimation of a logistic regression model. However, given the availability of high-speed computers today, computational simplicity is no longer considered to be an adequate criterion for choosing a method. A second advantage of discriminant analysis over logistic regression is that prior probabilities and misclassification costs can be easily incorporated into the LDA model. Misclassification costs and prior probabilities can also be incorporated into neural networks (e.g., Tam and Kiang, 1992); however, since we want the results of the three methods to be comparable, the present study does not incorporate misclassification costs into the models studied. The LDA model did use the proportion of good and bad loans in the training samples as prior probabilities.
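To illustrate how these two conventional techniques are typically estimated in practice, the following Python fragment fits both classifiers on synthetic data and uses the observed class proportions as LDA priors. This is a minimal sketch, not the models estimated in this study; the library calls and variable names are ours.

```python
# Illustrative sketch: LDA and logistic regression on a synthetic sample.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical predictors x_1..x_p and a binary good (1) / bad (0) outcome y.
n, p = 1000, 6
X = rng.normal(size=(n, p))
y = (X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n) > 0).astype(int)

# LDA with priors set to the observed proportions of the two classes,
# mirroring the prior-probability treatment described in the text.
priors = np.bincount(y) / n
lda = LinearDiscriminantAnalysis(priors=priors).fit(X, y)

# Logistic regression: Z = 1 / (1 + exp(-z)), z = w_0 + w_1*x_1 + ... + w_p*x_p,
# estimated by maximum likelihood.
lr = LogisticRegression(max_iter=1000).fit(X, y)

print("LDA accuracy:", lda.score(X, y))
print("LR accuracy :", lr.score(X, y))
```

In practice the coefficients would be estimated on a training sample and the classification rates reported on a separate holdout sample, as is done in the comparisons of Section 5.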
3. Neural networks

Research on multilayer feedforward networks dates back to the pioneering work by Rosenblatt (1962) and Widrow (1962). A computationally efficient method for training multilayer feedforward networks came in the form of the backpropagation algorithm. Credit for developing backpropagation into a usable technique, as well as popularizing it, is usually given to Rumelhart et al. (1986) and the members of the Parallel Distributed Processing group, although a number of independent origins, including Parker (1982), Werbos (1974), Bryson and Ho (1969), and Robbins and Munro (1951), have been cited.

A neural network model takes an input vector X and produces an output vector O. The relationship between X and O is determined by the network architecture. There are many forms of network architectures (inspired by the neural architecture of the brain). The network generally consists of at least three layers: one input layer, one output layer, and one or more hidden layers. Fig. 1 illustrates a network with one hidden layer.

Fig. 1. Multilayer perceptron.

"If a complex function is naturally decomposable into a set of simpler functions, then a modular network has the built-in ability to discover the decomposition" (Haykin, 1994). While there exist a number of architectures consisting of a multiplicity of networks (e.g., Nilsson, 1965), the modular neural network used in the present study is based upon the architecture presented in Jacobs et al. (1991), and consists of a group of feedforward neural networks (referred to as 'local experts') competing to learn different aspects of the problem. A gating network controls the competition and learns to assign different regions of the data space to different local expert networks. Fig. 2 illustrates a network with three local experts and a gating network.
Fig. 2. Modular neural network.
3.1. Network architecture

3.1.1. Multilayer perceptron (MLP)

Each layer in an MLP consists of one or more processing elements ('neurons'). In the network we will be using, the input layer will have p processing elements, i.e., one for each predictor variable. Each processing element in the input layer sends signals $x_i$ ($i = 1, \ldots, p$) to each of the processing elements in the hidden layer. Each processing element in the hidden layer (indexed by $j = 1, \ldots, q$) produces an 'activation'

$$a_j = G\left( \sum_i w_{ij} x_i \right),$$

where $w_{ij}$ are the weights associated with the connections between the p processing elements of the input layer and the j-th processing element of the hidden layer. The processing element(s) in the output layer behave in a manner similar to the processing elements of the hidden layer to produce the output of the network

$$o_k = F\left( \sum_j G\left( \sum_i w_{ij} x_i \right) w_{jk} \right),$$

where $k = 1, \ldots, r$.

3.1.2. Modular neural network (MNN)

The gating network controls the competition and learns to assign different regions of the data space to different networks. Both the local experts and the gating network have full connections from the input layer. The gating network has as many output nodes as there are local experts, and the output values of the gating network are normalized to sum to one. Let t represent the number of local experts, $g_m$ represent the output of the m-th neuron of the gating network, $u_m$ the weighted sum of the inputs applied to the m-th output neuron of the gating network, $o_m$ the output of the m-th local expert, and $o$ denote the output of the whole MNN. Then

$$o = \sum_{m=1}^{t} g_m o_m.$$

Thus, the final output vector of the MNN is a weighted sum of the output vectors of the local experts. The outputs of the gating network, $g_1, g_2, \ldots, g_t$, are interpreted as the conditional a priori probabilities that the respective local experts generate the current training pattern. Thus, it is required that

$$\sum_{m=1}^{t} g_m = 1, \qquad 0 \le g_m \le 1.$$
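As a concrete illustration of the two architectures, the following Python sketch computes one forward pass of an MLP and the gated combination of a modular network's local experts. The weights are randomly drawn rather than trained, and the softmax form of the gating normalization is our assumption; only the sum-to-one requirement is stated above.

```python
# Illustrative sketch with hypothetical weights, not the study's trained networks.
import numpy as np

def logistic(u):
    # Sigmoid transfer function, used here for both G and F.
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
p, q, r, t = 6, 4, 1, 3         # inputs, hidden units, outputs, local experts
x = rng.normal(size=p)          # one applicant's predictor vector

# Multilayer perceptron: a_j = G(sum_i w_ij x_i),  o_k = F(sum_j a_j w_jk).
W_ih = rng.normal(size=(p, q))  # input -> hidden weights w_ij
W_ho = rng.normal(size=(q, r))  # hidden -> output weights w_jk
a = logistic(x @ W_ih)
o_mlp = logistic(a @ W_ho)

# Modular neural network: t local experts plus a gating network whose outputs
# g_1..g_t are non-negative and sum to one (softmax assumed), combined as
# o = sum_m g_m o_m.
experts = [(rng.normal(size=(p, q)), rng.normal(size=(q, r))) for _ in range(t)]
W_gate = rng.normal(size=(p, t))

u = x @ W_gate                       # weighted sums u_m at the gating outputs
g = np.exp(u) / np.exp(u).sum()      # normalized gating outputs g_m
o_experts = [logistic(logistic(x @ Wi) @ Wo) for Wi, Wo in experts]
o_mnn = sum(g_m * o_m for g_m, o_m in zip(g, o_experts))

print("MLP output:", o_mlp)
print("MNN output:", o_mnn)
```

During training, the gating weights and the local experts' weights are adjusted jointly, so that each expert comes to specialize in the region of the input space that the gating network assigns to it.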