Assessment of Model Development Techniques and Evaluation Methods for Binary Classification in the Credit Industry

Satish Nargundkar and Jennifer Lewis Priestley
Department of Management and Decision Sciences, J. Mack Robinson College of Business Administration, Georgia State University, Atlanta, GA 30303 U.S.A.

Submitted to Decision Sciences Journal on January 31, 2003. Please do not cite without express written permission from the authors.

Abstract

Binary classification models are a common feature in the credit industry today. They are used in a variety of applications for classification and prediction, including the likelihood of response, risk, bankruptcy, etc. Logistic Regression is the most commonly used technique, followed by other techniques such as Linear Discriminant Analysis and, increasingly, Artificial Neural Networks. In this paper, we examine these techniques empirically to determine whether one clearly outperforms the others. K-S Tests and Classification Rates are typically used in the industry to measure success in classification. We examine those two methods and a third, ROC Curves, to determine whether the method of evaluation has an influence on the perceived performance of the modeling technique. We found that each modeling technique has its own strengths, and a determination of the “best” depends upon the evaluation method utilized and the costs associated with misclassification. As a result, researchers, analysts, and decision makers must possess some knowledge of the context of the data before selecting a classification technique and an evaluation method.

Subject Areas: Classification Models, Model Evaluation, and Credit Scoring.


Introduction

The credit industry has experienced decades of rapid growth, characterized by the ubiquity of consumer financial products such as credit cards, mortgages, home equity loans, auto loans, interest-only loans, etc. In 1980, there was $55.1 billion in outstanding unsecured revolving credit in the U.S. In 2000, that number had risen to $633.2 billion. Over the same period, however, the number of bankruptcies filed per 1,000 U.S. households increased from 1 to 5 (Board of Governors of the Federal Reserve System, 2003).

The popularity of consumer credit products represents both a risk and an opportunity for credit lenders. In an effort to maximize the opportunity to attract, manage, and retain profitable customers and minimize the risks associated with potentially unprofitable customers, lenders have increasingly turned to credit scoring models to facilitate a holistic approach to Customer Relationship Management (CRM). In the consumer credit industry, the general model for CRM includes product planning, customer acquisition, customer management, and collections and recovery (see Figure 1). Scoring models have been used extensively to support each stage of this general CRM strategy.


Figure 1: Stages of Customer Relationship Management in Credit Lending
[The figure depicts the value-creation cycle of product planning, customer acquisition, customer management, and collections/recovery. Customer acquisition is supported by target marketing, response, and risk models; customer management by behavioral models (rollover, churn, activation); collections/recovery by collections and recovery models; and the cycle as a whole by other models such as segmentation, bankruptcy, and fraud models.]

For example, customer acquisition in credit lending is often accomplished through model-driven target marketing. Data on potential customers, which can be accessed from credit bureau files and a firm’s own databases, is used to identify those customers most likely to respond to a solicitation. In addition, risk models are utilized to support customer acquisition efforts by reducing the risks associated with customer default. Once customers are acquired, customer management strategies require careful analysis of behavior patterns. Behavioral models are developed using a customer’s transaction history to predict and classify customers who may default or attrite. Collections and recovery is a ubiquitous stage in a credit lender’s CRM strategy. In this stage, lenders develop models to prioritize their recovery efforts. Other models used by lenders to support the overall CRM strategy may involve bankruptcy prediction, fraud prediction, and market segmentation.


Not surprisingly, the central concern of modeling applications in each of these stages is increasing profitability by improving the scoring accuracy of the model in question. An improvement in accuracy of even a fraction of a percent can translate into significant savings or increased revenue. As a result, many different modeling techniques have been developed and refined. These techniques include both statistical (e.g., Linear Discriminant Analysis, Logistic Analysis) and non-statistical techniques (e.g., Decision Trees, k-Nearest Neighbor, Cluster Analysis, Neural Networks). Each technique utilizes different assumptions and may or may not achieve similar results based upon the context of the data. Because of the growing importance of accurate scoring models, an entire literature exists which is dedicated to the development and refinement of classification models. However, developing the model is really only half the problem.

Researchers and analysts allocate a great deal of time and intellectual energy to the development of classification models to support decision-making. Too often, however, insufficient attention is allocated to the tool(s) used to evaluate the model(s) in question. The result is that excellent models may be measured inappropriately, given the information available regarding classification error rates and the context of application. An analyst may develop multiple models to address a problem domain, but use an inappropriate evaluation method to select a “winning” model. The result may be the selection of a suboptimal model, which leads to less accurate classification than would have been possible with a more appropriate evaluation method. In the end, poor decisions are made because the wrong model was selected using the wrong evaluation method.


This paper addresses the dual issues of model development and evaluation. Specifically, we attempt to answer the questions, “Does model development technique impact classification accuracy?” and “How will model selection vary based upon the selected evaluation method?”

The remainder of the paper will be organized as follows. In the next section, we give a brief overview of three of the most commonly used techniques for binary classification in the credit industry. We then discuss the conceptual differences among three methods of model evaluation and provide rationales for when they should and should not be used. We illustrate model application and evaluation through an empirical example using the techniques and methods described in the paper. Finally, we conclude the paper with a discussion of our results and concepts for further research.

Common Modeling Techniques

As mentioned above, modeling techniques can be roughly segmented into two classes: statistical and non-statistical.

Borrowing from theoretical statistics, the various methods for model development can be further categorized into two general types – the diagnostic paradigm and the sampling paradigm. Simply stated, the diagnostic paradigm uses the information from the design set to estimate the conditional probability of an observation belonging to each class, based upon predictor variables. The sampling paradigm estimates the distribution of the predictor variables separately for each class and combines these with the prior probabilities of each class occurring. Then, using Bayes theorem, the posterior probabilities of belonging to each class are calculated (Hand, 2001).
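To make the sampling paradigm concrete, the following minimal sketch (in Python, with purely hypothetical priors and class-conditional densities that are not drawn from the paper) estimates the predictor's distribution separately for each class and applies Bayes theorem to obtain the posterior class probabilities:

```python
# Illustrative sketch of the sampling paradigm (all numbers hypothetical):
# estimate the predictor's distribution separately for each class, then combine
# with the class priors via Bayes theorem to obtain posterior probabilities.
from scipy.stats import norm

prior_good, prior_bad = 0.67, 0.33           # assumed prior probabilities of each class
x = 620.0                                    # a hypothetical score for one applicant

# Class-conditional densities f(x | class), here assumed normal with made-up parameters
f_x_given_good = norm.pdf(x, loc=680, scale=60)
f_x_given_bad = norm.pdf(x, loc=600, scale=70)

# Bayes theorem: P(class | x) = f(x | class) * P(class) / f(x)
evidence = f_x_given_good * prior_good + f_x_given_bad * prior_bad
posterior_good = f_x_given_good * prior_good / evidence
posterior_bad = f_x_given_bad * prior_bad / evidence

print(f"P(good | x={x:.0f}) = {posterior_good:.3f}, P(bad | x={x:.0f}) = {posterior_bad:.3f}")
```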

The first technique we utilized for the empirical analysis in this paper, linear discriminant analysis (LDA), is one of the earliest formal modeling techniques based on the sampling paradigm. LDA has its origins in the discrimination methods suggested by Fisher (1936). In its original form, this process defined a classification rule in terms of a combination of the predictor variables that maximized the separability between the classes, and then defined the probability of belonging to each class. This is easily seen to produce the same results as a method based on assuming that the two classes have multivariate normal distributions with identical covariance matrices – hence the sampling paradigm interpretation (Hand, 2001). The appropriateness of LDA for credit scoring has been questioned because of the categorical nature of some credit data and the fact that the covariance matrices are rarely equal. However, the inequality of covariance matrices, as well as the non-normal nature of the categorical data common to credit scoring, may not be critical limitations of the technique (Reichert et al., 1983). Although it is one of the simpler modeling techniques, LDA continues to be widely used in practice.

The second technique we utilized for this paper, logistic regression analysis, is one of the earliest methods based upon the diagnostic paradigm. For the binary classification problem, this is achieved by taking a linear combination of the descriptor variables and transforming the result to lie between 0 and 1, to equate to a probability. Today, logistic regression analysis is considered one of the most common techniques of model development for credit decisions (Thomas, et al., 2002).
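As a minimal illustration of that transformation (our own sketch, with invented coefficients for a hypothetical applicant, not estimates from the paper), the logistic function maps any linear combination of descriptor variables onto the interval between 0 and 1:

```python
import numpy as np

def logistic_probability(x, weights, intercept):
    """Map a linear combination of predictors to a probability in (0, 1)."""
    z = intercept + np.dot(weights, x)   # linear combination of descriptor variables
    return 1.0 / (1.0 + np.exp(-z))      # logistic (sigmoid) transformation

# Hypothetical applicant with two predictors and made-up coefficients
p_good = logistic_probability(x=np.array([3.0, 0.45]),
                              weights=np.array([0.20, -1.10]),
                              intercept=-0.25)
print(f"Estimated probability of being a good credit risk: {p_good:.3f}")
```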


Whereas LDA and logistic analysis are statistical classification methods with lengthy histories, neural network-based classification is a non-statistical technique that has developed as a result of improvements in desktop computing power. Neural networks were developed from attempts to model the processing functions of the human brain. In the brain, dendrites carry electrical signals to a neuron, which converts the signals into pulses of electricity sent along an axon to a number of synapses, which relay information to the dendrites of other neurons. Analogous to the brain, a neural network consists of inputs or variables, each of which is multiplied by a weight, analogous to a human dendrite. The products are then summed and transformed in a “neuron” and the result becomes an input value for another neuron (Thomas, et al., 2002).

The most common form of neural network is the multilayer perceptron (MLP), which consists of an input layer of signals, an output layer of signals and a number of layers of neurons in between, called hidden layers.

The calculation of the vector of weights in the neural network is known as “training” the network, and the most common mode of neural network training is referred to as “backpropagation”. Using this training mode, all weights are initially set to randomly selected seed values. A training pair is selected and the input values are used to calculate a predicted value for each observation. This is considered a “forward pass”. The backward pass, indicative of backpropagation, consists of distributing the error back through the network in proportion to the contribution made by each weight and then adjusting the weights accordingly. A second training pair is then selected and the forward and backward passes are repeated. This is continued across all cases. When all cases have been evaluated, one “epoch” is said to have been completed. The whole procedure is repeated until a pre-established stopping criterion is reached (Thomas, et al., 2002). For a binary classification problem, only one output node is required. The output values from the neural network are then used for classification.
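The sketch below illustrates this forward-pass/backward-pass cycle for a single-hidden-layer network with one output node. It is a simplified illustration on synthetic data, not the network architecture or software used later in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary-classification data (purely illustrative)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200) > 0).astype(float)

n_hidden, lr = 5, 0.1
W1 = rng.normal(scale=0.5, size=(4, n_hidden))   # input-to-hidden weights (random seed values)
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=n_hidden)        # hidden-to-output weights
b2 = 0.0

for epoch in range(50):                          # one epoch = one pass over all training cases
    for xi, yi in zip(X, y):
        # Forward pass: compute the predicted value for this observation
        h = sigmoid(xi @ W1 + b1)
        p = sigmoid(h @ W2 + b2)
        # Backward pass: distribute the error back through the network
        # (gradients of the cross-entropy loss with respect to each weight)
        d_out = p - yi
        dW2 = d_out * h
        d_hidden = d_out * W2 * h * (1 - h)
        dW1 = np.outer(xi, d_hidden)
        # Adjust each weight in proportion to its contribution to the error
        W2 -= lr * dW2
        b2 -= lr * d_out
        W1 -= lr * dW1
        b1 -= lr * d_hidden

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5
print(f"Training accuracy after 50 epochs: {np.mean(pred == y):.2%}")
```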

Neural networks have been studied extensively as a useful tool in many financial service applications. For example, West (2000) reports that neural network models are used successfully by American Express for credit card fraud detection, by Lloyd’s Motor Finance for automobile financing decisions, and by Security Pacific Bank for scoring small business loan applications.

Although dozens of statistical and non-statistical model development techniques exist, these three techniques were selected for use in this paper because of their respective positions in credit scoring: LDA is the oldest and the second most widely used of the three techniques, logistic analysis is the most common technique in use, and neural networks are continuing to emerge and have begun to demonstrate success and, in some cases, superiority over the more traditional techniques (Zhang, et al., 1999).

In the next section, we will explore three methods used to evaluate classification models: global classification rates, the Kolmogorov-Smirnov test, and ROC curves. Each evaluation method utilizes different information regarding classification error costs. As a result, the selection of the “best” model will vary based upon the evaluation method selected.


Methods of Model Evaluation

As stated in the previous section, a central concern of modeling techniques is an improvement in accuracy. In credit scoring, an improvement in accuracy of even a fraction of a percentage point can translate into significant savings. However, how can the analyst know if one model represents an improvement over a second model? The answer to this question may change based upon the selection of evaluation method. Therefore, analysts who utilize classification models need to understand the circumstances under which each evaluation method is most appropriate.

Prior to a detailed discussion of evaluation methods, some of the issues related to model evaluation should first be addressed. Hand (2001) identifies three primary issues for consideration: individual versus population classification, conditional versus unconditional model assessment, and absolute versus comparative performance. The first two issues, although not trivial, can be resolved relatively easily. The third issue, specifically, comparative performance, represents the primary focus of this paper. Each of these three issues will be explored in turn.

The first issue is related to the difference between individual measures and population measures. Individual measures refer to the performance of a modeling technique in assigning an individual to a class. In contrast, population measures evaluate the overall classification rate for a model in classifying the entire population into distinct categories. For example, in the field of medicine, a common classification objective is to determine which individuals have (or will have) a particular disease, where having the disease is unlikely. In this example, if the objective is individual assignment, every individual would be assigned to the non-disease class, because that class has the highest probability of membership. However, because some small portion of the population will have the disease, and this assignment is arguably the more valuable of the two, the individual observations cannot simply be assigned to the class to which they have the highest probability of membership. Rather, those individuals who have the largest probability of belonging to the disease class, regardless of how small, should be assigned to that class.

The second issue involves the difference between conditional and unconditional performance. This distinction is subtle, but important when evaluating models. Conditional performance refers to a classification model constructed using a specific data set, while unconditional performance describes the procedure for constructing the model. For example, if an analyst has a defined data set, which will be used to develop a model, he will generally want to know the conditional performance – Given this dataset, how well do these models perform? Alternatively, if the goal is to make general statements about which classification method is superior, then unconditional (or absolute) performance, not based upon a given data set, is appropriate. Further differentiating between the two performance types, Hand (2001) explains that, “Conditional performance will be determined using a chosen measure (e.g., classification rate…) while unconditional performance will be determined in terms of the statistical properties of a chosen measure over possible design sets which could have been drawn.”

The third issue related to model evaluation is the distinction between absolute and comparative performance. An absolute evaluation compares each model to some exogenous metric (e.g., the model cannot misclassify more than 20% of all individuals). Comparative performance tells us whether model A is better than model B, where “better” is clearly defined by the decision maker or analyst. However, the answer to which is “better” may not be straightforward. Our concern is with evaluating the comparative performance of models developed using the three modeling techniques described above.

In the context of binary classification models, one of four outcomes is possible: (i) a true positive – e.g., a good credit risk is classified as “good”; (ii) a false positive – e.g., a bad credit risk is classified as “good”; (iii) a true negative – e.g., a bad credit risk is classified as “bad”; (iv) a false negative – e.g., a good credit risk is classified as “bad”. N-class prediction models are significantly more complex and outside the scope of this paper. For an examination of the issues related to N-class prediction models, see Taylor and Hand (1999).

In principle, each of these outcomes has some associated “loss” or “reward”. For example, in the medical field, a true positive “reward” might be an individual with a disease being given the information that the disease is present and seeking medical attention appropriately. A false negative “loss” (or misclassification cost) might be the same diseased patient not receiving the information, and possibly dying from a treatable condition. In a credit lending context, a true positive “reward” might be a qualified person obtaining a needed mortgage, with the bank reaping the economic benefit of making a correct decision. A false negative “loss” might be the same qualified person being turned down for a mortgage. In this instance, the bank not only incurs the opportunity cost of losing a good customer, but also the possible cost of increasing its competitor’s business.


In principle, it is often assumed that the two types of incorrect classification – false positives and false negatives – incur the exact same loss (Hand, 2001). If this is truly the case, then a simple “global” classification rate could be used for model evaluation. For example, suppose a hypothetical mortgage classification model produced the following confusion matrix:

                  True Good   True Bad   Total
  Predicted Good        650         50     700
  Predicted Bad         200        100     300
  Total                 850        150   1,000

This model would have a global classification rate of 75% (650/1000 + 100/1000). This simple metric is reasonable if the costs associated with each error are known (or assumed) to be the same. If this were the case, the selection of a “better” model would be easy – the model with the highest classification rate would be selected. Even if the costs were not equal, but at least understood with some degree of certainty, the total loss associated with the selection of one model over another could still be easily evaluated based upon this confusion matrix.
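The calculation for this hypothetical confusion matrix can be sketched as follows (a simple illustration, using the paper's counts):

```python
# Global classification rate from the hypothetical confusion matrix above
true_positive = 650    # true goods predicted good
false_positive = 50    # true bads predicted good
false_negative = 200   # true goods predicted bad
true_negative = 100    # true bads predicted bad

total = true_positive + false_positive + false_negative + true_negative
global_classification_rate = (true_positive + true_negative) / total
print(f"Global classification rate: {global_classification_rate:.0%}")   # 75%
```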

For example, the projected loss associated with use of a particular model can be represented by the loss function

L = π0f0c0 + π1f1c1,    (1)

where πi is the probability that an object comes from class i (the prior probability), fi is the probability of misclassifying a class i object, and ci is the cost associated with misclassifying an observation into that category, where, for example, 0 indicates a “bad” credit risk and 1 indicates a “good” credit risk. Parameter estimation, model selection, and performance assessment can be based on minimizing this function (Hand, 1999).
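As a hypothetical worked example of Equation (1), with invented prior probabilities, misclassification rates, and costs (none of which come from the paper's data):

```python
# Worked example of the loss function L = pi0*f0*c0 + pi1*f1*c1 (all numbers hypothetical)
pi_bad, pi_good = 0.15, 0.85    # prior probabilities of the "bad" (0) and "good" (1) classes
f_bad, f_good = 0.33, 0.24      # probabilities of misclassifying a bad / good observation
c_bad, c_good = 1000.0, 150.0   # assumed cost of misclassifying a bad as good, and a good as bad

expected_loss = pi_bad * f_bad * c_bad + pi_good * f_good * c_good
print(f"Projected loss per applicant: ${expected_loss:.2f}")
```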

However, rarely in credit scoring, or in any other field, is perfect information about the costs of classification errors available.

A second issue when using a simple classification matrix for evaluation is the problem that can occur when evaluating models dealing with rare events – the first issue outlined at the beginning of this section. If the prior probability of one class is very high, a model can achieve a strong classification rate simply by assigning all observations to that class. However, when a particular observation has a low probability of occurrence (e.g., cancer, bankruptcy, tornadoes, etc.), it is far more difficult to assign these low-probability observations to their correct class (where possible, this issue of strongly unequal prior probabilities can be addressed during model development or network training by contriving the two classes to be of equal size, but this may not always be an option). The difficulty of accurate rare class assignment is not captured if the simple global classification rate is used as an evaluation method (Gim, 1995). Because of the issues of rare events and imperfect information, the simple classification rate should very rarely be used for model evaluation. However, a quick scan of papers which evaluate different modeling techniques will reveal that this is the most frequently utilized (albeit weakest, due to the assumption of perfect information) method of model evaluation.


One of the most common methods of credit scoring model evaluation in practice is the Kolmogorov-Smirnov statistic, or K-S test. The K-S test measures the distance between the distribution functions of the two classifications (e.g., good credit risks and bad credit risks). The score which generates the greatest separability between the functions is considered the threshold value for accepting or rejecting a credit application. The model producing the greatest amount of separability between the two distributions would be considered the superior model. A graphical example of a K-S test can be seen in Figure 2. In this illustration, the greatest separability between the two distribution functions occurs at a score of approximately .7. Using this score, if all applicants who scored above .7 were accepted and all applicants scoring below .7 were rejected, then approximately 80% of all “good” applicants would be accepted, while only 32% of all “bad” applicants would be accepted. The measure of separability, or the K-S test result, would be 48% (80% − 32%).

Figure 2: K-S Test Illustration
[Plot of the cumulative percentage of observations (0% to 100%) against the score cut-off (0.00 to 1.00) for good accounts and bad accounts; the greatest separation between the two distribution functions occurs at a score of approximately .7.]
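A minimal sketch of the K-S calculation on a set of model scores follows. The scores are synthetic and purely illustrative; scipy's two-sample routine is one way to obtain the statistic, and the cut-off sweep mirrors the graphical reading of Figure 2.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Synthetic model scores: "good" accounts tend to score higher than "bad" accounts
scores_good = rng.beta(5, 2, size=2000)
scores_bad = rng.beta(2, 4, size=1000)

# K-S statistic: maximum distance between the two empirical distribution functions
ks_stat, _ = ks_2samp(scores_good, scores_bad)
print(f"K-S statistic: {ks_stat:.2%}")

# Approximately the same by hand: sweep candidate cut-offs and take the largest separation
cutoffs = np.linspace(0, 1, 101)
separation = [np.mean(scores_good > c) - np.mean(scores_bad > c) for c in cutoffs]
best = int(np.argmax(separation))
print(f"Greatest separation {separation[best]:.2%} at a score cut-off of {cutoffs[best]:.2f}")
```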


Hand (2002) criticizes the K-S test for many of the same reasons outlined for the simple global classification rate. Specifically, the K-S test assumes that the relative costs of the misclassification errors are equal or that only one misclassification cost is relevant. In industry practice, this may be a partially valid assumption – analysts developing scoring models for credit lenders generally have an understanding of the costs associated with an account that eventually does not pay (e.g., a false positive). The opportunity losses associated with a false negative (e.g., not accepting a customer who would have paid on time) are generally not considered, although they could be known. As a result, the K-S test does not incorporate relevant information regarding the performance of classification models – the misclassification rates and their respective costs. The measure of separability then becomes somewhat hollow.

In some instances, the researcher may not have any information regarding the costs of errors, such as the relative cost of one error type versus another. In almost every circumstance, one type of misclassification will be considered more serious than another. However, a determination of which error is the more serious is generally less well defined, or may even be in the eye of the beholder. For example, returning to the field of medicine, is the “worse” mistake a false negative, where a diseased individual is told that they are healthy and therefore may not seek a needed treatment? Or is the worse mistake a false positive, where a healthy individual is told they have a disease that is not present, and so seeks unnecessary treatment and experiences unnecessary emotional distress? In a highly competitive business, is the worse mistake to turn away a potentially valuable customer to a competitor, or to accept a customer that does not meet financial expectations? The answers are not always straightforward, and the cost function outlined above may not be applicable.

One method of evaluation which enables a comprehensive analysis of all possible error severities is the ROC curve. The “Receiver Operating Characteristics” curve was first applied to assess how well radar equipment in WWII distinguished random interference, or “noise”, from the signals which were truly indicative of enemy planes (Swets, et al., 2000). ROC curves have since been used in fields ranging from electrical engineering and weather prediction to psychology, and are used almost ubiquitously in the literature on medical testing to determine the effectiveness of medications. The ROC curve plots the sensitivity or “hits” (e.g., true positives) of a model on the vertical axis against 1 − specificity or “false alarms” (e.g., false positives) on the horizontal axis. The result is a bowed curve rising from the 45-degree line to the upper left corner; the sharper the bend and the closer to the upper left corner, the greater the accuracy of the model (see Figure 3).

Figure 3: ROC Curve Illustration

[Plot of sensitivity (probability of true positives) on the vertical axis against 1 − specificity (probability of false positives) on the horizontal axis, each ranging from 0 to 1.]

The area under the ROC curve is a convenient way to compare different classification models when the analyst or decision maker has no information regarding the costs or severity of classification errors. This measurement is equivalent to the Gini index (Thomas et al., 2002) and the Mann-Whitney-Wilcoxon test statistic for comparing two distributions (Hanley and McNeil, 1982, 1983), and is referred to in the literature in many ways, including “AUC”, the c-statistic, and “θ” (we will use the term “θ” for the remainder of this paper to describe this area). For example, if observations were assigned to two classes at random, such that there was equal probability of assignment to either class, the ROC curve would follow a 45-degree line emanating from the origin. This would correspond to θ = .5. A perfect binary classification, θ = 1, would be represented by an ROC “curve” that followed the y-axis from the origin to the point (0, 1) and then followed the top edge of the square.

The metric θ can be considered as an averaging of the misclassification rates over all possible choices of the various classification thresholds. In other words, θ is an average of the diagnostic performance of a particular model over all possible values of the relative misclassification severities (Hand, 2001). The interpretation of θ, where a “good” credit risk is scored as a 1 and a “bad” credit risk is scored as a 0, is the answer to the question: “Using this model, what is the probability that a truly ‘good’ credit risk will be scored higher than a ‘bad’ credit risk?” Formulaically, θ can be represented as

θ = ∫F(p|0)dF(p|1),    (2)

where F(p|0) is the distribution of the probabilities of assignment to class 0 (classification as a “bad” credit risk) and F(p|1) is the distribution of the probabilities of assignment to class 1 (classification as a “good” credit risk).
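The following sketch, on synthetic scores (not the paper's data), illustrates both readings of θ: the area under the ROC curve and the probability that a randomly chosen “good” scores higher than a randomly chosen “bad”.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

# Synthetic scores for "good" (1) and "bad" (0) credit risks
scores_good = rng.beta(5, 2, size=2000)
scores_bad = rng.beta(2, 4, size=1000)
y_true = np.r_[np.ones(2000), np.zeros(1000)]
y_score = np.r_[scores_good, scores_bad]

# Theta as the area under the ROC curve
theta = roc_auc_score(y_true, y_score)

# Theta as P(score of a random good > score of a random bad), the Mann-Whitney reading
pairs = rng.integers(0, [2000, 1000], size=(50000, 2))
theta_mc = np.mean(scores_good[pairs[:, 0]] > scores_bad[pairs[:, 1]])

print(f"AUC (theta): {theta:.3f}; Monte Carlo estimate of P(good > bad): {theta_mc:.3f}")
```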

The ROC curve provides an evaluation of a model based upon the losses associated with any choice of classification threshold and any ratio of misclassification costs. When no prior information regarding error severities is available, ROC curves are highly appropriate. However, it is rare in practice that nothing is known about the relative cost or severity of misclassification errors. In addition, rarely are all threshold values relevant.

Other evaluation methods exist in addition to those outlined in this section, including the Mahalanobis distance, the Chernoff distance, the Lissack-Fu distance, t-statistics, and others. However, the three methods outlined above are the most commonly used in practice and the most frequently referenced in the academic literature. In addition, these three methods highlight the primary issues associated with all evaluation methods. No method represents a panacea for model evaluation.

In this section, we have outlined three primary issues related to model evaluation. The third issue, comparative performance, is the most difficult to resolve and is the focus of our work. Specifically, our research attempts to answer two questions:

1. Does model development technique impact classification accuracy?
2. How will model selection vary based upon the selected evaluation method?


To address these questions, we developed models using the three modeling techniques, linear discriminant analysis, logistic analysis and neural networks, and evaluated these models using the three evaluation methods – classification rates, ROC Curves and the K-S test.

Methodology

This section describes the data used in our analysis and provides a detailed description of the model development process.

A real-world data set, consisting of data on 14,042 applicants for car loans in the United States, was used to test the predictive accuracy of the three scoring models. The data represent applications made between June 1st, 1998, and June 30th, 1999. For each application, data on 65 variables were collected. These variables can be categorized into two general classes – data on the individual (e.g., other revolving account balances, whether the applicant rents or owns their residence, bankruptcies, etc.) and data on the transaction (e.g., miles on the vehicle, vehicle make, selling price, etc.). A complete list of all variables is included in Appendix A. These variables represent both continuous and categorical data. From this data set, 9,442 individuals were considered to have been creditworthy applicants and 4,600 were considered to have been not creditworthy, on the basis of whether or not their accounts were charged off as of December 31st, 1999. No confidential information regarding applicants’ names, addresses, Social Security numbers, or any other data elements that would indicate identity was used in this analysis.


The data that was provided to us had already been “cleaned” by the credit lender. Because “dirty” data leads to poor models and poor decision-making, data cleansing is an important, although often overlooked, step in model development. Data often come from multiple sources in multiple formats and may include data entry errors. For a description of accepted data cleansing processes, see Kimball (1996).

An examination of each variable relative to the dependent variable (creditworthiness) found that most of the relationships were non-linear. For example, as can be seen in Figure 4, the relationship between the number of auto trades (i.e., the total number of previous or current auto loans), and an account’s performance is not linear; the ratio of “good” performing accounts to bad performing accounts increases until the number of auto trades reaches the 7-12 range. However, after this range, the ratio decreases.

Figure 4: Ratio of Good:Bad Account Performance by Total Number of Auto Trades
[Chart of the ratio of good to bad accounts (vertical axis, approximately 1.0 to 2.6) by the total number of auto trades, grouped as 0, 1 to 6, 7 to 12, and 13 to 18.]

The impact of this non-linearity on model development is clear – the overall classification accuracy of the model would decrease if the entire range of 0 to 18+ auto trades were included as a single variable. In other words, the model would be expected to perform well in some ranges of the variable but not in others. To address this issue, we ran frequency tables for each variable (both continuous and categorical) and created multiple dummy variables for each original variable. Using the auto trade variable above, we converted this variable into 4 variables – 0 auto trades, 1-6 auto trades, 7-12 auto trades, and 13+ auto trades. We then assigned a value of “1” if the observation fell into the specified range and a value of “0” if it did not.
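A sketch of this binning step is given below, using pandas and the auto-trades variable (AUTRDS in Appendix A) as an example; the data values and dummy-variable names shown are hypothetical, not the paper's coding scheme.

```python
import pandas as pd

# Hypothetical raw values of the auto-trades variable (total number of previous/current auto loans)
df = pd.DataFrame({"AUTRDS": [0, 2, 7, 15, 4, 11, 0, 19]})

# Convert the raw count into the four ranges described above:
# 0 trades, 1-6 trades, 7-12 trades, 13+ trades
bins = [-1, 0, 6, 12, float("inf")]
labels = ["AUTRDS_0", "AUTRDS_1_6", "AUTRDS_7_12", "AUTRDS_13_PLUS"]
df["bin"] = pd.cut(df["AUTRDS"], bins=bins, labels=labels)

dummies = pd.get_dummies(df["bin"]).astype(int)   # one 0/1 indicator column per range
print(pd.concat([df["AUTRDS"], dummies], axis=1))
```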

Prior to analysis, the data was divided into a testing file, representing 80% of the data set, and a validation file, representing 20% of the data set. The LDA and logistic analysis models were developed using the SAS system (v.8.2), using the PROC REG (when all of the variables are dummies, the procedures PROC REG and PROC DISCRIM generate the same results) and PROC LOGISTIC procedures, respectively.
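The models themselves were fit in SAS. For readers without SAS, a rough scikit-learn sketch of the same split-and-fit workflow might look like the following; it runs on synthetic data as a stand-in for the applicant file, and none of the settings shown are the authors' own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the dummy-coded applicant file (roughly 1/3 bad, 2/3 good)
X, y = make_classification(n_samples=14042, n_features=40, n_informative=10,
                           weights=[0.33, 0.67], random_state=0)

# 80% testing (development) file, 20% validation file, as described above
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.20,
                                              random_state=42, stratify=y)

lda = LinearDiscriminantAnalysis().fit(X_dev, y_dev)
logit = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

print(f"LDA validation accuracy:      {lda.score(X_val, y_val):.2%}")
print(f"Logistic validation accuracy: {logit.score(X_val, y_val):.2%}")
```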

There are currently no established guiding principles to assist the analyst in developing a neural network model. Since many factors, including the number of hidden layers, the number of hidden nodes, and the training methodology, can affect network performance, the best network is generally developed through experimentation – making it somewhat more art than science (Zhang, et al., 1999).

Using the basic MLP network model, the inputs into our classification networks were simply the same predictor variables utilized for the LDA and logistic regression models outlined above. Although non-linearity is not an issue with neural network models, using the dummy variable data rather than the raw data eliminated issues related to scaling (we did run the same neural network models with the raw data, with no material improvement in classification accuracy). Because our classification networks were binary, we required only a single output node. The selection of the number of hidden nodes is effectively the “art” in neural network development. Although some heuristics have been proposed as the basis for determining the number of nodes a priori (e.g., n/2, n, n+1, 2n+1), none has been shown to perform consistently well (Zhang, et al., 1999). To see the effects of hidden nodes on the performance of neural network models, we used 10 different levels of hidden nodes, ranging from 5 to 50 in increments of 5, allowing us to include the effects of both smaller and larger networks. Backpack® v. 4.0 was used for neural network model development.

We split our original testing file, which was used for the LDA and logistic model development, into a separate training file (60% of the complete data set) and a testing file (20% of the complete data set). The same validation file used for the first two models was also applied to the validation of the neural networks. Because neural networks cannot guarantee a global solution, we attempted to minimize the likelihood of being trapped in a local solution by training the network 100 times, using epochs (e.g., the number of observations from the training set presented to the network before weights are updated) of size 12 with 200 epochs between tests.
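The networks reported below were built in Backpack®. As a rough, illustrative sketch of the same hidden-node experiment using scikit-learn's MLPClassifier on synthetic data (the splits mirror those described above, but the training parameters are not the authors' settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the applicant data
X, y = make_classification(n_samples=14042, n_features=40, n_informative=10,
                           weights=[0.33, 0.67], random_state=0)

# 60% training, 20% testing, 20% validation, mirroring the splits described above
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.40,
                                                    random_state=1, stratify=y)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.50,
                                                random_state=1, stratify=y_rest)

# Vary the number of hidden nodes from 5 to 50 in increments of 5
for n_hidden in range(5, 55, 5):
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=500, random_state=1)
    net.fit(X_train, y_train)
    print(f"{n_hidden:2d} hidden nodes: validation accuracy {net.score(X_val, y_val):.2%}")
```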


Results

Table 1 summarizes the results for the different modeling techniques using the three model evaluation methods. As expected, selection of a “winning” model is not straightforward. Specifically, model selection will vary based upon the two main issues highlighted above – the costs of misclassification errors and the problem domain.

Table 1: Comparison of Models Using Multiple Methods of Evaluation

                                        Classification Rate
Modeling Technique              Goods(1)   Bads(2)   Overall(3)   Theta(4)   K-S Test
Linear Discriminant Analysis     73.91%    43.40%     59.74%       68.98%      19%
Logistic Regression              70.54%    59.64%     69.45%       68.00%      24%
Neural Networks:
  5 Hidden Nodes                 63.50%    56.50%     58.88%       63.59%      38%
  10 Hidden Nodes                75.40%    44.50%     55.07%       64.46%      11%
  15 Hidden Nodes                60.10%    62.10%     61.40%       65.89%      24%
  20 Hidden Nodes                62.70%    59.00%     60.29%       65.27%      24%
  25 Hidden Nodes                76.60%    41.90%     53.78%       63.55%      16%
  30 Hidden Nodes                52.70%    68.50%     63.13%       65.74%      22%
  35 Hidden Nodes                60.30%    59.00%     59.46%       63.30%      22%
  40 Hidden Nodes                62.40%    58.30%     59.71%       64.47%      17%
  45 Hidden Nodes                54.10%    65.20%     61.40%       64.50%      31%
  50 Hidden Nodes                53.20%    68.50%     63.27%       65.15%      37%

1. The percentage of “good” applications correctly classified as “good”.
2. The percentage of “bad” applications correctly classified as “bad”.
3. The overall correct global classification rate.
4. The area under the ROC curve.


If the misclassification costs are known with some confidence to be equal, the global classification rate could be utilized as an appropriate evaluation method. Using this method, the logistic regression model outperforms the other models, with a global classification rate of 69.45%. In other words, a decision maker selecting this model would make an error 31% of the time, with no consideration of error type, severity or cost of misclassification.

If costs are known with some degree of certainty, a “winning” model could be selected based upon the classification rates of “goods” and “bads”. For example, if a false negative error (e.g., classifying a true good as bad) is considered to represent a greater misclassification cost than a false positive (e.g., classifying a true bad as good), then the neural network with 25 hidden nodes would represent the preferred model. This model generated an accurate prediction of “good” credit applicants 76.60% of the time. Alternatively, if a false positive error is considered to represent a greater misclassification cost, then the neural networks with 30 or 50 hidden nodes would be selected. These models generated an accurate prediction of “bad” credit applicants 68.50% of the time (in the credit industry, the false positive error is traditionally considered to be the more costly of the two misclassification errors).

If the analyst is most concerned with the models’ ability to provide a separation between the scores of good applicants and bad applicants, the K-S test is the traditional method of model evaluation. Using this test, the neural network with 5 hidden nodes would be selected. The greatest point of separation between the distributions of “good” applicant scores and “bad” applicant scores occurs at .7. At this score, almost 100% of the “goods” would be accepted, but only 60% of the “bads” would be accepted, resulting in a separation between the two distributions of almost 40%.

The last method of evaluation assumes the least amount of available information. The θ measurement represents the integration of the area under the ROC curve and accounts for all possible iterations of the relative severities of misclassification errors. In the context of the real-world problem domain used to develop the models for this paper, the evaluation of applicants for auto loans, the decision makers would most likely have had some amount of information regarding misclassification costs, and therefore θ would probably not have represented the most appropriate model evaluation method. However, if the available data were used, for example, as a proxy to classify potential customers for a completely new product offering, where no existing data was available and the respective misclassification costs were less understood, θ would represent a very appropriate method of evaluation. From this data set, if θ were selected as the method of evaluation, the LDA model would have been selected. A decision maker would interpret θ for this model as follows: if a pair of good and bad observations is selected at random, 69% of the time the “good” observation will have a higher score than the “bad” observation.

Summary and Conclusions

Accurate binary classification represents a domain of interesting and important applications. The ability to correctly diagnose the presence of a disease or correctly “score” credit applicants has tremendous consequences within each respective problem domain. Researchers and analysts spend a great deal of time constructing classification models, with the objective of minimizing the implicit and explicit costs of misclassification errors. Given this objective, and the benefits associated with even marginal improvements, both researchers and practitioners have emphasized the importance of modeling techniques. However, we believe that this emphasis has been somewhat misplaced, or at least misallocated. At least as much attention should be allocated to the selection of the model evaluation method as is allocated to the selection of the modeling technique. Although superior classification models may be developed, application of an inappropriate evaluation method could result in the selection of an inferior model, and decision-making being supported through sub-optimal tools.

In this paper, we have explored three common evaluation methods – classification rate, the Kolmogorov-Smirnov statistic, and the ROC curve. Each of these evaluation methods can be used to assess model performance. However, the selection of which method to use is contingent upon the information available regarding misclassification costs. If the misclassification costs are considered to be equal, then a straight global classification rate can be utilized to assess the relative performance of competing models. If the costs are unequal, but known with certainty, then a simple cost function can be applied using the costs, the prior probabilities of assignment and the probabilities of misclassification. Using a similar logic, the K-S test can be used to evaluate models based upon the separation of each class’s respective distribution function – in the context of credit scoring, the percentage of “good” applicants is maximized while the percentage of “bad” applicants is minimized, with no allowance for relative costs. Where no information is available, the ROC curve and the θ measurement represent the most appropriate evaluation method. Because this last method incorporates all possible iterations of misclassification error severities, many irrelevant ranges will be included in the calculation.

Adams and Hand (1999) have developed an alternative method which may address many of the issues outlined above and provide researchers with another option for consideration – the LC (or loss comparison) index. Specifically, where the simple classification rate and the K-S test require complete knowledge of misclassification costs, and the ROC curve assumes no knowledge of misclassification costs, the LC index assumes only knowledge of the relative severities of the two costs. For example, a decision maker may not know the exact cost of a false positive error or a false negative error, but he may be able to estimate that a false positive is between three and five times more expensive than a false negative. Using this simple but realistic estimation, the LC index can be used to generate a value which aids the decision maker in determining the model which performs best within the established relevant range.

The LC Index has had little empirical application to date. However, given the limitations of the evaluation methods commonly used and highlighted in this paper, it may represent an opportunity for further research and consideration.

Clearly, no model evaluation method represents a panacea for researchers, analysts, or decision makers. As a result, an understanding of the context of the data and the problem domain is critical for the selection not just of a modeling technique, but also of a model evaluation method.


Appendix A
Listing of Original Variables in Data Set (Variable Name: Variable Label)

1. ACCTNO: Account Number
2. AGEOTD: Age of Oldest Trade
3. BKRETL: S&V Book Retail Value
4. BRBAL1: # of Open Bank Rev. Trades with Balances > $1000
5. CSORAT: Ratio of Currently Satisfactory Trades:Open Trades
6. HST03X: # of Trades Never 90 DPD+
7. HST79X: # of Trades Ever Rated Bad Debt
8. MODLYR: Vehicle Model Year
9. OREVTR: # of Open Revolving Trades
10. ORVTB0: # of Open Revolving Trades With Balance > $0
11. REHSAT: # of Retail Trades Ever Rated Satisfactory
12. RVTRDS: # of Revolving Trades
13. T2924X: # of Trades Rated 30 DPD+ in the Last 24 Months
14. T3924X: # of Trades Rated 60 DPD+ in the Last 24 Months
15. T4924X: # of Trades Rated 90 DPD+ in the Last 24 Months
16. TIME29: Months Since Most Recent 30 DPD+ Rating
17. TIME39: Months Since Most Recent 60 DPD+ Rating
18. TIME49: Months Since Most Recent 90 DPD+ Rating
19. TROP24: # of Trades Opened in the Last 24 Months
20. CURR2X: # of Trades Currently Rated 30 DPD
21. CURR3X: # of Trades Currently Rated 60 DPD
22. CURRSAT: # of Trades Currently Rated Satisfactory
23. GOOD: Performance of Account
24. HIST2X: # of Trades Ever Rated 30 DPD
25. HIST3X: # of Trades Ever Rated 60 DPD
26. HIST4X: # of Trades Ever Rated 90 DPD
27. HSATRT: Ratio of Satisfactory Trades to Total Trades
28. HST03X: # of Trades Never 90 DPD+
29. HST79X: # of Trades Ever Rated Bad Debt
30. HSTSAT: # of Trades Ever Rated Satisfactory
31. MILEAG: Vehicle Mileage
32. OREVTR: # of Open Revolving Trades
33. ORVTB0: # of Open Revolving Trades With Balance > $0
34. PDAMNT: Amount Currently Past Due
35. RVOLDT: Age of Oldest Revolving Trade
36. STRT24: Sat. Trades:Total Trades in the Last 24 Months
37. TIME29: Months Since Most Recent 30 DPD+ Rating
38. TIME39: Months Since Most Recent 60 DPD+ Rating
39. TOTBAL: Total Balances
40. TRADES: # of Trades
41. AGEAVG: Average Age of Trades
42. AGENTD: Age of Newest Trade
43. AGEOTD: Age of Oldest Trade
44. AUHS2X: # of Auto Trades Ever Rated 30 DPD
45. AUHS3X: # of Auto Trades Ever Rated 60 DPD
46. AUHS4X: # of Auto Trades Ever Rated 90 DPD
47. AUHS8X: # of Auto Trades Ever Repoed
48. AUHSAT: # of Auto Trades Ever Satisfactory
49. AUOP12: # of Auto Trades Opened in the Last 12 Months
50. AUSTRT: Sat. Auto Trades:Total Auto Trades
51. AUTRDS: # of Auto Trades
52. AUUTIL: Ratio of Balance to HC for All Open Auto Trades
53. BRAMTP: Amt. Currently Past Due for Revolving Auto Trades
54. BRHS2X: # of Bank Revolving Trades Ever 30 DPD
55. BRHS3X: # of Bank Revolving Trades Ever 60 DPD
56. BRHS4X: # of Bank Revolving Trades Ever 90 DPD
57. BRHS5X: # of Bank Revolving Trades Ever 120+ DPD
58. BRNEWT: Age of Newest Bank Revolving Trade
59. BROLDT: Age of Oldest Bank Revolving Trade
60. BROPEN: # of Open Bank Revolving Trades
61. BRTRDS: # of Bank Revolving Trades
62. BRWRST: Worst Current Bank Revolving Trade Rating
63. CFTRDS: # of Financial Trades
64. CUR49X: # of Trades Currently Rated 90 DPD+
65. CURBAD: # of Trades Currently Rated Bad Debt

References

Adams, N. M., & Hand, D. J. (1999). Comparing classifiers when the misclassification costs are uncertain. Pattern Recognition, 32, 1139-1147.

Board of Governors of the Federal Reserve System (2003).

Fisher, R. A. (1936). The use of multiple measurement in taxonomic problems. Annals of Eugenics, 7, 179-188.

Gim, G. (1995). Hybrid Systems for Robustness and Perspicuity: Symbolic Rule Induction Combined with a Neural Net or a Statistical Model. Unpublished dissertation, Georgia State University, Atlanta, GA.

Hand, D. J. (2001). Measuring diagnostic accuracy of statistical prediction rules. Statistica Neerlandica, 55(1), 3-16.

Hand, D. J. (2002). Good practices in retail credit scorecard assessment. Working paper.

Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a Receiver Operating Characteristics curve. Radiology, 143, 29-36.

Hanley, J. A., & McNeil, B. J. (1983). A method of comparing the areas under a Receiver Operating Characteristics curve. Radiology, 148, 839-843.

Kimball, R. (1996). Dealing with dirty data. DBMS, 9(10), 55-60.

Reichert, A. K., Cho, C. C., & Wagner, G. M. (1983). An examination of the conceptual issues involved in developing credit scoring models. Journal of Business and Economic Statistics, 1, 101-114.

Swets, J. A., Dawes, R. M., & Monahan, J. (2000). Better decisions through science. Scientific American, 283(4), 82-88.

Taylor, P. C., & Hand, D. J. (1999). Finding superclassifications with acceptable misclassification rates. Journal of Applied Statistics, 26, 579-590.

Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). Credit Scoring and Its Applications. Philadelphia: Society for Industrial and Applied Mathematics.

West, D. (2000). Neural network credit scoring models. Computers and Operations Research, 27, 1131-1152.

Zhang, G., Hu, M. Y., Patuwo, B. E., & Indro, D. C. (1999). Artificial neural networks in bankruptcy prediction: General framework and cross-validation analysis. European Journal of Operational Research, 116, 16-32.


Dr. Satish Nargundkar is an Assistant Professor in the Department of Management at Georgia State University. He has published in the Journal of Marketing Research, the Journal of Global Strategies, and the Journal of Managerial Issues, and has co-authored chapters in research textbooks. He has assisted several large financial institutions with the development of credit scoring models and marketing and risk strategies. His research interests include Data Mining, CRM, Corporate Strategy, and pedagogical areas such as web-based training. He is a member of the Decision Sciences Institute. Jennifer Priestley is a Doctoral Student in the Department of Management at Georgia State University, studying Decision Sciences. She was previously a Vice President with MasterCard International and a Consultant with Andersen Consulting. Her research and teaching interests include Data Mining, Statistics and Knowledge Management.
