4 BORROWED MEASURES

Some time back I bought a book called “An Introduction to Credit Risk Modelling” (Bluhm et al. 2003), thinking that it would be another book with many of the same concepts that I had become familiar with. Much to my surprise, the book was filled with lots of maths and stats that were totally unfamiliar to me. The reason… the book was focussed primarily on modelling the credit risk of businesses—the difference being that outside of micro and small businesses there is not a lot of data. First, the number of business failures, defaults, bankruptcies—or whatever you wish to call them—is small. Second, the time frames required may be long. The stats and maths being presented here are those appropriate for big-data environments, which are also sometimes used for smaller businesses. They have been borrowed from a variety of disciplines, including cryptography (code breaking), economics, signal detection, and so on. While I have a great interest in the origins of these calculations, you may not share that interest, so I’ll try to keep the historical details to the minimum required to provide context.

Many of the tools mentioned are summary statistics that are intended to give an indication of models’ predictive power or stability. Care must be taken when using any of them, as summary statistics miss vagaries in the underlying distributions from which they are calculated. Gini coefficients, AUROCs, KS-statistics and the like should thus be combined with other measures or data visualisation tools when assessing models that provide similar results. Some of the statistics also share the distinction that they are based on ‘empirical cumulative distribution functions’ (ECDFs), i.e. the cumulative percentages after the data has been sorted by some measure.6 When assessing hits and misses, subjects are sorted by the ‘expected’ miss rate and the cumulative percentages of hits and misses are calculated for each subject or group. Stats and graphs based on the ECDF include the Lorenz curve, Gini coefficient, ROC and AUROC, and the K-S plot and statistic—most of which are used to assess models’ predictive power.

One should also note at the outset that when used to assess predictive power, most statistics only assess models’ ability to rank cases correctly—not the absolute level of risk. Predictive models are built upon and use a certain set of data, and they may work perfectly well when discriminating case-by-case levels of risk, but they cannot do much when extraneous factors increase the overall level of risk affecting all cases.

4.1 Borrowings from economics

At this point we start focussing on statistics that measure the power of the final model. The first come from the field of economics… the illustration and measurement of inequalities amongst people in societies. The origins trace back earlier, but start with what we today call the ‘Pareto principle’, or 80/20 rule, first postulated by Vilfredo Pareto in 1896 when he noted that 20 percent of the people in Italy owned 80 percent of the land.

6 The concept of an ECDF is well known in statistics, but in predictive modelling seems to be used primarily in the domain of corporate credit-risk rating amongst the rating agencies.


4.1.1 Lorenz curve

Thereafter, an American economist called Max Otto Lorenz published a paper in 1905, focussing specifically on wealth distributions. The chart used to illustrate them has since become known as the Lorenz curve. It is one of many tools based on the ECDF, but in this instance, rather than hits and misses sorted by probability, we are assessing individuals sorted by their shekels. In Figure 4-1, citizens have been sorted from wealthiest to poorest, and then the cumulative percentages of population (‘X’) and wealth (‘Y’) plotted at each point. From a quick glance it is easily deduced that roughly 60 percent of wealth (or land, or income) is held by 20 percent of the population. The diagonal represents a totally uniform wealth distribution, and if the curve were to hug the left and upper borders it would mean all wealth is held by one very rich but probably sad person.

Figure 4-1 — Lorenz curve

The only problem with this curve is that it is difficult to make direct comparisons of different distributions. The question would be asked, “Is the inequality in Italy more than the USA or England or Spain?”

4.1.2 Gini coefficient

It was another five years before a solution was found. In 1910 Corrado Gini presented a formula for what we today call the ‘Gini coefficient’—the ratio of ‘B’ to ‘A+B’. It is one of many statistics calculated from characteristics’ ‘empirical cumulative distribution function’ (ECDF), which is a fancy way of saying cumulative percentages—i.e. the percentage of cases that have a value at or below each of the possible values. The ECDF is typically expressed as a function F(·)i. If X is population and Y wealth, the subjects are sorted from richer to poorer, and the cumulative percentages are recalculated as one moves from subject to subject. The end result is a measure of wealth disparity. The formula can be stated as:




Equation 4-1: Gini coefficient

$$D_{Gini} = 1 - F(X)_1\,F(Y)_1 - \sum_{i=2}^{n}\left[F(X)_i - F(X)_{i-1}\right]\left[F(Y)_i + F(Y)_{i-1}\right]$$

Corrado Gini (1885-1965) was an Italian social scientist. His first published work in 1908 proposed that the tendency for a family to produce boys or girls was heritable. In his 1910 work, he sought to disprove Pareto’s claim that income inequality would be less in wealthier societies. He published The Scientific Basis of Fascism in 1927, and founded the Italian Committee for the Study of Population Problems in 1929. He was one of the few fascists to survive the post-World War II purge, because of the quality of the committee’s work.

This is only one possible representation of the formula, and if you think it is complicated you should see the other guy! The end result is always a value between 0 and 1, if done correctly, but is usually published as a percentage. In economics, these extreme values indicate perfect equality and perfect inequality, ‘the people share equally’ to ‘one guy’s got it all’.7

Table 4-1 — Gini coefficient illustration

X       f(X)    F(X)    a = F(X)i − F(X)i−1    Y      f(Y)      F(Y)      b = F(Y)i + F(Y)i−1    a×b
500     10%     10%     0,10                   5      1,52%     1,5%      0,015                  0,002
1000    20%     30%     0,20                   25     7,58%     9,1%      0,106                  0,021
2000    40%     70%     0,40                   100    30,30%    39,4%     0,485                  0,194
1000    20%     90%     0,20                   100    30,30%    69,7%     1,091                  0,218
500     10%     100%    0,10                   100    30,30%    100,0%    1,697                  0,170
5000    100%                                   330    100,00%             Gini = 1 − Σ(a×b)      39,5%
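For readers who prefer to see the arithmetic, the sketch below works through Equation 4-1 using the Table 4-1 figures, in Python. It is only an illustration; the function name and layout are mine, not taken from any scoring package.

```python
def gini_from_counts(x_counts, y_counts):
    """Gini coefficient per Equation 4-1, from paired X and Y counts
    (e.g. population and wealth per band, as in Table 4-1)."""
    x_total, y_total = sum(x_counts), sum(y_counts)
    # Build the ECDFs F(X) and F(Y): cumulative percentages of each total
    F_x, F_y = [], []
    cum_x = cum_y = 0.0
    for x, y in zip(x_counts, y_counts):
        cum_x += x / x_total
        cum_y += y / y_total
        F_x.append(cum_x)
        F_y.append(cum_y)
    # First term pairs the first point with the origin; thereafter
    # a = F(X)i - F(X)i-1 and b = F(Y)i + F(Y)i-1, as in Table 4-1
    gini = 1.0 - F_x[0] * F_y[0]
    for i in range(1, len(F_x)):
        gini -= (F_x[i] - F_x[i - 1]) * (F_y[i] + F_y[i - 1])
    return gini

# Table 4-1 figures: X = population counts, Y = wealth
print(round(gini_from_counts([500, 1000, 2000, 1000, 500],
                             [5, 25, 100, 100, 100]), 3))   # 0.395, i.e. 39.5%
```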

4.2 Borrowings from probability theory

4.2.1 Laplace—expected values

The concepts of much of modern statistics, or at least probability theory, were first postulated in the 17th century. It was only in 1814, though, that Pierre-Simon Laplace explicitly defined an expected value as “the product of the sum hoped for by the probability of obtaining it”. This is one of the basics of gambling, a combination of probability and payoff where wagers are only made if the expected payoff is in the gambler’s favour.

Equation 4-2: Expected value

$$E[V] = \sum_{i=1}^{n} p_i v_i$$

Of course, the interest in this simple equation extended far beyond casinos and cards, including into the realms of scenario analysis, whether for individual projects or the fates of firms or nations. In credit, this has been permuted into an expected loss calculation that assumes you cannot win, only manage how much is lost.

7 For the Figure 4-1 illustration, the Gini coefficient is 55%. When applied to income distributions, the norm for developed countries ranges from 25 to 50 percent, and up into the mid- to high-60s for some developing countries—i.e. before social welfare payments and taxes are taken into consideration.


Simply stated, an investment’s value is multiplied by the probability of failure, as well as by value-at-failure and loss-given-failure factors. These are more commonly referred to as the probability of default (PD), exposure at default (EAD), and loss given default (LGD), and these terms will recur in this book.

Equation 4-3: Expected loss

$$EL = v \times PD \times EAD \times LGD$$

The relationship between these elements is set out in Equation 4-3, where the value ‘v’ is the current balance and the others are percentages. Should it appear confusing: EAD is an adjustment to reflect how much the exposure has changed by the default event, and LGD a further factor to reflect how much is really lost.
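As a quick, hypothetical illustration of Equation 4-3 (the account balance and the three percentages below are invented):

```python
def expected_loss(balance, pd, ead, lgd):
    """Expected loss per Equation 4-3: the balance multiplied by the
    probability of default, the exposure-at-default factor, and the
    loss-given-default factor."""
    return balance * pd * ead * lgd

# Hypothetical account: 10 000 balance, 2% PD, 95% EAD, 60% LGD
print(expected_loss(10_000, 0.02, 0.95, 0.60))   # 114.0
```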

4.2.2 Kolmogorov-Smirnov—curve and statistic

We now move on to the former Soviet Union and the double-barrelled Kolmogorov-Smirnov (KS, or K-S) concepts. These are non-parametric means (i.e. not limited by any assumptions about the underlying distribution) of illustrating and measuring the differences between probability distributions, whether sample versus theoretical or two samples (the one- and two-sample tests respectively). Kolmogorov first proposed them in an Italian actuarial journal in 1933, while Smirnov built on the proposition in 1939 and tabulated it in 1948. The curve is a very effective data visualisation tool for assessing the difference between the ECDFs of binary hit/miss outcomes after cases have been ranked according to the probability of a miss (it is sometimes referred to as a ‘fish-eye’ curve). The statistic derived therefrom is then the maximum (supremum) absolute distance between the two ECDFs across all points ‘i’ where it was calculated.

Equation 4-4: KS statistic

$$KS = \sup_i \left| F(\text{Hit})_i - F(\text{Miss})_i \right|$$

Figure 4-2 — Kolmogorov-Smirnov curve

The illustration in Figure 4-2 is typical of that produced in strong-signal, big-data environments. Where signals are weak and data is limited, however, as often happens in psychology and medicine, the lines will be jagged and may criss-cross.
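A minimal sketch of the Equation 4-4 calculation is given below, using made-up scores and hit/miss outcomes; cases are ranked by score and the largest gap between the two ECDFs is reported.

```python
def ks_statistic(scores, outcomes):
    """Kolmogorov-Smirnov statistic per Equation 4-4: the maximum
    absolute gap between the hit and miss ECDFs after ranking by score."""
    pairs = sorted(zip(scores, outcomes))          # rank by score
    hits = sum(outcomes)
    misses = len(outcomes) - hits
    cum_hit = cum_miss = 0
    ks = 0.0
    for _, outcome in pairs:
        if outcome == 1:
            cum_hit += 1
        else:
            cum_miss += 1
        ks = max(ks, abs(cum_hit / hits - cum_miss / misses))
    return ks

# Made-up example: lower scores carry more hits (e.g. defaults)
scores   = [350, 420, 480, 510, 560, 600, 640, 700, 730, 780]
outcomes = [1,   1,   0,   1,   0,   0,   1,   0,   0,   0]
print(round(ks_statistic(scores, outcomes), 3))    # 0.583 for this toy sample
```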


There are ways of determining confidence intervals for the K-S statistic, whether assessing if sample results match a theoretical distribution or have been drawn from the same pool as another sample. These are seldom used in predictive modelling, hence are not covered here.

Andrey Nikolaevich Kolmogorov (1903-’87) was an acclaimed mathematician who made significant contributions in the field of probability theory, while Nikolai Vasilyevich Smirnov (1900-’66) was a much lesser-known mathematician about whom little biographical material can be found, other than that he worked at the V.A. Steklov Institute of Mathematics from 1938, and became head of Mathematical Statistics in 1957.

4.2.3 Gudak—Weight of Evidence (WoE)

Another statistic related to gambling is the “weight of evidence”, which here is not being used in the legalistic sense of the courtroom, but in the sense of raw numbers. The concept was the brainchild of I.J. Good in 1950, who presented it to show how the human mind assesses risk. If you think of most gambling, we are presented with odds, not probabilities—and the human mind has an inherent way of making the odds linear—2, 4, and 8 to 1 odds become 2¹, 2², and 2³ to 2⁰ when simplifying problems to make decisions, like whether or not to cross the street. The formula has several incarnations, but one of the most commonly used and easiest to understand is that in Equation 4-5. Simply stated, it is the natural log of the odds for a subcategory, compared to the same calculated for all.

Equation 4-5: Weight of evidence

$$W_i = \ln\!\left(\frac{p_i}{1-p_i}\right) - \ln\!\left(\frac{p}{1-p}\right)$$

The same formula is now commonly used as a measure of risk, especially of a sub-group relative to the whole: negative values result if the risk is higher than average, positive if lower. This in turn makes it possible to make comparisons between sub-groups. Given that the calculation can be done for every characteristic, it has the further advantage of enabling the calculation of correlations between them.

I.J. (Jack) Good (1916-2009) was an Englishman of Polish-Jewish extraction (b. Isadore Jacob Gudak) who was one of the Bletchley Park cryptanalysts working with Alan Turing during World War II to break the German “Enigma” codes. For whatever reason, Jack has not featured in any of the books or movies that were inspired by those years. He supposedly frustrated Turing with his daytime naps, yet broke one of the codes in a daytime dream. After the war Jack moved into academia, and in 1950 published a book called “Probability and the Weighting of Evidence”, which built on some of the cryptanalyst techniques used at Bletchley to illustrate how the human mind assesses risk.


4.2.4 Kullback—divergence statistic

Another statistic that has a link to Bletchley Park, besides the weight of evidence, is the Kullback-divergence statistic. This provides a measure of the difference in two frequency distributions, and is calculated as per Equation 4-6.

Equation 4-6: Kullback-divergence

$$KD = \sum_{i=1}^{n} \ln\!\left(\frac{p(a_i \mid X)}{p(a_i \mid Y)}\right)\left(p(a_i \mid X) - p(a_i \mid Y)\right)$$

In it, the expression $p(a_i \mid X)$ is the probability that a case has attribute a, only for those cases where X is true.

For example, if an applicant was accepted, what was the probability that he was from Algeria? This is then compared to the same for those not accepted, to see whether or not nationality has any bearing on acceptance. In credit scoring the statistic is seldom referred to by this name. Rather, the name morphs depending upon how it is being applied, as it provides the basis for both the information value and population stability index calculations (see 4.4.2 and 4.4.3 respectively). I had known of these calculations for some time, but was unaware that they shared a common underlying calculation until it was pointed out by a colleague.

Solomon Kullback was an American cryptanalyst who was seconded briefly to Bletchley Park after the USA entered the war, and went on to become the Chief Scientist at the National Security Agency. He published his formula in a book called “Information Theory and Statistics” in 1958, about the same time that credit scoring was first proposed as a concept to businesses by Fair Isaac and Company (FICO). FICO saw the usefulness of the statistic, but gave it new names for their business audience without acknowledging the author or whence it came.
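A minimal sketch of Equation 4-6, using invented attribute distributions for two groups (say accepted ‘X’ and declined ‘Y’):

```python
from math import log

def kullback_divergence(p_x, p_y):
    """Kullback divergence (Equation 4-6) between two discrete
    distributions over the same attributes, supplied as lists of
    proportions that each sum to one."""
    return sum(log(px / py) * (px - py) for px, py in zip(p_x, p_y))

# Invented attribute distributions for accepted (X) and declined (Y) cases
p_accept  = [0.40, 0.35, 0.25]
p_decline = [0.25, 0.35, 0.40]
print(round(kullback_divergence(p_accept, p_decline), 3))   # 0.141
```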


4.3 Borrowings from signal detection theory

It is interesting how gamblers and investors will always remember and brag about their wins, and tend to forget and downplay their losses. Of course, one can just measure the money, but it also helps to have other means in place to track the outcomes, “Did I get it right?” There are a series of tools that have been developed to assess exactly that.

4.3.1 Confusion matrices

Perhaps the most basic tool for assessing predictions is a simple table stating what was predicted, and whether the prediction was right or wrong. The table has been given several labels—‘error’, ‘misclassification’, or ‘confusion’ matrix—the latter because it indicates how much the results are confused, whether the assignment is into two or more groups. It is presented here first because it is one of the easiest concepts to understand. That said, today it is most commonly associated with machine learning.

You will likely have heard the expression “He tested negative!” in medical or criminal investigation series, to indicate the absence of a malady, narcotic, or other marker (think ‘HIV-negative’). Much faith is put in the result, but tests can be wrong. The confusion matrix states all of the possible combinations of test and truth for each of the classifications—true or false; success or failure; dog, cat, or cow (the test and truth outcome definitions can differ, such as bad and good versus default and not default, but that is not the norm). With binary outcomes, there are four possible combinations as per Table 4-2. For predicted and actual, outcomes can be positive or negative (hit or miss) and true or false (right or not)—and counts are tallied for each.

Table 4-2 — Confusion matrix

                   Actual
Predict      Hit               Miss
Hit          TP                FP (Type I)
Miss         FN (Type II)      TN

Wrongly predicted hits and misses are called “Type I” and “Type II” errors respectively. From these a number of different ratios can be calculated, the most important being ‘sensitivity’ and ‘specificity’, being the percentages of actual hits and misses that were correctly identified (also called the true-positive or hit rate, and the true-negative rate). Others are:
• accuracy ((TP + TN)/(P + N))
• positive predictive-value, or precision (TP/(TP + FP))
• negative predictive-value (TN/(TN + FN))
• false-positive, or fall-out, rate (FP/(FP + TN))
• false-negative, or miss, rate (FN/(FN + TP))
• false discovery rate (FP/(FP + TP))

Most tests do not give an outright prediction, but rather a probability, and there will be the issue of where to set the classification cut-off—at least when the test or model is first developed. The most logical is to set it where the number of predicted failures and successes equals the actuals in the development data—i.e. if you wish to do a direct comparison of predicted to actual.


There are instances, though, where the classifiers used to develop and assess the model are not the same (e.g. good/bad versus default/non-default), and others where one wishes to assess the results across a range of possible cut-offs.

Figure 4-3 — Marginal and cumulative gains/costs

One can, of course, also assign gains and costs to the different outcomes, and use them to determine marginal and cumulative gains at different cut-off points—which helps one to move away from the statistical to the practical. That said, in the credit world it can be very difficult to work out what the costs are—especially when dealing with concepts like lifetime customer value. Nonetheless, the exercise can prove very useful when trying to determine cut-offs.
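The sketch below calculates the Table 4-2 ratios from hypothetical counts; both the counts and the function name are invented for illustration.

```python
def confusion_ratios(tp, fp, fn, tn):
    """Common ratios derived from a binary confusion matrix."""
    return {
        "sensitivity (true-positive rate)": tp / (tp + fn),
        "specificity (true-negative rate)": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision (positive predictive value)": tp / (tp + fp),
        "negative predictive value": tn / (tn + fn),
        "fall-out (false-positive rate)": fp / (fp + tn),
        "miss rate (false-negative rate)": fn / (fn + tp),
        "false discovery rate": fp / (fp + tp),
    }

# Hypothetical counts: 80 true positives, 20 false positives,
# 10 false negatives, 890 true negatives
for name, value in confusion_ratios(80, 20, 10, 890).items():
    print(f"{name}: {value:.3f}")
```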

4.3.2 Receiver Operating Characteristic (ROC)

With that, we now jump to signal-detection theory. The true/false, positive/negative framework from earlier also applies here, only it has a much earlier origin (or rather, other disciplines borrowed from signal-detection theory). Positives are the signal of an (enemy?) plane, negatives indicate it is not. The Receiver Operating Characteristic, or ROC curve, was a visual representation of the proportion of true positives to false positives at different thresholds, as shown in Figure 4-4, which varies from the Lorenz curve only in its labelling. Indeed, one wonders why two sets of terminologies are used, as one would have sufficed. In credit scoring, cases are sorted in order of descending risk and the x- and y-axes are goods and bads respectively. Should one model’s ROC curve be dominated (up and left) by another’s, the dominator can be claimed to be the better model—lower errors at any cut-off. Should the curves cross though, one model may be preferred over the other—typically that with more hits where more hits are predicted (i.e. the steeper of the two).

The concept of signal detection theory really sounds obscure until it is put into context. After the Japanese attacked Pearl Harbor in 1941 the Americans started researching new means of detecting enemy aircraft, and needed to measure the efficacy thereof. Surprisingly, no specific author has been associated with it, probably because it resulted from a highly secret war effort. In the 1950s and ‘60s it was adopted in psychology to assess behavioural patterns that were hardly discernible, and could not be explained using existing theories. Today, the ROC is used widely in medicine, engineering, and other fields—including credit scoring.


Figure 4-4 — Receiver Operating Characteristic (ROC curve)

4.3.3 Area under the ROC (AUROC or AUC)

Like the Lorenz curve, the ROC also has a summary statistic, which provides the ratio of ‘B+C’ to ‘A+B+C’. It is, not surprisingly, called the ‘area under the ROC’ or ‘AUROC’ (which sounds uncannily like ‘orc’ from Lord of the Rings) and has a near-perfect relationship with the Gini coefficient of c ≈ (D+1)/2, even though the formula appears totally different.

Equation 4-7: AUROC

$$c = \Pr\left(S_{TP} < S_{TN}\right) + 0.5\,\Pr\left(S_{TP} = S_{TN}\right)$$

Thus, AUROC is the probability that the signal for a true positive will be weaker than that for a true negative, plus 50 percent of the probability that the two signals will be equal. Stated differently, it is the probability that the model/test will rank a randomly-chosen positive higher than a randomly-chosen negative. A value of 50 percent implies the signal is flip-a-coin random, 100 percent god-like perfect classification, and 0 percent somebody is seriously trying to deceive us. In disciplines like psychology where signals are barely discernible, values will be close to 50 percent. There is a very straightforward relationship between AUROC and the Gini coefficient of economics fame, as shown in Equation 4-8, which makes conversion easy should translation be required:

Equation 4-8: Gini vs. AUROC

$$D_{AUC} = \frac{1 + D_{Gini}}{2} \quad\text{and}\quad D_{Gini} = 2\,D_{AUC} - 1$$


4.4 Measures of power and separation

4.4.1 Weight of evidence and univariate points

A formula was presented earlier for the weight of evidence, as it would be applied to probability theory. This is now restated for use in credit scoring. Simply stated, the weight of evidence for any group is the natural log of the success/failure odds for the group, less the same calculated for the pool.

Equation 4-9: Weight of evidence

$$W_i = \ln\!\left(\frac{S_i}{F_i}\right) - \ln\!\left(\frac{S}{F}\right)$$

The primary use of the weights of evidence is for transforming characteristics into variables that will be used as part of the model build; at the end, the coefficients derived will be manipulated into points that will form part of the scorecard. Given that the calculation can be done for every characteristic, it has the further advantage of enabling the calculation of correlations between them. The weights of evidence are also a very useful tool when coarse-classing characteristics—i.e. defining the groups for each. Something that is very useful is to convert the weights of evidence into what I call “univariate points”—i.e. the points that would be assigned to a characteristic if it were the only one used in a model—as per Equation 4-10.

Equation 4-10: Univariate points

$$U_i = W_i \times \frac{40}{\ln(2)}$$

For many this would seem like jumping the gun, as here we are bringing in the concept of “scaling”, which is only covered much later. Suffice it to state, though, that this provides a very useful tool for assessing the risk of any particular group relative to the rest: the numbers are calculated such that 40 points double the odds, making it very easy to compare any one group against the others. Table 4-3 provides an example of the calculations of both the weights of evidence and univariate points.

Table 4-3 — Weight of evidence and univariate points

Group    Success    Failure    Bad%      Odds    WoE       UniPts
A        1 095      415        27,48%    2,6     -1,019    -59
B        3 924      275        6,55%     14,3    0,669     39
C        2 150      381        15,05%    5,6     -0,258    -15
D        2 279      222        8,88%     10,3    0,340     20
Total    9 448      1 293      12,04%    7,3     0,000
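A minimal sketch that reproduces the Table 4-3 weights of evidence and univariate points from the success and failure counts, assuming the 40-points-to-double-the-odds scaling described above:

```python
from math import log

def woe_and_points(successes, failures, points_to_double=40):
    """Weight of evidence (Equation 4-9) and univariate points
    (Equation 4-10) for each group, given success/failure counts."""
    total_s, total_f = sum(successes), sum(failures)
    pool_log_odds = log(total_s / total_f)
    results = []
    for s, f in zip(successes, failures):
        woe = log(s / f) - pool_log_odds
        results.append((round(woe, 3), round(woe * points_to_double / log(2))))
    return results

# Table 4-3 counts for groups A to D
print(woe_and_points([1095, 3924, 2150, 2279], [415, 275, 381, 222]))
# [(-1.019, -59), (0.669, 39), (-0.258, -15), (0.34, 20)]
```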

Jack Good claimed that the smallest odds increment that the human mind can distinguish is 25%, i.e. the difference between odds of 4/4 and 5/4, which translates into just under 13 points. Most credit-risk grading systems, such as those provided by Moody’s, Fitch, and Standard & Poor’s, have a doubling of odds every second grade, so 10 points would be half a grade—which is basically saying that it would be very difficult for any judgmental assessment—no matter how comprehensive—to be more accurate than half a grade.


4.4.2 Information value (IV)

The Kullback divergence statistic was covered in section 4.2.4. The information value is a variation, one of several measures used to assess the predictive power of a characteristic. It is similar to the weight of evidence in that it also uses natural logs, but rather than focussing on a single sub-group it provides a weighted-average summary. In doing so, it uses only the percentage distributions of the successes and failures in each group. The formula is provided in Equation 4-11, along with an example in Table 4-4.

Equation 4-11: Information value

$$IV = \sum_{i=1}^{n} \ln\!\left(\frac{S_i / F_i}{S / F}\right)\left(\frac{S_i}{S} - \frac{F_i}{F}\right)$$

Unfortunately, it is not possible to determine confidence intervals and other things much sought after by statisticians. That said, the information value is probably the best possible tool for assessing characteristics’ predictive power, because it doesn’t require the sub-groups to be sorted in order of risk—which in some instances would make sense and in others not.

Table 4-4 — Information value calculation

Group    Success    Failure    Success %    Failure %    a = S%/F%    b = S% − F%    IV = ln(a)×b
A        1 095      415        11,59%       32,10%       0,361        -0,205         0,209
B        3 924      275        41,53%       21,27%       1,953        0,203          0,136
C        2 150      381        22,76%       29,47%       0,772        -0,067         0,017
D        2 279      222        24,12%       17,17%       1,405        0,069          0,024
Total    9 448      1 293      100,00%      100,00%                                  0,385
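A minimal sketch of Equation 4-11 that reproduces the Table 4-4 total from the group counts:

```python
from math import log

def information_value(successes, failures):
    """Information value per Equation 4-11, from success and failure
    counts per group."""
    total_s, total_f = sum(successes), sum(failures)
    iv = 0.0
    for s, f in zip(successes, failures):
        ps, pf = s / total_s, f / total_f      # column percentages
        iv += log(ps / pf) * (ps - pf)
    return iv

# Table 4-4 counts for groups A to D
print(round(information_value([1095, 3924, 2150, 2279],
                              [415, 275, 381, 222]), 3))   # 0.385, as in Table 4-4
```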

The question that arises with most statistics is what benchmarks should be used when determining whether a characteristic is predictive or not. These shouldn’t be cast in stone, but typically a characteristic is considered unpredictive if the value is less than 0.02, weak if below 0.10, fair up to 0.20, strong up to 0.50, very strong up to 1.00, and over-predictive if greater than 1.00. These values apply to originations developments, and the thresholds may vary elsewhere. And over-predictive… that means that it is so powerful it is dangerous… like the entire model may be dominated by a single characteristic, so be careful.

Figure 4-5 — Information value – how powerful?


4.4.3 Population stability index (PSI)

Next on the Bletchley Park list is the population stability index. Rather than assessing the differences between the frequency distributions of successes and failures, though, it uses the same formula to instead measure how much a frequency distribution has changed over time—i.e. the “population shift”. The baseline is usually all subjects in the training sample that influenced the development (see Table 9-3, which excludes disqualifications), as compared to those with the same outcomes at later stages—out-of-time, recent, pre-implementation, post-implementation. The formula is that shown in Equation 4-12, and the table… well, imagine the Table 4-4 example with training and recent replacing success and failure respectively.

Equation 4-12: Population stability index

$$PSI = \sum_{i=1}^{n} \ln\!\left(\frac{T_i / T}{R_i / R}\right)\left(\frac{T_i}{T} - \frac{R_i}{R}\right)$$

While information values are applied almost exclusively to predictors, the stability index is applied to both predictor and predicted, model input and output, characteristic and score. And even though model outputs may seem to indicate plain sailing, a review of the predictors can indicate sea changes beneath the surface. A traffic-light approach is generally used for the benchmarks: green, for values to 0.10, meaning little or no difference; yellow, from 0.10 to 0.25, some change but not serious; and red, 0.25 upwards, meaning change significant enough to warrant some sleuthing. Violating these standard thresholds does not mean that a predictor should not be used or the model is invalid, only that some investigation may be required. If values over 1.00 are encountered, however, the situation is serious—resulting either from massive changes to the population or to the processes.

Figure 4-6 — Population stability index – traffic lights

Figure 4-6 illustrates a population shift, unfortunately without the benefit of colour, as that would push up the printing costs five-fold (seriously!). The red and yellow thresholds only approximate the bounds, as the distributions can take on a multitude of shapes as the mean, standard deviation, skewness, kurtosis, etc. vary.
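A minimal sketch of Equation 4-12 with the traffic-light bands applied; the training and recent score-band counts are invented.

```python
from math import log

def population_stability_index(train_counts, recent_counts):
    """PSI per Equation 4-12: the same divergence formula as the
    information value, applied to training versus recent frequencies."""
    t_total, r_total = sum(train_counts), sum(recent_counts)
    return sum(log((t / t_total) / (r / r_total)) * (t / t_total - r / r_total)
               for t, r in zip(train_counts, recent_counts))

def traffic_light(psi):
    """Rough benchmark bands described in the text."""
    if psi < 0.10:
        return "green - little or no difference"
    if psi < 0.25:
        return "yellow - some change, not serious"
    return "red - investigate"

# Invented score-band counts at development versus a recent month
train  = [200, 300, 250, 150, 100]
recent = [150, 280, 270, 180, 120]
psi = population_stability_index(train, recent)
print(round(psi, 3), traffic_light(psi))   # 0.026 green - little or no difference
```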


4.4.4 Lorenz and Gini

The Lorenz curve used in economics has found another role in predictive analytics, the difference being that rather than sorting by shekels, the cases are sorted by their estimates. For continuous outcomes, estimates are the values being predicted—X is the estimate and Y the actual value. For binary outcomes, estimates are probabilities—X are success and Y failure counts. In credit scoring the Y- and X-axes are usually defaults and non-defaults, however defined, the exception being the Cumulative Accuracy Profile where X is the total case count. Like the Lorenz curve, this Gini coefficient calculation has been hijacked. In predictive statistics, ‘failure’, ‘bad’, ‘default’, or ‘hit’ counts are substituted for Y and their ‘non-’ counterparts for X. Values can theoretically range from -1 to 1, from perfectly wrong to perfectly right passing through flip-a-coin random 0 along the way. Negative values are possible, but very rare except when testing new models in very weak-signal environments like psychology.

Equation 4-13: Gini coefficient (predictive)

$$D = 1 - F(G)_1\,F(B)_1 - \sum_{i=2}^{n}\left[F(G)_i - F(G)_{i-1}\right]\left[F(B)_i + F(B)_{i-1}\right]$$

When assessing credit-risk scores, application scores will typically have values ranging from 30 to 65 percent, behavioural scores from 40 to 80 percent, and collections scores from 45 to 55 percent. These numbers are not cast in stone, and should only be used as loose guidelines. In data-poor environments even lower numbers will come through. The greatest differences will arise from:
• the depth, breadth, and appropriateness of available data;
• the homogeneity/heterogeneity of the population being assessed.

4.4.4.1 Gini variance

When statistics are calculated on samples, they can only provide estimates of the true values. Hence, we speak of confidence intervals—e.g. we can say with 95 percent confidence that X lies in the interval between A and B—which are calculated based upon the ‘standard deviation’, which is the square root of the ‘variance’. In some cases these calculations are quite simple, but in others they can descend into monstrous levels of complexity. There are formulae for the AUROC variance, but most are beyond the comprehension of those lacking IQs in the upper stratosphere. In contrast, there are a few formulae for the Gini variance that were summarised in a brief article by Gerard Scallan (2007). The simplest of these are easy to apply, making reference only to the total values of X and Y, or total counts of successes and failures—but they only provide an upper bound and hence tend to overestimate. In the formulae below, D is the Gini coefficient, N is the number of cases, and $N_G$ and $N_B$ the number of goods and bads, or successes and failures, respectively.

The earliest was provided by van Dantzig (1951), the exact reference details for which I am unfortunately unable to find, and is extremely elegant in its simplicity.

Equation 4-14: var(D)—van Dantzig

$$\mathrm{Var}(D) \le \frac{1 - D^2}{\min\!\left(N_G, N_B\right)}$$

The other formula was provided by Bamber (1975), which is only slightly more demanding, and is the one most commonly used by businesses. In most instances it is sufficient, and probably only overestimates the value by ten percent.

Equation 4-15: var(D)—Bamber

$$\mathrm{Var}(D) \le \frac{\left(2N_G + 1\right)\left(1 - D^2\right) - \left(N_G - N_B\right)\left(1 - D\right)^2}{3\,N_G N_B}$$

The more precise and complicated (in near-biblical proportions) formula takes into consideration the underlying score distribution used to derive the Gini, and hence is a lot more difficult to implement and calculate. It was published in 2003 by Engelmann, Hayden, and Tasche under the auspices of the Deutsche Bundesbank, and Gerard Scallan presented it with some adaptations, if only to make it more accessible.

Equation 4-16: var(D)—Engelmann et al.

$$\mathrm{Var}(D) = \frac{(N_B - 1)\sum_S P_G^S\left(1 - 2F_B^{S-1}\right)^2 + (N_G - 1)\sum_S P_B^S\left(1 - 2F_G^{S-1}\right)^2 + \sum_S P_G^S P_B^S - \left(N_G + N_B - 1\right)D^2}{(N_G - 1)(N_B - 1)}$$

The intention was that the formula would be applied to score distributions, with the cases sorted in descending order of risk (increasing score). Here $P_\cdot^S$ is the percentage of the relevant group (goods or bads) scoring exactly S, and $F_\cdot^{S-1}$ the cumulative percentage scoring less than S. The confidence interval then depends upon the level of certainty that one wishes to have. The most common is 95 percent, in which instance the formula would be:

Equation 4-17: Gini confidence interval

$$\tilde{D} = D \pm 1.96 \times \sqrt{\mathrm{Var}(D)}$$
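A minimal sketch of a Gini confidence interval, using only the van Dantzig upper bound (Equation 4-14) inside Equation 4-17, since it needs nothing more than the Gini and the two class counts; the inputs are invented and, being based on an upper bound, the resulting interval is conservative.

```python
from math import sqrt

def gini_confidence_interval(gini, n_good, n_bad, z=1.96):
    """Conservative confidence interval for a Gini coefficient, using
    the van Dantzig upper bound on its variance (Equation 4-14) inside
    the Equation 4-17 interval."""
    variance_bound = (1 - gini ** 2) / min(n_good, n_bad)
    half_width = z * sqrt(variance_bound)
    return gini - half_width, gini + half_width

# Invented example: Gini of 55% measured on 9 448 goods and 1 293 bads
low, high = gini_confidence_interval(0.55, 9448, 1293)
print(f"{low:.3f} to {high:.3f}")   # roughly 0.504 to 0.596
```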

4.4.5 Cumulative accuracy profile and accuracy ratio

A variation on this is the ‘cumulative accuracy profile’, or CAP curve, which differs from the ROC curve only in that the x-axis covers not just the misses, but all cases. It also has its own summary statistic called the ‘accuracy ratio’ (AR), calculated in the same way as the AUROC. These were first presented by Sobehart et al. in 2000, who were employees of Moody’s Investors Service—which focuses more on wholesale credit ratings, not retail. Nonetheless, the concept is still applicable. Note that most people who are analytically inclined are likely to prefer the ROC and AUROC, while business people without that blessing (if you can call it that) will likely prefer the CAP and AR.


4.4.6 Divergence statistic

A commonly-accepted statistic is the divergence statistic, which is much like a cybersquatter—divergence is measured by many of the statistics presented here, only this one hogged the name. One distribution diverges from another, and we want to quantify the distance as D.

Equation 4-18: Divergence statistic

$$D = \frac{\left(\mu_G - \mu_B\right)^2}{\left(\sigma_G^2 + \sigma_B^2\right)/2}$$

where $\mu$ is the average value and $\sigma^2$ the value’s variance for each group. If both groups are normally distributed and have the same variance, the result equals the Kullback-divergence measure (see Section 4.2.4). The results of the formula will vary depending upon where it is used. In credit scoring the most common application is to assess the value of a scorecard, where the mean and standard deviation are calculated using the scores. If the models are working well, the resulting values should be between 0.5 and 2.5 for an application model, and between 1.0 and 3.5 for a behavioural model.

4.4.7 Deviance odds – power and accuracy

Many of the statistics presented up until now measure a model’s predictive ‘power’, but do nothing with respect to its predictive ‘accuracy’. The former assesses ranking ability, the latter the ability to provide the correct prediction overall for the bad/default/positive/hit rate. As a general rule, power is much more important than accuracy, as accuracy can be attained through calibration once the model has been developed. Nonetheless, there are times when we wish to assess both power and accuracy—and very few measures exist for the latter. The likelihood function was already covered in section 3.1.2. As it transpires, it is one of the few measures that can be split into measures of both power and accuracy. Almost everything relating to log likelihood and deviance was presented in terms that can be found in the statistical literature. Now we are going out on a limb… you will not find this elsewhere. It was covered in the Credit Scoring Toolkit, but at that stage the relationship with the model deviance was neither well understood nor explained. Basically, the deviance associated with the log likelihood can be converted into an average odds ratio:

Equation 4-19: Deviance odds ratio

$$\psi = \exp\!\left(\frac{D}{n}\right) = \exp\!\left(\frac{2L}{n}\right)$$

The trident symbol (the Greek letter ‘psi’) has been hijacked, only because nothing else seemed appropriate. The same formula is applied not only to the fitted model for a particular dataset, but also to both the naïve observed and naïve estimated totals. These can then be used to assess both the power and accuracy of a model as percentages:


Equation 4-20: Power

$$\text{Power} = \frac{\psi_{\hat{E}} - \hat{\psi}}{\psi_{\hat{E}} - 1}$$

Equation 4-21: Accuracy

$$\text{Accuracy} = 1 - \frac{\psi_{\hat{E}} - \psi_O}{\psi_{\hat{E}}}$$

where:
$\hat{\psi}$ = ψ for the set of estimates generated by a model;
$\psi_{\hat{E}}$ = ψ for the naïve total estimate for that dataset; and
$\psi_O$ = ψ for the observed or actual values for the dataset.

Both the total observed and estimated values are used to calculate naïve odds. For a saturated model (i.e. perfect prediction) both power and accuracy will be 100 percent. For a naïve model on a development dataset where the totals of both estimated and observed are equal, the accuracy will be 100 percent but the power 0 percent. An example of the likelihood and deviance calculations was provided in Table 3-2, for which the resulting values are:

Model       L       D        ψ       Power    Accr'y
Test        2,622   5,244    1,791   73,2%    100,0%
Naïve Est   6,183   12,365   3,951
Naïve Obs   6,183   12,365   3,951

In this instance, the accuracy is 100 percent because the model was fitted to that dataset. But what happens when the same model is applied to a different dataset? Both the power and the accuracy can be affected. An example is provided in Table 4-5, where only one case (fifth row) is reclassified. The power drops from 73.2 to 68.4 percent, and the accuracy from 100 to 95 percent.

Table 4-5 – Deviance – power & accuracy

i       y    µ       ll, L    d        d², D
1       1    0,90    0,105    0,459    0,211
2       0    0,10    0,105    0,459    0,211
3       1    0,80    0,223    0,668    0,446
4       1    0,70    0,357    0,845    0,713
5       1    0,45    0,799    1,264    1,597
6       0    0,35    0,431    0,928    0,862
7       1    0,70    0,357    0,845    0,713
8       0    0,20    0,223    0,668    0,446
9       1    0,80    0,223    0,668    0,446
Total   6    5,00    2,823             5,646

Model               L       D        ψ
Model / Test        2,823   5,646    1,873
Naïve – Estimates   5,960   11,919   3,760
Naïve – Actuals     5,729   11,457   3,572
Power: 68,38%   Accuracy: 95,00%
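A minimal sketch that reproduces the Table 4-5 figures from the y and µ columns, assuming the ψ, power, and accuracy definitions in Equations 4-19 to 4-21:

```python
from math import exp, log

def psi_odds(actuals, estimates):
    """Average deviance odds ratio, psi = exp(D/n), where the deviance D
    is twice the summed negative log-likelihood (Equation 4-19)."""
    n = len(actuals)
    neg_log_lik = -sum(y * log(m) + (1 - y) * log(1 - m)
                       for y, m in zip(actuals, estimates))
    return exp(2 * neg_log_lik / n)

y  = [1, 0, 1, 1, 1, 0, 1, 0, 1]                              # observed outcomes
mu = [0.90, 0.10, 0.80, 0.70, 0.45, 0.35, 0.70, 0.20, 0.80]   # model estimates

psi_model   = psi_odds(y, mu)                     # fitted model
naive_est   = [sum(mu) / len(mu)] * len(y)        # naive estimated rate
naive_obs   = [sum(y) / len(y)] * len(y)          # naive observed rate
psi_naive_e = psi_odds(y, naive_est)
psi_naive_o = psi_odds(y, naive_obs)

power    = (psi_naive_e - psi_model) / (psi_naive_e - 1)      # Equation 4-20
accuracy = 1 - (psi_naive_e - psi_naive_o) / psi_naive_e      # Equation 4-21
print(round(psi_model, 3), round(power, 4), round(accuracy, 4))
# 1.873 0.6838 0.95  (matching Table 4-5)
```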

Whether or not that loss is tolerable will depend upon the circumstances. The loss of power may be acceptable, as 68 percent is still a strong model. The change in accuracy will only be an issue if the model is being used to predict overall performance (as opposed to just providing a rank order). If so, a 20 percent variation (i.e. 6 versus 5) would suggest that some action is necessary. If anything, the 95 percent probably understates the extent of the problem, and the maximum tolerance should be set higher.
