A class of logistic-type discriminant functions

By SHINTO EGUCHI
Institute of Statistical Mathematics, Minami-azabu, Tokyo 106-8569, Japan
[email protected]

JOHN COPAS
Department of Statistics, University of Warwick, Coventry CV4 7AL, U.K.
[email protected]
Summary

In two-group discriminant analysis, the Neyman-Pearson Lemma establishes that the ROC curve for an arbitrary linear function is everywhere below the ROC curve for the true likelihood ratio. The weighted area between these two curves can be used as a risk function for finding good discriminant functions. The weight function corresponds to the objective of the analysis, for example to minimize the expected cost of misclassification, or to maximize the area under the ROC curve. The resulting discriminant functions can be estimated by iteratively reweighted logistic regression. We investigate some asymptotic properties in the “near-logistic” setting, where we assume the covariates have been chosen such that a linear function gives a reasonable (but not necessarily exact) approximation to the true log likelihood ratio. Some examples are discussed, including a study of medical diagnosis in breast cytology.
Some key words: Discriminant analysis; Logistic regression; Neyman-Pearson lemma; ROC curves.
1. Introduction

Starting with the pioneering work of Fisher (1936), discriminant analysis has become one of the most important techniques in applied statistics. In the simplest case, we have two populations which we label y = 0 and y = 1, and a vector of covariates x. In linear discriminant analysis, we aim to find a linear function s = β'x which can be used to discriminate between the two populations. A good discriminant score s is one which takes large values when y = 1 and small values when y = 0. The basic idea is to choose β such that the two conditional distributions of s given y are, in some sense, separated as much as possible. The classical Linear Discriminant Function (LDF) introduced by Fisher achieves this separation by maximizing the (standardized) difference between the means of these two conditional distributions. Alternatively, we could minimize the risk based on a penalty function which increasingly penalizes large values of s when y = 0 and small values of s when y = 1. If U(s) and V(s) are two monotonically increasing functions of s, such a risk function could be written in the form

D(β) = −π1 E1{U(β'x)} + π0 E0{V(β'x)}     (1)
     = E{−yU(β'x) + (1 − y)V(β'x)},     (2)
where πy is the probability of y, Ey denotes conditional expectation given y, and E is the usual notation for (marginal) expectation. This paper discusses the class of discriminant functions given by minimizing the risk function (1), for suitable choices of U and V. In practice, the choice of discriminant function should depend on the use to which it is to be put since, at least in principle, this should determine the most appropriate penalty function. Three examples, to be discussed in more detail later, are

• Hit-Rate. Here we want to find β and a threshold u so that we can make the decision ŷ = 1 if β'x ≥ u and ŷ = 0 if β'x < u. The object is to maximize the “hit-rate”

pr(ŷ = y) = π1 pr(β'x ≥ u|y = 1) + π0 pr(β'x < u|y = 0).     (3)
• Credit Scoring. Here we accept applicants for credit only if β'x ≥ u. A simple cost model envisages, for each accepted applicant, a profit a1 if y = 1 and a loss a0 if y = 0. The expected profit is then

π1 a1 pr(β'x ≥ u|y = 1) − π0 a0 pr(β'x ≥ u|y = 0).     (4)
• Screening. In medical screening for a particular disease we could define a high-risk group to be those members of the population with β'x ≥ u. We need to balance the rates of false positives and false negatives, given by FPu and FNu with

FPu = pr(β'x ≥ u|y = 0) and FNu = pr(β'x < u|y = 1).

These are summarized by the receiver operating characteristic (ROC) curve, the graph of 1 − FNu against FPu. The success of the screening score can be measured by the area under the ROC curve

A = ∫_{u=+∞}^{−∞} (1 − FNu) d[FPu]     (5)
(the limits are reversed because FPu is a decreasing function of u). The closer A is to 1 the better is the screening score.

We show in Section 2·4 below that the problems of maximizing (3), (4) and (5) are all special cases of minimizing D(β) in (1).

To establish the notation to be used throughout the paper, let gy(x) be the conditional probability density function of x given y, and let g(x) be the overall (marginal) probability density function given by g(x) = π0 g0(x) + π1 g1(x). For simplicity of notation we assume that the components of x are all continuous random variables, with the obvious change of notation and formulae if some or all of the components of x are discrete. We define p(x) = pr(y = 1|x), so that the log-odds for y given x are

λ(x) = log{p(x)/(1 − p(x))} = log(π1/π0) + log{g1(x)/g0(x)}.     (6)
Thus, apart from the constant log(π1/π0), λ(x) is the log likelihood ratio for discriminating between g0(x) and g1(x).

Most statistical methods are based on models, and here there are two ways of modelling the joint distribution of y and x. Classical discriminant analysis, using the so-called sampling paradigm, is based on modelling the two separate distributions g0(x) and g1(x). For example, if these distributions are multivariate normal with the same variance-covariance matrix, then λ(x) is just a rescaled version of the LDF. The other approach, the so-called diagnostic paradigm, is to model p(x) but to leave g(x) free to adapt to the xs which happen to be observed in the data. We adopt the second approach here, and attach special importance to the linear logistic model

λ(x) = β'x.     (7)
Equivalently, the linear logistic model asserts that

p(x) = pL(β'x) = e^{β'x}/(1 + e^{β'x}).     (8)
For (7) and (8) to make sense we need to include an intercept term in β'x, so we will assume that there are m covariates with x = (x0, x1, ..., xm)' and x0 = 1.

We will impose an important restriction on D(β) in (1), namely that if the logistic model in (7) is correct then D(β) is minimised at the true value of β. This consistency with the logistic model explains the term “logistic-type” in the title of this paper. We show in Section 2·1 below that this imposes a rather natural restriction on the choice of functions U and V. We emphasise that the method we propose is essentially nonparametric — the (U, V) functions determine the risk function we are minimizing and not any particular parametric model. Our view is that the logistic model is so central to this area that any sensible method for estimating a risk score β'x should have the property that it gives the right answer if the logistic model happens to be true.

The rest of this paper is organised as follows. Section 2 discusses the risk function D(β) in more detail. We explain the close connection with the Neyman-Pearson Lemma (Section 2·2) and with ROC curves (Section 2·3). Section 2·4 looks at some special cases as mentioned earlier. Section 3 looks at how a sample of values of y and x can be used to estimate β by minimizing the sample analogue of (1). The estimates are essentially adaptively re-weighted logistic regression, and so can be fitted using standard software. If the model (7) is true then unweighted logistic regression is optimal (Section 3·1). But in practice we can never be sure that this or any other model is true — a practical procedure allowing for model misspecification is suggested in Section 3·2. It is well known that crude (retrospective) error rates in discriminant analysis can be very misleading as measures of performance, and that cross-validation or “hold-one-out” error rates provide a more reliable guide. Section 3·3 gives a simple approximation to the cross-validation risk for D(β). Section 4 tests out the method by estimating β in two examples, simulated to have different patterns of misspecification of the logistic model. The resulting values of β, and hence the resulting discriminant functions, can be quite different from standard logistic regression, and can depend in a non-trivial way on the objective of the analysis (on the choices of U and V). Sections 3 and 4 assume that the data are a random sample from the (y, x) population over which the risk function is defined. Section 5 extends the method to other sampling schemes, such as case-control sampling in epidemiology. We show that the estimates are still reweighted logistic regression, but with weights now depending on the sampling scheme. Using the terminology of McCullagh & Nelder (1989), an offset on the score β'x may also be involved. Section 6 looks at data from a study of statistical diagnosis in breast cytology, and discusses two versions of (1), firstly to maximize the area under the ROC curve, and secondly to find the screening score which minimizes the rate of false negatives for a suitably small rate of false positives. Some concluding comments are given in Section 7.
There is a vast literature on discriminant analysis and on logistic regression, the two basic methodologies behind our paper. Articles by Hand (on Discriminant Analysis) and by Lemeshow & Hosmer (on Logistic Regression) in the Encyclopedia of Biostatistics (Armitage & Colton, 1998) are good entries into the literature on these topics. There are also several good texts, for instance Lachenbruch (1975), McLachlan (1992), Hand (1981) and Collett (1992). For ROC curves and the assessment of screening methods, the text by Green & Swets (1988), the review articles by Metz (1978) or Zweig & Campbell (1993), or the more technical paper by Girling (2000), could be consulted. The use of cross validation in discriminant analysis is reviewed by Stone (1978) and Hand (1997), amongst many others. Some statistical aspects of credit scoring are discussed in Hand & Henley (1997).

2. The Dw risk function

2·1. Consistency under the logistic model

From (1), and using the notation in Section 1,

D(β) = −∫ U(β'x) p(x) g(x) dx + ∫ V(β'x) {1 − p(x)} g(x) dx

and so

∂D(β)/∂β = −∫ x {∂U(β'x)/∂u} p(x) g(x) dx + ∫ x {∂V(β'x)/∂u} {1 − p(x)} g(x) dx
          = −E[ x {∂U(s)/∂u + ∂V(s)/∂u} { p(x) − (∂V(s)/∂u) / (∂U(s)/∂u + ∂V(s)/∂u) } ],     (9)
where s = β'x. Now assume that the logistic model is true so that p(x) is equal to (8). Then if (9) is zero for all g(x), the last factor must be zero for all x, which implies that

∂V(u)/∂u = e^u ∂U(u)/∂u

for all u. Thus if we define w(u) = ∂U(u)/∂u then ∂V(u)/∂u = e^u w(u), and so we can write, up to arbitrary constants,
U(u) = ∫_{−∞}^{u} w(s) ds and V(u) = ∫_{−∞}^{u} e^s w(s) ds.     (10)
Equation (9) then simplifies to

∂D(β)/∂β = −E[ x W(β'x) {p(x) − pL(β'x)} ],     (11)
where W(u) = w(u)(1 + e^u). Since U is a non-decreasing function, w is non-negative and can be thought of as a weight function. Taking different weight functions w in (10) defines our class of logistically consistent linear discriminant functions.

2·2. Restating the Neyman-Pearson Lemma

There is a close connection between D(β), with U and V defined by equations (10), and the optimality of likelihood ratios established by the Neyman-Pearson Lemma. Following the usual proof of the Lemma, define the two regions A = {x : λ(x) > u} and B = {x : β'x > u} for some fixed value of u. Then in A − B (the set of xs in A but not in B), λ(x) > u and so, using (6),

π0 e^u ∫_{A−B} g0(x) dx ≤ π0 ∫_{A−B} e^{λ(x)} g0(x) dx = π1 ∫_{A−B} g1(x) dx.     (12)

Similarly,

π0 e^u ∫_{B−A} g0(x) dx ≥ π0 ∫_{B−A} e^{λ(x)} g0(x) dx = π1 ∫_{B−A} g1(x) dx.     (13)

Subtract (13) from (12) to give

π0 e^u { ∫_{A−B} g0(x) dx − ∫_{B−A} g0(x) dx } ≤ π1 { ∫_{A−B} g1(x) dx − ∫_{B−A} g1(x) dx }.

But

∫_{A−B} gy(x) dx − ∫_{B−A} gy(x) dx = ∫_A gy(x) dx − ∫_B gy(x) dx = pry{λ(x) > u} − pry(β'x > u),

where the suffix y on pr denotes conditional probability given y. Hence if we define

d(u) = π1 [pr1{λ(x) > u} − pr1(β'x > u)] − π0 e^u [pr0{λ(x) > u} − pr0(β'x > u)],     (14)
then we have proved that d(u) ≥ 0 for all u. Essentially, the non-negativity of (14) is a restatement of the Neyman-Pearson Lemma. It follows that if we define, for some non-negative weight function w(u),

Dw(λ, β) = ∫_{−∞}^{∞} d(u) w(u) du,     (15)
then Dw(λ, β) ≥ 0, with equality only when λ(x) = β'x, i.e. when the logistic model (7) is correct.

Now let gy*(λ) be the conditional probability density function of λ(x) given y. Then

∫_{−∞}^{∞} pry{λ(x) > u} w(u) du = ∫_{u=−∞}^{∞} w(u) ∫_{λ=u}^{∞} gy*(λ) dλ du
                                = ∫_{λ=−∞}^{∞} gy*(λ) ∫_{u=−∞}^{λ} w(u) du dλ
                                = Ey{ ∫_{−∞}^{λ(x)} w(u) du }.
Similar calculations for the other terms in (14) and (15) give

Dw(λ, β) = {π1 E1 U(λ(x)) − π0 E0 V(λ(x))} − {π1 E1 U(β'x) − π0 E0 V(β'x)},     (16)
where U and V are given by (10). If the logistic model is correct, we can find the true value of β by minimizing Dw(λ, β) over β. More generally, minimizing Dw over β gives the linear discriminant function β'x which is as close as possible to the true log-odds function. Comparing (1) with (16), we see that minimizing Dw over β is exactly equivalent to minimizing D(β).

2·3. Dw and ROC

For any continuous statistic s = s(x), the ROC curve, as mentioned in the Introduction, is the plot of pr1(s ≥ u) against pr0(s ≥ u) as u ranges over all possible values. ROC is a continuous curve in the unit square joining the point (0, 0) at u = +∞ to the point (1, 1) at u = −∞. If we imagine testing the null hypothesis that y = 0 against the alternative hypothesis that y = 1, using the critical region {x : s(x) > u}, then the coordinates of ROC are the Type I error (x-coordinate) and one minus the Type II error (y-coordinate). The standard form of the Neyman-Pearson Lemma therefore establishes that the true log-odds function λ is the statistic which maximizes the height of ROC all the way along the curve. Fig. 1 illustrates ROC for the statistics λ(x) (solid curve) and β'x (dotted curve). If the logistic model fits reasonably well, β'x will be a reasonable approximation to λ(x), and the two ROC curves will be close together.

The slope of the ROC curve for λ is just the likelihood ratio, since

{∂ pr1(λ > u)/∂u} / {∂ pr0(λ > u)/∂u} = {∫_{λ(x)=u} g1(x) dx} / {∫_{λ(x)=u} g0(x) dx} = g1*(u)/g0*(u) = (π0/π1) e^u     (17)

from (6). Now consider the two points A and B in Fig. 1. These are vertically above each other, so for some u and u′

pr0{λ(x) > u′} = pr0(β'x > u).
Using (17), the vertical distance between A and B is

pr1{λ(x) > u′} − pr1(β'x > u)
≈ pr1{λ(x) > u} + (π0/π1) e^u [pr0{λ(x) > u′} − pr0{λ(x) > u}] − pr1(β'x > u)
= (1/π1) d(u) ≥ 0,

where d(u) was defined in (14). The approximation here assumes that u and u′ are close together, or that β'x is a good approximation to λ(x). This shows that, up to a constant factor, d(u) is just the vertical distance between the ROC curves for λ(x) and for β'x. Hence Dw defined in (15) can be thought of as the weighted area separating the two curves. The Dw method, minimizing Dw(λ, β) (or minimizing D(β)) over β, is finding the linear statistic β'x whose ROC is as close as possible to that of the true (and unknown) log-odds statistic. The role of w(u) is to specify the weight to be given to each point on the curve.

2·4. Some special cases of w(u) under the logistic model

In this section we assume that the logistic model (7) is true, and look at some examples of the function w(u). These correspond to familiar, and potentially useful, problems in discriminant analysis.
Example 1. w(u) = w0(u) = 1/(1 + e^u). Here the indefinite integrals of w(u) and e^u w(u) are, up to arbitrary constants,

U(u) = log{e^u/(1 + e^u)} and V(u) = −log{1/(1 + e^u)}.

Thus D(β) in (2) can be written as an arbitrary constant minus

E{y log pL(β'x) + (1 − y) log(1 − pL(β'x))}.     (18)
The empirical version of (18) is just the log likelihood function for logistic regression of y on x. Thus the Dw method includes standard logistic maximum likelihood as the special case when w = w0 .
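As an informal numerical check of the construction (10) for this choice of w (our own illustration, not part of the original development), the two integrals can be computed directly and compared with the closed forms above, up to additive constants:

import numpy as np
from scipy.integrate import quad

w0 = lambda s: 1.0 / (1.0 + np.exp(s))    # w0(u) of Example 1
pL = lambda s: 1.0 / (1.0 + np.exp(-s))   # the logistic function pL(u)

# U and V from (10); a finite lower limit stands in for -infinity
U = lambda u: quad(w0, -40.0, u)[0]
V = lambda u: quad(lambda s: np.exp(s) * w0(s), -40.0, u)[0]

# Up to additive constants, U(u) = log pL(u) and V(u) = -log{1 - pL(u)},
# so differences of each pair should agree:
a, b = -1.0, 2.0
print(U(b) - U(a), np.log(pL(b)) - np.log(pL(a)))
print(V(b) - V(a), -np.log(1.0 - pL(b)) + np.log(1.0 - pL(a)))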
Example 2. w(u) = δ(u − u0). Here the weight function is the Dirac delta function at u = u0 for some fixed value u0. In this case

U(u) = H(u − u0) and V(u) = e^{u0} H(u − u0),

where H is the Heaviside function: H(u) = 1 if u ≥ 0 and zero otherwise. Then

D(β) = −π1 pr1(β'x ≥ u0) + π0 e^{u0} pr0(β'x ≥ u0).     (19)
Now consider the classification rule ŷ = 1 if β'x ≥ u and ŷ = 0 otherwise, for some threshold u, and let the cost of classifying y as ŷ be cŷy. Any sensible cost matrix will give a higher cost to a wrong decision than to a right decision, and so we can assume that c10 > c00 and c01 > c11. The expected cost of ŷ is

π0 {c00 pr0(β'x < u) + c10 pr0(β'x ≥ u)} + π1 {c01 pr1(β'x < u) + c11 pr1(β'x ≥ u)}
= c00 π0 + c01 π1 + (c01 − c11) C(u),     (20)

where

C(u) = −π1 pr1(β'x ≥ u) + π0 e^{u0} pr0(β'x ≥ u)     (21)

and

u0 = log{(c10 − c00)/(c01 − c11)}.     (22)
It is easy to show that (20) is least when u = u0, since using (17)

dC(u)/du = π1 g1*(u) − π0 e^{u0} g0*(u) = π0 g0*(u)(e^u − e^{u0}),

which is zero when u = u0. Comparing (19) and (21) we see that D(β) is just equal to the value of C(u) at the optimum threshold. Hence minimizing expected cost is equivalent to minimizing Dw in the special case when w(u) = δ(u − u0) and u0 is taken to be (22). If c01 = c10 = 1 and c00 = c11 = 0, so that u0 = 0, this is just the hit-rate example mentioned in Section 1. The credit scoring example mentioned in Section 1 is given by c01 = c00 = 0, c11 = −a1 and c10 = a0, so that in this case u0 is the log of the loss-to-profit ratio, u0 = log(a0/a1).

Another application of this weight function is when u0 is determined by

pr0(β'x ≥ u0) = α0,     (23)

for a given value of α0. From Section 2·3, α0 is the horizontal coordinate of the ROC curve, and so minimizing Dw is maximizing the height of the curve at this point. In screening problems, we are finding the linear score which gives the smallest false negative rate whilst controlling the false positive rate at the predetermined level α0.
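In data terms, the threshold defined by (23) is just an empirical quantile of the scores in the y = 0 group. A minimal sketch (ours; the function and variable names are our own):

import numpy as np

def screening_threshold(scores, y, alpha0):
    """Empirical u0 solving pr0(score >= u0) = alpha0, as in (23)."""
    return np.quantile(scores[y == 0], 1.0 - alpha0)

# e.g. alpha0 = 0.01 leaves about 1% of the y = 0 group above u0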
Example 3. w(u) = g0*(u). This weight function is the conditional probability density function of β'x given y = 0. In this case, if Gy*(u) is the conditional cumulative probability function of β'x given y, we have U(u) = G0*(u) and

V(u) = ∫_{−∞}^{u} e^s g0*(s) ds = (π1/π0) ∫_{−∞}^{u} g1*(s) ds = (π1/π0) G1*(u)     (24)

from (17). Thus in this case

D(β) = −π1 E1 G0*(β'x) + π1 E0 G1*(β'x).

But

E1 G0*(β'x) = ∫_{−∞}^{∞} G0*(u) dG1*(u) = A,     (25)

the area under the ROC curve, and

E0 G1*(β'x) = ∫_{−∞}^{∞} G1*(u) dG0*(u) = 1 − ∫_{−∞}^{∞} G0*(u) dG1*(u) = 1 − A.     (26)
Hence D(β) = π1(1 − 2A). Thus maximizing A is equivalent to minimizing D(β) with w = g0*.

This can also be deduced from the geometrical argument in Section 2·3. Maximizing the area under the lower of the two curves in Fig. 1 corresponds to minimizing the area between them. As the horizontal coordinate of ROC is 1 − G0*(u), we are minimizing

∫_{u=+∞}^{−∞} d(u) d[1 − G0*(u)] = ∫_{−∞}^{∞} d(u) g0*(u) du,

which, from (15), is just Dw with this particular choice of w.
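The empirical counterpart of A is the familiar Mann-Whitney statistic, the proportion of (y = 1, y = 0) pairs in which the y = 1 case receives the higher score. A short sketch (our illustration; scores is a numpy array), with ties counted 1/2 as in the definition of H* in Section 7:

import numpy as np

def empirical_auc(scores, y):
    """Area under the empirical ROC curve of a score."""
    diff = scores[y == 1][:, None] - scores[y == 0][None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# By the argument above, minimizing D(beta) with w = g0* is the same
# as maximizing this quantity.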
Example 4. w(u) = e^{−u/2}. Here, up to arbitrary constants, U(u) = −e^{−u/2} and V(u) = e^{u/2}, and so (2) becomes

E{y e^{−β'x/2} + (1 − y) e^{β'x/2}} = E{e^{−(2y−1)β'x/2}}.     (27)

This is the risk function behind the “AdaBoost” method of Schapire et al. (1998). See Vapnik (1999) for an account of this and related ideas in the machine learning literature. Allowing for the differences in notation, the right hand side of (27) is in fact exactly the same as equation (5·55) of Vapnik’s text (1999, p. 164), who goes on to prove that (27) is logistically consistent in the sense of our more general discussion in Section 2·1. Further, the minimum value of (27), attained when β is the correct value for the assumed logistic model, can be written

2 ∫ {π0 π1 g0(x) g1(x)}^{1/2} dx.     (28)

This is the “Matusita affinity”; see McLachlan (1992, pp. 23-25). Notice that (28) is equal to one minus ∫ {(π0 g0)^{1/2} − (π1 g1)^{1/2}}^2 dx, the squared Hellinger distance between π0 g0 and π1 g1.
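The empirical version of the exponential risk (27) is immediate to evaluate; a minimal sketch (ours; X is assumed to contain the intercept column x0 = 1):

import numpy as np

def exponential_risk(beta, X, y):
    """Empirical version of (27): mean of exp{-(2y - 1) beta'x / 2}."""
    s = X @ beta
    return np.mean(np.exp(-0.5 * (2.0 * y - 1.0) * s))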
3. The Dw Method

In practice, given a sample (yi, xi) with i = 1, ..., n, we minimize the empirical versions of the above risk functions, replacing population expectations E by empirical expectations (sample averages) Ē. The empirical version of D(β) in (2) is

D̄(β) = Ē{−yU(β'x) + (1 − y)V(β'x)} = (1/n) Σ_{i=1}^{n} {−yi U(β'xi) + (1 − yi) V(β'xi)},

and the empirical version of Dw in (16) is

D̄w(λ, β) = Ē[ y{U(λ(x)) − U(β'x)} − (1 − y){V(λ(x)) − V(β'x)} ].

Differentiating with respect to β gives the empirical version of (11),

∂D̄/∂β = ∂D̄w/∂β = −Ē[ x W(β'x) {y − pL(β'x)} ].     (29)

We define β̂w to be the zero of (29). Now the standard equations for maximum likelihood for logistic regression of y on x can be written

Ē[ x {y − pL(β'x)} ] = 0.     (30)
Comparing (29) and (30) we see that finding β̂w is just fitting a logistic regression with weights W(β'x). Since these weights depend on the unknown β, this has to be done iteratively. We start with the unweighted logistic regression estimate, β̂0 say, fit weighted logistic regression with weights W(β̂0'x) to give β̂1, refit with weights W(β̂1'x) to give β̂2, and so on. Assuming that software for logistic regression allows for the specification of case weights, this is easy to carry out — in all of our examples we have found that only three or four iterations are needed for adequate convergence.
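A minimal sketch of this iteration in Python, using scikit-learn's logistic regression with case weights (our own illustration, not the authors' code; the weight function w is supplied by the user, and the convergence control is deliberately crude):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_dw(X, y, w, n_iter=4):
    """Minimize the empirical Dw risk by iteratively reweighted
    logistic regression; w(u) is the weight function of (10)."""
    clf = LogisticRegression(C=1e10)      # large C: effectively no penalty
    clf.fit(X, y)                         # step 0: unweighted logistic fit
    for _ in range(n_iter):
        u = clf.decision_function(X)      # current scores beta'x
        W = w(u) * (1.0 + np.exp(u))      # case weights W(u) = w(u)(1 + e^u)
        clf.fit(X, y, sample_weight=W)    # weighted refit
    return clf

# w = lambda u: 1/(1 + np.exp(u)) reproduces ordinary logistic regression;
# a kernel estimate of g0* (Section 4.1) gives the AUC-oriented fit.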
3·1. Sampling properties under the logistic model

If the logistic model (7) is correct, then it is relatively straightforward to obtain the asymptotic sampling distribution of β̂w. Assuming that the data are a large random sample from the joint distribution of y and x, we find

E(∂²D̄w/∂β∂β') = E{W(β'x) p(x)(1 − p(x)) xx'} = Jw, say.     (31)

By the usual asymptotic arguments,

β̂w ≈ β − Jw^{-1} ∂D̄w/∂β.     (32)
But under the logistic model (7), and using equation (29), ∂D̄w/∂β has mean zero and variance n^{-1} Vw, where Vw = E{W²(β'x) p(x)(1 − p(x)) xx'}. Hence E(β̂w) ≈ β and

var(β̂w) ≈ (1/n) Jw^{-1} Vw Jw^{-1}.

Note that the special case of ordinary logistic regression has w = w0 and W = 1, giving Vw0 = Jw0. Hence

var(β̂w0) ≈ (1/n) Jw0^{-1}.

Now

Jw = E{p(x) w(β'x) xx'} = E[ {p(1 − p)}^{1/2} x · {p/(1 − p)}^{1/2} w x' ].

But

Jw0 = E{p(1 − p) xx'} and Vw = E{ (p/(1 − p)) w² xx' },

and so the Cauchy-Schwarz inequality gives

Vw ≥ Jw Jw0^{-1} Jw     (33)

and so var(β̂w) ≥ var(β̂w0). By these two matrix inequalities we mean that the difference between the two sides is nonnegative definite. This means that when the logistic model holds, the ordinary unweighted logistic estimate is asymptotically optimal.
This asymptotic optimality also carries over to risk functions based on Dw, namely

Rw(β, β̂) = E Dw(β'x, β̂'x).

But, following (31), Rw is locally quadratic,

Rw(β, β̂) ≈ E{ (β̂ − β)' Jw (β̂ − β) }.

It follows that for any weight functions w and w1,

Rw1(β, β̂w) ≥ Rw1(β, β̂w0).     (34)
We have assumed in the above that the matrix Jw in (31) is non-singular. However it can be shown that the derivations leading to (34) also hold if Jw^{-1} in (32) is taken as a generalized inverse. This covers the case, for example, of the weight function in Example 2 of Section 2·4.

3·2. Estimation for near-logistic models

This asymptotic optimality of β̂w0 depends critically on the logistic modelling assumption that λ(x) = β'x. In practice we can never be sure that this or any other model holds exactly, and so we must always allow for the possibility (indeed, the certainty!) that the model is misspecified. We will assume, however, that the covariates have been chosen sensibly so that the linear score β'x does not grossly violate the evidence in the data. By a “near-logistic” model we mean that there is a value of β such that β'x is reasonably close to the actual log odds function λ(x). This is equivalent to the assumption made in Section 2·3, that the ROC curves for λ(x) and β'x are close together.

When the logistic model is misspecified, the risk function

Rw(λ, β̂) = E Dw(λ, β̂)     (35)

reflects not only the variance of β̂ but also inherits the bias

Dw(λ, βw) = min_β Dw(λ, β).

Of course under the logistic model this bias is zero, since there is a true value of β for which λ(x) = β'x = βw'x for all w. In the near-logistic setting, β̂w0 is still better than β̂w on grounds of variance, but has a greater bias since, for all w ≠ w0,

Dw(λ, βw0) > Dw(λ, βw).
In practice, in finding β̂ to minimize (35) we need to compromise between β̂w0, which is best for variance, and β̂w, which is best for bias. We can do this as follows. Suppose the context of the problem suggests a particular weight function w1, so that estimates should be judged by (35) with w = w1. Consider the estimates β̂wα for the family of weight functions

wα(u) = (1 − α) w0(u) + α w1(u),     (36)

for 0 ≤ α ≤ 1. Then the value of α which minimizes the risk of β̂wα will tend to 1 as n → ∞ (when bias is predominant), but tends to 0 as n decreases (when variance is predominant). We show in the Appendix that for a finite sample size the risk Rw1(λ, β̂wα) is least for some intermediate value of α between 0 and 1.

3·3. Cross-validation risk

Since λ is unknown, we cannot calculate the actual risk (35), but we do have the value of D̄(β̂) for any given estimate β̂. It is tempting to use this as a way of judging the performance of different discriminant functions. However, this is a retrospective assessment, using the data both to derive an estimate and to judge its performance, and so is open to the dangers of over-fitting. To illustrate the problem with an artificial example, suppose we have a continuous covariate t. Let A be the set of values of t observed in the cases with y = 1 and let B be the set of values of t observed amongst the cases with y = 0. Then define the transformed covariate x to be 1 if t ∈ A, −1 if t ∈ B and 0 otherwise. Then, for any positive β, the discriminant function βx will divide the observations perfectly between the cases with y = 1 and the cases with y = 0, but will be useless if applied to any other set of data. This is an extreme example, but illustrates the point that there is always a danger of over-reacting to idiosyncrasies in a particular data set. The true worth of a discriminant function must be judged by its ability to discriminate future observations. In practice, cross-validation is usually used to imitate the results of such a predictive assessment.

For our given weight function w1(u), the crude retrospective assessment of β̂w is based on
D̄ = (1/n) Σ_i { −yi U1(β̂w'xi) + (1 − yi) V1(β̂w'xi) },     (37)

where in this (and in subsequent expressions) the suffix 1 on U, V and W means that these functions are calculated with respect to the weight function w1. Now suppose that β̂w is calculated from the same data but with the ith observation removed, giving β̂w^{(−i)}, say. Then the cross-validated version of (37) is

CV = (1/n) Σ_i { −yi U1(β̂w^{(−i)'}xi) + (1 − yi) V1(β̂w^{(−i)'}xi) }.     (38)
From (29), these estimates are given by

Σ_j W(β̂w'xj) {yj − pL(β̂w'xj)} xj = 0

and

Σ_{j≠i} W(β̂w^{(−i)'}xj) {yj − pL(β̂w^{(−i)'}xj)} xj = 0.
Using first order approximations similar to (32), we find

β̂w^{(−i)} − β̂w ≈ −(1/n) Ĵw^{-1} W(ûi)(yi − p̂i) xi,     (39)

where ûi = β̂w'xi, p̂i = pL(β̂w'xi), and, following (31),

Ĵw = (1/n) Σ_i W(ûi) p̂i (1 − p̂i) xi xi'.

Writing

U1(β̂w^{(−i)'}xi) − U1(β̂w'xi) ≈ w1(β̂w'xi) (β̂w^{(−i)} − β̂w)' xi     (40)

and a similar expression for V1, using (39) and then substituting into (37) and (38), leads to

CV ≈ D̄(w) + (1/n²) Σ_i W1(ûi) W(ûi) (yi − p̂i)² xi' Ĵw^{-1} xi.     (41)

Note that the cross-validation correction term in (41) is of order O(n^{-1}), but the presence of the quadratic form in the summation suggests that it will tend to increase with m (the number of covariates). Given the target weight function w1, we can find β̂w for w = wα in (36). In the terminology of Stone (1974), the cross-validatory choice of α is then given by the value of α minimizing (41).
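A sketch of the correction in (41) given a fitted score (our code; the weight functions W and W1 are supplied by the user, and X contains the intercept column):

import numpy as np

def cv_risk(D_bar, X, y, u_hat, W, W1):
    """Approximate cross-validated risk via (41)."""
    n = len(y)
    p_hat = 1.0 / (1.0 + np.exp(-u_hat))                      # pL(u_i)
    Wu = W(u_hat)
    J_hat = (X * (Wu * p_hat * (1.0 - p_hat))[:, None]).T @ X / n
    h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(J_hat), X)  # x_i' J^-1 x_i
    return D_bar + np.sum(W1(u_hat) * Wu * (y - p_hat) ** 2 * h) / n ** 2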
4. Examples

In this section we revisit two of the special cases for w(u) in Section 2·4, and discuss two simulated examples.

4·1. Maximizing the area under ROC: Example 1

Following Example 3 of Section 2·4, the target weight function w1(u) for maximizing the area under ROC is the conditional probability density function of β'x given y = 0. Given sample discriminant scores ûi = β̂w'xi, U1(u) and V1(u) in (10) are therefore estimated by

Û1(u) = Σ_i (1 − yi) H(u − ûi) / Σ_i (1 − yi),   V̂1(u) = Σ_i (1 − yi) e^{ûi} H(u − ûi) / Σ_i (1 − yi).

Substituting these into (37) gives

D̄ = {1/(n Σ_i (1 − yi))} Σ_{ij} H(ûj − ûi)(1 − yi){−yj + (1 − yj) e^{ûi}}.
To calculate β̂w for w = wα in (36), we need an explicit estimate of w1(u). Kernel density estimation with a Gaussian kernel and bandwidth ε gives

ŵ1(u) = {1/(ε Σ_i (1 − yi))} Σ_i (1 − yi) φ{(ûi − u)/ε},     (42)

where φ is the standard normal probability density function.

The cross validation formula (41) assumed that the weight functions are fixed. Here, they are estimated through (42). However it can be shown that the data-dependence in ŵ, and the fact that β̂^{(−i)} uses a slightly different weight function from β̂, affect only second order terms in the asymptotic approximations of Section 3·3, and so (41) remains valid to first order. Of course (38) could be calculated directly, but this would be impractical if n is large.
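A direct transcription of (42) (our sketch; eps is the bandwidth ε):

import numpy as np
from scipy.stats import norm

def w1_hat(u, scores, y, eps):
    """Gaussian-kernel estimate (42) of the density of beta'x given y = 0."""
    s0 = scores[y == 0]
    u = np.atleast_1d(u)
    return norm.pdf((s0[:, None] - u[None, :]) / eps).mean(axis=0) / eps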
To test out the methodology, suppose there are 2 covariates x1 and x2, and that n = 200, the first 100 observations having y = 0 and the remaining observations having y = 1. Initially, all the covariate values are independent samples from N(0, 1), but we simulate a gross outlier by changing the first value of x1 into x1 = 100. We then add a positive constant onto the second 100 values of x1, chosen such that

Ē(x1|y = 0) = Ē(x1|y = 1).     (43)

The point of this example is that, because of (43), the ordinary logistic estimate β̂ = (β̂0, β̂1, β̂2) will have β̂1 = 0 (this is easily seen from the form of (30)). The resulting score will be useless, as it will merely reflect the random errors in x2. But x1 is quite informative, since its values with y = 1 will tend to be larger than the values with y = 0 because of the positive constant added as described. In fact the (empirical) area under the ROC of x1 is 0·76, but the area under the ROC of x2 is only 0·53. These two ROC curves are shown in Fig. 2 (left graph). To maximize the area under ROC for these data, we need to put most weight on x1, whereas ordinary logistic regression puts most weight on x2.

In this example the smoothness of the weight functions is not very important, and so we choose ε subjectively to be the smallest bandwidth for which (42) is a reasonably smooth density estimate. Here we take ε = 0·1. Fig. 2 (right graph) shows the estimates of D̄ and CV plotted against α. The main feature is the sharp drop in risk when α reaches about 0·2. At this value, the estimate essentially switches from having most weight on x2 to having most weight on x1. The area for β̂w1 is indistinguishable from that for x1.

Fig. 2 shows that this example can hardly be described as near-logistic, although the problem is caused by just one observation. If the outlier is removed then logistic regression fits very well. This is an extreme example, but in practice if we have a large number of covariates then it is perhaps possible that a recording error of this magnitude might not be spotted. Here, the outlier has led to logistic regression choosing the wrong covariate.

4·2. Estimation for fixed u0: Example 2

Following Example 2 of Section 2·4, suppose that w1(u) gives all weight to a single fixed value u = u0. Here D(β) is given by (19), which is estimated by

D̄ = (1/n) Σ_i {−yi + (1 − yi) e^{u0}} H(ûi − u0).     (44)
For the cross validation risk, suppose that β̂w^{(−i)'}xi is β̂w'xi + ei, where, from (39),

ei ≈ −n^{-1} W(ûi)(yi − p̂i) xi' Ĵw^{-1} xi.     (45)

Then a direct estimate of CV is

CV = (1/n) Σ_i {−yi + (1 − yi) e^{u0}} H(ûi + ei − u0).     (46)
This approximation uses the first order asymptotics in (39) but avoids the subsequent approximation (40), because in this case U1 is not a continuous function. For the same reason, we cannot use (29) directly to find β̂w for w = wα in (36). One approach is to replace w1 by a continuous Gaussian kernel centred on u0, and then to study the dependence of the risk of β̂w on both α and the bandwidth of this kernel. A simpler approach comes from noting the essential feature of (36), a smooth graduation between w0 when α = 0 and the Dirac delta function at u0 as α → 1. This also holds for the alternative family of weight functions

wα*(u) = {(2π)^{1/2} w0(u)/(1 − α)} φ{α(u − u0)/(1 − α)}.     (47)
Again w0* = w0, and wα* puts increasing weight on u0 as α → 1. For any given 0 ≤ α < 1 we take w = wα*, calculate β̂w, and hence estimate the quantities required in (44), (45) and (46).

For a numerical example, take n = 100 and m = 2, with the pairs (x1, x2) uniformly distributed over the unit square. The ys are simulated from the model

p(x) = pr(y = 1|x) = x1²/(x1² + x2²).     (48)
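This simulation takes only a few lines to reproduce (our sketch; the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(size=(n, 2))                  # (x1, x2) uniform on the unit square
p = x[:, 0]**2 / (x[:, 0]**2 + x[:, 1]**2)    # p(x) of (48)
y = rng.binomial(1, p)                        # simulated class labels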
By symmetry, the best linear logistic model has contours parallel to the main diagonal of the unit square, but this is misspecified since the true contours of p(x) in (48) are lines radiating from the origin. The data are illustrated in Fig. 3 (left graph). The non-logistic nature of these data is not obvious from the plot, and is not detected by fitting a linear logistic model with an added interaction term, for instance.

The right graph in Fig. 3 plots (44) and (46) against α, for the weight function (47). We have taken u0 = 2, which means that we are particularly interested in finding a discriminant score which fits the data well towards the lower right corner of the unit square. The graph shows the range of α from 0·1 to 0·6. For α < 0·1 the values of both functions are indistinguishable from their values at 0·1: essentially this is the linear logistic estimate. For α > 0·6 the variance of the estimate increases, giving very high values to these risks. Both D̄ and CV suggest that the best value is around α = 0·5, giving a risk noticeably better than the ordinary logistic model.

Notice that both in this example, and in Section 4·1, the inference from CV is essentially the same as that from D̄. There were only 2 covariates, giving little scope for over-fitting. The use of CV becomes important if we have a large number of covariates and/or a small sample size, particularly when we are comparing models involving different transformations of these covariates.

5. The Dw Method for Different Sampling Schemes

An important assumption made in Sections 3 and 4 is that the data are randomly sampled from the population of interest, so that the observations (yi, xi) are representative of the distribution of (y, x) over which the risk function (2) is defined. In many applications this assumption needs to be relaxed. In medical screening, for example, we may be interested in the performance of the score β'x over the general population, but the data for analysis may be from a cohort study (sampling conditional on x) or a case-control study (sampling conditional on y). It is well known that if the logistic model is assumed to be true then logistic regression can still be used on data from such sampling schemes, with at most an unimportant change to the intercept term (see the discussion in Carroll et al., 1995). But when the model is mis-specified the relationship between the sample and
the population needs to be described more carefully. We show that, provided the sampling scheme is fully specified, the Dw method continues to apply but with an appropriately modified weight function.

Consider the (y, x) population over which the risk function (2) is defined. Suppose that each member of this population has a probability of being selected for the sample, so that if C is the event of being chosen,

pr(C|y, x) ∝ q(y, x),

where q(y, x) is assumed to be known up to a constant of proportionality. Then

log{pr(y = 1|x, C)/pr(y = 0|x, C)} = a(x) + λ(x),     (49)

where

a(x) = log{q(1, x)/q(0, x)}.

If β is the solution to the population Dw equation (11) then, trivially,

E[ x W(β'x) {q(y, x)}^{-1} {y − pL(β'x)} | C ] = 0.     (50)
From (49) the left hand side of (50) can now be written as

E[ x W*(x) {y − pL(a(x) + β'x)} | C ],

where

W*(x) = W(β'x) {1 + e^{a(x)+β'x}} / [ q(1, x) {1 + e^{β'x}} ].
Hence if the data (yi, xi) are sampled according to q(y, x), and we define b(β, x) = a(x) + β'x, then

Σ_i xi [ W{b(β, xi) − a(xi)} {1 + e^{b(β,xi)}} / ( q(1, xi) {1 + e^{b(β,xi)−a(xi)}} ) ] [ yi − pL{b(β, xi)} ] = 0     (51)
is an unbiased estimating equation for β. Notice that equation (51) is still weighted logistic regression, but with a modified weight function, and with the known function a(x) added to β'x as an offset (McCullagh & Nelder, 1989).

For random sampling, q(y, x) = 1, a(x) = 0, and so (51) is the same as (29), as expected. For cohort sampling we assume that, for some known sampling intensity function q(x), q(1, x) = q(0, x) = q(x), and so a(x) is also zero. In this case (51) is the same as (29) but with W(β'x) replaced by W(β'x)/q(x). We just weight inversely with the probability of selection, as expected. For case-control sampling we assume that, for two sampling fractions q0 and q1, q(1, x) = q1 and q(0, x) = q0, and so a(x) is just the constant

a = log(q1/q0).
In this case the estimating equation is

Σ_i xi W(β'xi) [{1 + e^{a+β'xi}}/{1 + e^{β'xi}}] {yi − pL(a + β'xi)} = 0.     (52)
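In software terms, (52) is again a weighted logistic regression, now with the constant offset a = log(q1/q0). A sketch of one reweighting step using statsmodels (our illustration; u holds the current scores β'x, W is the weight function of Section 2·1, and X includes the intercept):

import numpy as np
import statsmodels.api as sm

def case_control_step(X, y, u, a, W):
    """One reweighted-logistic step for (52) with offset a = log(q1/q0)."""
    w_star = W(u) * (1.0 + np.exp(a + u)) / (1.0 + np.exp(u))
    glm = sm.GLM(y, X, family=sm.families.Binomial(),
                 offset=np.full(len(y), a), var_weights=w_star)
    return glm.fit().params
# Iterate, recomputing u and w_star from the new parameters, until stable.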
To explore the form of the case-control weight function in (52), let us revisit the two examples of Section 4. Firstly, suppose we wish to use case-control data to maximize the area under ROC. In practice we would first fit a standard logistic regression which, under case-control sampling, would be estimating β* with β*'x = β'x + a, where β is the corresponding parameter which would be estimated by logistic regression under random sampling. We then form the weight function

(∂/∂u) pr(β*'x ≤ u|y = 0) = (∂/∂u) pr(β'x ≤ u − a|y = 0) = g0*(u − a)

and redefine β* by the equation

E[ x g0*(β*'x − a) {1 + e^{β*'x}} {y − pL(β*'x)} | C ] = 0.

But this is proportional to

−π0 q0 E{ x g0*(β*'x − a) e^{β*'x} | y = 0 } + π1 q1 E{ x g0*(β*'x − a) | y = 1 },

which is proportional to

−π0 E{ x g0*(β*'x − a) e^{β*'x−a} | y = 0 } + π1 E{ x g0*(β*'x − a) | y = 1 }
= −π0 E{ x g0*(β'x) e^{β'x} | y = 0 } + π1 E{ x g0*(β'x) | y = 1 }
= E[ x g0*(β'x) {1 + e^{β'x}} {y − pL(β'x)} ].

This is just minus one times (11). Hence the case-control estimate of β is exactly the same as for random sampling, except for the addition of a to the intercept. Of course an additive shift to the score β'x does not affect the ROC curve or its area, and so this invariance is what we would expect, ROC depending on the joint distribution of (y, x) only through the conditional distributions of x given y.
www.ics.uci.edu/ mlearn/MLSummary.html In this data set there are n = 683 tumours (observations) of which 444 were benign (y = 0) and 239 were malignant (y = 1). There were m = 9 tumour measures (covariates x), each coded on the scale 0 to 10. Initial analysis of these data shows that there is very good separation between the two groups, so that quite accurate diagnosis is possible in this case. The logistic score of y on x gives an ROC curve with a remarkably high area, A = 0·9963. To illustrate the above theory, we first ask whether it is possible to improve on the logistic score in the sense of maximising the area under ROC. Following Section 4·1, we take + = 1 as our subjective choice of the smoothing parameter needed to give a visually smooth estimate of the density in (42), and then form the corresponding discriminant functions βˆw x with wα given by (36). As expected, the discriminant with α = 1 does give a slightly higher (retrospective) ROC area than that with α = 0. Fig. 4 (left graph) ¯ and CV against α. The curve for D ¯ is slightly lower at α = 1 than at α = 0 plots D as expected. However the curve for CV increases as α increases, suggesting that the apparent gain over straightforward logistic regression is probably spurious. Both curves are in any case quite flat. Turning to a Dw analysis for screening, the value u0 = 1·5 gives a false positive rate under the logistic score of slightly less than 1%. It is of interest to find the linear score β x which maximizes the corresponding true positive rate. Following Section 4·2, Fig. 4 (right 21
graph) plots the corresponding values of D̄ and CV. As in the earlier artificial example, both risks rise sharply as α increases towards 1. But both retrospective and prospective risks decrease initially, taking a minimum at about α = 0·15. The corresponding β̂w'x is therefore the score with the best performance at this particular threshold.

Further study of these data shows that the optimum β̂w in the screening analysis depends on the choice of u0, suggesting that these scores have different statistical properties at different points of their range. This is confirmed by smooth density plots of the two conditional distributions of the logistic score given y. Both seem quite close to normal, but with substantially different variances. This suggests that it might be useful to investigate quadratic discriminant scores or other non-linear transformations of these covariates.

7. Comments

It is worth noting that the classical LDF does not belong to the Dw class. If g1(x) and g0(x) are multivariate normal with a common variance-covariance matrix, then the logistic model holds exactly and the LDF is, up to a linear transformation, a consistent estimate of the true log odds score β'x. But this does not hold in the more general setting when the logistic model holds without the stronger Gaussian assumptions.

Section 2·4 looked at some special cases for w(u) and evaluated the corresponding risk functions under the logistic model. The logistic model, however, is not needed for Examples 2 and 3 of Section 2·4. If the model is misspecified in Example 2 of Section 2·4, for instance, we can still define the risk function C(u) in (21), but it is helpful now to write this more explicitly as C(β, u). The best value of β for w(u) = δ(u − u0) is given by ∂C(β, u0)/∂β = 0, or

∫_{βw'x = u0} x {p(x) − pL(u0)} g(x) dx = 0.     (53)
But for a general threshold u we have

∂C(β, u)/∂u = (1 + e^{u0}) ∫_{β'x=u} {p(x) − pL(u0)} g(x) dx,

which is zero when u = u0 and β = βw, from the first (intercept) component of the vector equation (53). Thus minimizing the risk over both u and β gives the same value of β as minimizing D in (19).

Example 3 of Section 2·4 shows that if w = g0* then, under the logistic model, D equals π1(1 − 2A). If the model is misspecified, however, this no longer holds exactly. Equation (25) does hold exactly when g* is the probability density function of β'x (now different from that of λ(x)). But (24), and hence (26), rests on the assumption that g* is the probability density function of the true log odds statistic, and so will only hold
approximately in the near-logistic case. Informally, one would not expect the two versions of g*, for λ(x) and for β'x, to differ very markedly, and weighted regression estimates are typically quite insensitive to small changes in the case weights.

Although the arguments of the functions U and V in (1) and elsewhere depend on β, Section 2·1 assumes that these functions themselves are fixed. If U and V depend on β, as in the case of the ROC weight function, ∂D/∂β has another term which we have ignored. However it can be shown directly that the equations (11) and ∂A/∂β = 0 are approximately the same in the near-logistic case.

We have only given an informal definition of the term “near-logistic”. A rigorous treatment of Section 3·2 would need bounds on

ε(x) = λ(x) − β'x,     (54)

measuring the difference between the true probabilities p(x) and the best fitting logistic probabilities pL(β'x). The approximations used in the Appendix below are valid when (54) is O(n^{-1/2}). This implies that squared biases in β̂w are of the same order of magnitude as sample variances, justifying the locally quadratic asymptotic approximations to the risk functions. Essentially, this means that model misspecification is of the same order of magnitude as sample variability, which seems sensible if in the initial analysis of the data we “reject” any model which gives a “significant” value for some appropriate goodness-of-fit statistic.

Section 3·2 suggests the class of weight functions wα in (36), but Section 4·2 uses the alternative form wα* in (47). In fact the existence of an optimum α between 0 and 1 also holds for the second class. Another family of risk functions for this case could be defined by

wα** = (1 − α) w0 + {(2π)^{1/2} α/(1 − ε)} φ{α(u − u0)/(1 − ε)}.

For fixed 0 ≤ ε ≤ 1, an optimum α exists as for (36), at α(ε) say. Further, the risk R_{wα(ε)} has an optimum ε in the closed interval [0, 1], so the joint optimum (α, ε) also exists. Similarly, we can fix a smooth constraint between ε and α, such as ε = α as implied in (47).

For ease of notation we have assumed that x is continuous, but there are no essential changes to the Dw method if the covariates are discrete. Some care, however, is needed in defining ROC in the discrete case, since the “curve” then becomes a series of points. Conventionally, the points are joined by straight lines in order to define the area A. If the score β'x takes values on the discrete set u1, u2, ..., then the integral in (5) is replaced by

A = Σ_{ij} H*(ui − uj) pr1(β'x = ui) pr0(β'x = uj),
where H*(u) is defined to be 1 if u > 0, 1/2 if u = 0 and 0 if u < 0.
Acknowledgements

We are grateful to Dr William H. Wolberg for access to the data used in Section 6. Part of this work was supported by the Visiting Fellowship Program of the Japan Society for the Promotion of Science.

Appendix

The proof in the appendix establishes that, for near-logistic models, the risk for the weight function wα in (36) is optimum for a value of α in (0, 1). Let βα = argmin_β Dwα(λ, β). Then, following Section 3·1, we have

Dw1(λ, βα) = Dw1(λ, β1) + (1/2)(βα − β1)' Jw1 (βα − β1).

From the definitions of βα and β1 as the zeros of (11), we also have

βα − β1 = Jw1^{-1} (E ε p wα x)|_{β=β1} = (1 − α) Jw1^{-1} (E ε p w0 x),

where ε = ε(x) is defined in (54). Hence

Dw1(λ, βα) = Dw1(λ, β1) + (1 − α)² Δ²,

where

Δ² = (E ε p w0 x)' Jw1^{-1} (E ε p w0 x).

This is based on the above linear approximation of βα − β1 in terms of the misspecification ε(x), and so is assuming that ε is of the order of magnitude O(n^{-1/2}).

Our risk function Rw1(λ, β̂α) = E Dw1(λ, β̂α) can be expanded up to terms of order O(n^{-1}) to give
Dw1(λ, βα) + {∂Dw1(λ, β)/∂β|_{β=βα}}' E(β̂α − βα) + (1/2) tr{Jw1 var(β̂α)}
= [ Dw1(λ, β1) + (1/2n) tr{Jw1 Jw0^{-1}} ] + [ (1 − α)² Δ² + (1/2n) tr{ Jw1 (Jwα^{-1} Vwα Jwα^{-1} − Jw0^{-1}) } ].

The first term in brackets is a fixed bound on the risk for all α. Now

tr{ Jw1 (Jwα^{-1} Vwα Jwα^{-1} − Jw0^{-1}) } = α² tr( Jw1 Jwα^{-1} R Jwα^{-1} ),
where R = Vw1 − Jw1 Jw0^{-1} Jw1 is positive definite from (33). Hence the second bracketed term is

φ(α) = (1 − α)² Δ² + (α²/2n) tr( Jwα^{-1} Jw1 Jwα^{-1} R ).
This gives

φ'(α) = 2(α − 1) Δ² + (α/2n) [ 2 tr( Jwα^{-1} Jw1 Jwα^{-1} R )
        − α tr{ Jwα^{-1} (Jw1 − Jw0) Jwα^{-1} Jw1 Jwα^{-1} R } − α tr{ Jwα^{-1} Jw1 Jwα^{-1} (Jw1 − Jw0) Jwα^{-1} R } ],

which implies that

φ'(0) = −2Δ² < 0 and φ'(1) = (1/n) tr( Jw0 Jw1^{-1} R Jw1^{-1} ) > 0.
Hence the minimum must be attained inside the interval (0, 1).

References

Armitage, P. & Colton, T. (eds) (1998). Encyclopedia of Biostatistics. Wiley: Chichester.
Begg, C. B., Satogopan, J. M. & Berwick, M. (1998). A new strategy for evaluating the impact of epidemiologic risk factors for cancer with applications to melanoma. J. Amer. Statist. Assoc., 93, 415-26.
Berwick, M., Begg, C. B., Fine, J. A., Roush, G. C. & Barnhill, R. L. (1996). Screening for cutaneous melanoma by self skin examination. J. National Cancer Inst., 88, 17-23.
Carroll, R. J., Wang, S. & Wang, C. Y. (1995). Prospective analysis of logistic case-control studies. J. Amer. Statist. Assoc., 90, 157-69.
Collett, D. (1992). Modelling Binary Data. Chapman & Hall/CRC: Florida.
Copas, J. B. (1999). The effectiveness of risk scores: the logit rank plot. Applied Statistics, 48, 165-83.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-88.
Girling, A. (2000). Rank statistics expressible as integrals under P-P-plots and receiver operating characteristic curves. J. Roy. Statist. Soc. B, 62, 367-82.
Green, D. M. & Swets, J. A. (1988). Signal Detection Theory and Psychophysics. Peninsula Publishing: California.
Hand, D. J. (1981). Discrimination and Classification. Wiley: Chichester.
Hand, D. J. (1997). Construction and Assessment of Classification Rules. Wiley: Chichester.
Hand, D. J. & Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: a review. J. Roy. Statist. Soc. A, 160, 523-41.
Lachenbruch, P. A. (1975). Discriminant Analysis. Hafner: New York.
McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman & Hall: London.
McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley: New York.
Metz, C. E. (1978). Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8, 283-98.
Schapire, R., Freund, Y., Bartlett, P. & Lee, W. S. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist., 26, 1651-86.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc. B, 36, 111-47.
Stone, M. (1978). Cross-validation: a review. Math. Operationsforsch. Statist., Ser. Statistics, 9, 127-39.
Vapnik, V. N. (1999). The Nature of Statistical Learning Theory. Springer: New York.
Wolberg, W. H. & Mangasarian, O. L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, U.S.A., 87, 9193-6.
Zweig, M. H. & Campbell, G. (1993). Receiver operating characteristic (ROC) plots — a fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39, 561-77.
List of Figures

Fig. 1. ROC curves for λ(x) and β'x.
Fig. 2. Example 1 (Section 4·1): ROC curves for x1 and x2 (left); D̄ and CV plotted against α (right).
Fig. 3. Example 2 (Section 4·2): the simulated data (left); D̄ and CV plotted against α (right).
Fig. 4. Breast cancer example: D̄ and CV plotted against α for the ROC-area analysis (left) and for the screening analysis (right).