Generalized Linear and Quadratic Discriminant Functions Using Robust Estimates

RONALD H. RANDLES, JAMES D. BROFFITT, JOHN S. RAMBERG, and ROBERT V. HOGG*
Two new methods of constructing robust linear and quadratic discriminant functions are introduced. The first is a generalization of Fisher's procedure for finding a linear discriminant function. It places less weight on those observations that are far from the overlapping regions of the two populations. The second new method substitutes M-estimates of the means and the covariance matrices into the usual expressions for the linear and quadratic discriminant functions. Monte Carlo results indicate lower misclassification probabilities for these schemes compared to Fisher's linear discriminant function in cases of heavy-tailed or contaminated distributions.
KEY WORDS: Discriminant functions; Robustness; M-estimates.

* Ronald H. Randles is Professor, Department of Statistics; James D. Broffitt is Associate Professor, Department of Statistics; John S. Ramberg is Professor, Division of Systems Engineering and Department of Statistics; and Robert V. Hogg is Professor and Chairman, Department of Statistics, all at the University of Iowa, Iowa City, IA 52242. This research was supported in part by National Institutes of Health Grant GM 22271. The authors would like to thank Mark Johnson and Chiang Wang for some of the computer programming used in the Monte Carlo study.

1. INTRODUCTION

The objective of the classical discriminant analysis problem is to classify a p-variate observation vector z as having come from one of two populations $\Pi_x$ or $\Pi_y$. The decision rule is constructed from two independent random samples $x_1, \ldots, x_{n_x}$ and $y_1, \ldots, y_{n_y}$, termed training samples, which are known to be from the respective p-variate populations $\Pi_x$ and $\Pi_y$. To simplify the discussion, we assume that there is no prior information warranting special emphasis on either population; that is, the prior probabilities of obtaining an individual from populations $\Pi_x$ and $\Pi_y$ are the same, and the costs of misclassifying individuals from either population are also equal. Consider the rule that classifies the observed vector z into population $\Pi_x$ ($\Pi_y$) if

$$f_x(z) > (<)\; f_y(z) , \qquad (1.1)$$

where $f_x(\cdot)$ and $f_y(\cdot)$ are the densities of the respective populations $\Pi_x$ and $\Pi_y$. This rule minimizes the expected cost, which, with our assumptions, is equivalent to minimizing the sum of the two misclassification probabilities. If the populations are assumed to be p-variate normal distributions with a common covariance matrix, then replacing the unknown parameters in the rule (1.1) with their appropriate sample estimates yields a rule that classifies an observed z into $\Pi_x$ ($\Pi_y$) if

$$D_L(z) = [z - \tfrac{1}{2}(\bar{x} + \bar{y})]' S^{-1} (\bar{x} - \bar{y}) > (<)\; 0 , \qquad (1.2)$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the respective training samples and

$$S = \frac{(n_x - 1) S_x + (n_y - 1) S_y}{n_x + n_y - 2} \qquad (1.3)$$

is their pooled sample covariance matrix. The function $D_L(\cdot)$ is known as Fisher's linear discriminant function (LDF). If we do not assume that the two normal populations have the same covariance matrix, then estimating the unknown parameters in (1.1) yields a rule that classifies the z vector into $\Pi_x$ ($\Pi_y$) if

$$D_Q(z) > (<)\; 0 , \qquad (1.4)$$

where $D_Q(\cdot)$ is the quadratic discriminant function (QDF) defined by

$$D_Q(z) = (z - \bar{y})' S_y^{-1} (z - \bar{y}) - (z - \bar{x})' S_x^{-1} (z - \bar{x}) + \ln(|S_y| / |S_x|) .$$
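For concreteness, here is a minimal sketch of the QDF score in Python. NumPy and the function name are our choices, not the paper's; the sample means and covariances play the role of the estimated parameters in (1.4).

```python
import numpy as np

def qdf_score(z, X, Y):
    """D_Q(z) of (1.4) from training samples X (from Pi_x) and Y (from Pi_y);
    a positive score classifies z into Pi_x, a negative one into Pi_y."""
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    Sx, Sy = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
    dx, dy = z - xbar, z - ybar
    return (dy @ np.linalg.inv(Sy) @ dy
            - dx @ np.linalg.inv(Sx) @ dx
            + np.log(np.linalg.det(Sy) / np.linalg.det(Sx)))
```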
These two decision rules, both based on multivariate normal population assumptions, are widely used in practice. The purpose of this article is to describe methods for constructing linear and quadratic discriminant functions which exhibit robustness with respect to changes in the population model and which, in particular, are not strongly influenced by outliers. Seemingly, such functions should be more effective in classifying observations in the main body of the data. One of the methods uses a distance function approach to concentrate the effectiveness of the rule on the region in which the two populations overlap, the z-values that are most difficult to classify. A second scheme uses discriminant functions of the same form as the LDF and the QDF, except that Huber-type M-estimators determined from the training samples are used to estimate the means and the covariance matrices.

The idea behind these new discriminant functions is to determine robustly a one-dimensional linear space that (in some sense) best separates the projections of the observed x's and y's. The decision rule is then completed by deciding on an appropriate cutoff value for the discriminant function scores. For example, the rules in (1.2) and (1.4) use zero as their cutoff value. Randles, Broffitt, Ramberg, and Hogg (1978) develop the use of rank cutoffs for discriminant rules. This idea evolved
from work (Broffitt, Randles, and Hogg 1976) in partial discriminant analysis in which rules based on ranks, possessing a distribution-free property, were developed. A rank cutoff enables the decision maker to control the relative emphasis of the discrimination rule by producing one with approximate control over the ratio of the two misclassification probabilities. When used together with a robust discriminant function such as the ones introduced in this article, a rank cutoff provides an extra measure of robustness in the performance of the procedure.

To describe the construction of a rank cutoff, let $D(\cdot)$ denote any discriminant function that treats the observations within each of the two training samples symmetrically and which tends to give larger discriminant-function values to x's and smaller ones to y's. Initially we treat z as though it were from population $\Pi_x$ and construct a discriminant function, denoted by $D_x(\cdot)$, based on the two training samples $x_1, \ldots, x_{n_x}, z$ and $y_1, \ldots, y_{n_y}$. Let $R_x(z)$ denote the rank of $D_x(z)$ among $D_x(x_1), \ldots, D_x(x_{n_x}), D_x(z)$, ranking from smallest to largest. Then a smaller rank will indicate that z "looks" more like a y. If we are sampling from continuous populations and if the z vector actually arises from $\Pi_x$, then $R_x(z)$ will have a uniform distribution over the integers $1, 2, \ldots, n_x + 1$. The second stage in constructing a rank cutoff is to treat z as though it were from $\Pi_y$, forming $D_y(\cdot)$, a discriminant function based on the training samples $x_1, \ldots, x_{n_x}$ and $y_1, \ldots, y_{n_y}, z$, that again tends to give larger function values to the x's than to the y's. Let $R_y(z)$ denote the rank of $-D_y(z)$ among $-D_y(y_1), \ldots, -D_y(y_{n_y}), -D_y(z)$. Here a smaller rank makes z look more like an x. Forced discrimination is accomplished by comparing the p-values $P_x(z) = R_x(z)/(n_x + 1)$ and $P_y(z) = R_y(z)/(n_y + 1)$ (see Rao 1954; Randles, Broffitt, Ramberg, and Hogg 1978). If no special emphasis is warranted and the experimenter wants to equate the two misclassification probabilities, then z is classified in $\Pi_x$ if $P_x(z) > P_y(z)$, and in $\Pi_y$ if $P_x(z) < P_y(z)$. If the p-values are equal, another rule is used to break the tie. For a detailed description of how ties are broken, plus information on how to use a rank cutoff when there is prior information warranting special emphasis on one of the two populations, see Randles, Broffitt, Ramberg, and Hogg (1978).

Sections 2 and 3 describe the two methods for constructing robust discriminant functions. Emphasis is placed on constructing a functional with robust coefficients for the elements of z in anticipation that the resulting functional will be used in conjunction with a rank cutoff. The rules are compared through a Monte Carlo study in Section 4. The rules using M-estimates in the LDF and the QDF, together with rank cutoffs, perform excellently, particularly with heavy-tailed or contaminated distributions.
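As a concrete illustration, here is a minimal Python sketch of the two-stage rank-cutoff rule, using Fisher's LDF of (1.2)-(1.3) as the base discriminant $D(\cdot)$. The function names are ours, NumPy is assumed, and a tied pair of p-values is simply reported rather than broken by the auxiliary rule.

```python
import numpy as np

def fisher_ldf(X, Y):
    """Fit the LDF (1.2) with pooled covariance (1.3); returns z -> D_L(z)."""
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    nx, ny = len(X), len(Y)
    S = ((nx - 1) * np.cov(X, rowvar=False)
         + (ny - 1) * np.cov(Y, rowvar=False)) / (nx + ny - 2)
    Sinv, mid = np.linalg.inv(S), (xbar + ybar) / 2
    return lambda z: (z - mid) @ Sinv @ (xbar - ybar)

def rank_cutoff_classify(X, Y, z, fit=fisher_ldf):
    # Stage 1: treat z as an x; rank D_x(z) among the x-scores, smallest to largest.
    Dx = fit(np.vstack([X, z]), Y)
    Rx = np.sum([Dx(x) <= Dx(z) for x in X]) + 1   # +1 counts D_x(z) itself
    Px = Rx / (len(X) + 1)                         # p-value P_x(z)
    # Stage 2: treat z as a y; rank -D_y(z) among the negated y-scores.
    Dy = fit(X, np.vstack([Y, z]))
    Ry = np.sum([-Dy(y) <= -Dy(z) for y in Y]) + 1
    Py = Ry / (len(Y) + 1)
    # Equal emphasis: the larger p-value wins; equal p-values need a tie-breaker.
    return "Pi_x" if Px > Py else ("Pi_y" if Px < Py else "tie")
```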
2. A ROBUST LINEAR RANKING FUNCTION

First recall that the coefficients of z in the LDF are

$$\lambda_0 = S^{-1}(\bar{x} - \bar{y}) ,$$

where $\bar{x}$, $\bar{y}$, and $S$ are the estimated means and estimated common covariance matrix, respectively. Also, $\lambda_0$ is that vector which maximizes

$$\frac{\lambda'(\bar{x} - \bar{y})}{(\lambda' S \lambda)^{1/2}} .$$

Thus $\lambda_0$ is chosen to describe the direction that maximizes the separation between the two sample means relative to the standard deviation. However, this ratio can be rewritten as

$$\frac{1}{n_x} \sum_{i=1}^{n_x} \frac{\lambda' x_i - \lambda'(\bar{x} + \bar{y})/2}{(\lambda' S \lambda)^{1/2}} + \frac{1}{n_y} \sum_{i=1}^{n_y} \frac{\lambda'(\bar{x} + \bar{y})/2 - \lambda' y_i}{(\lambda' S \lambda)^{1/2}} .$$

That is, if we let $T(d) = d$ and the measure of the middle be $m = (\bar{x} + \bar{y})/2$, then $\lambda_0$ maximizes

$$\frac{1}{n_x} \sum_{i=1}^{n_x} T\!\left[\frac{\lambda' x_i - \lambda' m}{(\lambda' S \lambda)^{1/2}}\right] + \frac{1}{n_y} \sum_{i=1}^{n_y} T\!\left[\frac{\lambda' m - \lambda' y_i}{(\lambda' S \lambda)^{1/2}}\right] . \qquad (2.1)$$
This expression represents the average value of a measure of the separation between the training samples whenever $T(\cdot)$ denotes a nondecreasing, odd, and nonconstant function. Therefore, among hyperplanes of the form $\lambda' z = \lambda' m$, we seek the coefficient vector $\lambda$ that maximizes the separation between the training samples in the sense that it maximizes (2.1). An $x_i$ such that $\lambda' x_i > \lambda' m$ is correctly classified and correspondingly contributes a positive increment to (2.1). If $x_i$ satisfies $\lambda' x_i < \lambda' m$, a negative value is included. A similar (but opposite) statement holds for each $y_i$.

Suppose now that we wish to reduce the influence of observations that are a great distance from a robust measure of the middle, say m. We continue to take the function T to be a nondecreasing, odd function, but with bends in the tails as depicted in Figure A. In addition, we might also want to recognize that the distributions have different covariance matrices, using estimates $S_x$ and $S_y$, respectively. These substitutions produce a direct generalization of Fisher's original approach; namely, to find the $\lambda$ (say $\lambda_0$) which maximizes

$$U(\lambda) = \frac{1}{n_x} \sum_{i=1}^{n_x} T\!\left[\frac{\lambda' x_i - \lambda' m}{(\lambda' S_x \lambda)^{1/2}}\right] + \frac{1}{n_y} \sum_{i=1}^{n_y} T\!\left[\frac{\lambda' m - \lambda' y_i}{(\lambda' S_y \lambda)^{1/2}}\right] . \qquad (2.2)$$
Then for an observation z, we could compute its discriminant score as $\lambda_0' z$. Values of $\lambda_0' z$ larger (smaller) than $\lambda_0' m$ indicate classification in $\Pi_x$ ($\Pi_y$).

[Figure A. A Typical T Function]

For a simple, but extreme, illustration, let us take T(d) to be the sign of d; then

$$U(\lambda) = 2\left[1 - \frac{1}{n_x}(\text{number of } x\text{'s classified in } \Pi_y) - \frac{1}{n_y}(\text{number of } y\text{'s classified in } \Pi_x)\right] .$$

Thus we choose the coefficients $\lambda$ of the hyperplane through m such that the weighted average of the numbers of those x's and y's on the "wrong" side of this plane is minimized. A similar criterion has been studied by others (Glick 1969), the difference being that we restrict our search to hyperplanes that pass through the vector m. Of course, we could use the ranking procedure for classifications. First, z is placed with the x's so that the training samples are $x_1, \ldots, x_{n_x}, z$ and $y_1, \ldots, y_{n_y}$. This alters the U function slightly because the training samples have sizes $n_x + 1$ and $n_y$, respectively. Then $R_x(z)$ is the rank of $\lambda_x' z$ among $\lambda_x' x_1, \ldots, \lambda_x' x_{n_x}, \lambda_x' z$, where $\lambda_x$ is the maximizing value of $\lambda$. Similarly, we place z with the y's and use training samples of sizes $n_x$ and $n_y + 1$ to obtain $R_y(z)$ as the rank of $-\lambda_y' z$ among $-\lambda_y' y_1, \ldots, -\lambda_y' y_{n_y}, -\lambda_y' z$, where now $\lambda_y$ is the maximizing value of $\lambda$. The function $T(\cdot)$ actually used in the Monte Carlo study described in Section 4 is
$$T(d) = \begin{cases} -1 , & d \le -2 , \\ \sin(\pi d / 4) , & -2 < d < 2 , \\ 1 , & d \ge 2 . \end{cases}$$
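A minimal numerical sketch of maximizing (2.2) with this T, assuming NumPy and SciPy. The surviving text does not say how the maximization was carried out (the reference list's Fletcher and Powell 1963 entry suggests a descent method); Nelder-Mead is used here purely as a stand-in, and only the direction of the maximizer matters.

```python
import numpy as np
from scipy.optimize import minimize

def T(d):
    """The bounded, nondecreasing, odd T of Section 2 used in the Monte Carlo study."""
    return np.where(d <= -2, -1.0, np.where(d >= 2, 1.0, np.sin(np.pi * d / 4)))

def U(lam, X, Y, m, Sx, Sy):
    """Separation criterion (2.2) for the coefficient vector lam."""
    sx = np.sqrt(lam @ Sx @ lam)
    sy = np.sqrt(lam @ Sy @ lam)
    return T((X - m) @ lam / sx).mean() + T((m - Y) @ lam / sy).mean()

def ranking_direction(X, Y, m, Sx, Sy):
    """Maximize U over lam, started from the LDF-like direction."""
    lam0 = np.linalg.solve((Sx + Sy) / 2, X.mean(axis=0) - Y.mean(axis=0))
    res = minimize(lambda l: -U(l, X, Y, m, Sx, Sy), lam0, method="Nelder-Mead")
    return res.x
```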
3. DISCRIMINANT FUNCTIONS USING M-ESTIMATES

[The opening of this section, which defines the distances $d_i$ and the weights $w_i$ used to form the M-estimates $\bar{x}^*$ and $S_x^*$, did not survive extraction.]

This constitutes the first iteration. We can now compute new distances by replacing $\bar{x}$ and $S_x$ by $\bar{x}^*$ and $S_x^*$ in the formula for $d_i$. These in turn produce new weights and thus new values for $\bar{x}^*$ and $S_x^*$. The final values used in our ranking functions are those obtained on the fifth computation of $\bar{x}^*$ and $S_x^*$. This procedure was repeated with the sample from $\Pi_y$ to obtain $\bar{y}^*$ and $S_y^*$. These estimates were used in RQH and RTH, and a pooled estimate of the covariance matrix was used in RLH; the pooled estimate was constructed from $S_x^*$ and $S_y^*$ as in (1.3). These particular weights $w_i$ were suggested previously by Huber (1977). Five iterations were chosen after some preliminary Monte Carlo experimentation in which it was observed that, while the estimates had not necessarily converged after five iterations, the effects of outliers had been reduced substantially, so that continued iteration until convergence was not necessary.
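Since the paper's exact weight definition was lost, the following Python sketch uses the generic Huber choice $w(d) = \min(1, k/d)$ with $k = 2$ (a fragment of the lost text suggests $w_i = 1$ when $d_i \le 2$, consistent with this form). Both the constant $k$ and the particular weighted covariance update are assumptions; the sketch is meant only to show the shape of the five-step iteration.

```python
import numpy as np

def huber_m_estimates(X, k=2.0, n_iter=5):
    """Iteratively reweighted mean and covariance, five iterations as in Section 3.
    d_i is the Mahalanobis distance of x_i from the current estimates; points
    with d_i > k are downweighted by w_i = k / d_i (assumed weight function)."""
    xbar, S = X.mean(axis=0), np.cov(X, rowvar=False)
    for _ in range(n_iter):
        diff = X - xbar
        d = np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff))
        w = np.minimum(1.0, k / np.maximum(d, 1e-12))
        xbar = (w[:, None] * X).sum(axis=0) / w.sum()   # new mean estimate
        diff = X - xbar
        S = (w[:, None] * diff).T @ diff / w.sum()      # new covariance (one convention)
    return xbar, S
```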
4. MONTE CARLO RESULTS

In this section we present the results of a Monte Carlo study of the procedures described in Sections 2 and 3. Here we consider only bivariate distributions; that is, p = 2. The size of the training samples is $n_x = n_y = 30$ in one case and $n_x = 20$, $n_y = 40$ in the other. The correlation coefficient within each population is equal to ½ for all cases reported here. Cases in which the correlation was zero were also run, and they exhibited similar results. While we considered many distributional situations, we have selected only 12 representative ones on which to report.

[Table 1. Distributional situations: for each of the 12 situations, the population types (nor, Cau, log, con) together with the parameters $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$; the tabular layout could not be recovered reliably from the scan.]

Note that a bivariate Cauchy (Cau) is included in some of these situations; of course, the means, variances, and correlation coefficient do not exist for this type of distribution. Hence, a word of explanation is necessary. A scale parameter was selected for a central univariate Cauchy random variable W so that $\Pr(-1 < W < 1) = 0.683$. Then the central bivariate Cauchy pdf (Johnson and Kotz 1972, p. 134) is proportional to that found by replacing $w^2$ in the univariate pdf by $x' \Sigma^{-1} x$, where $x' = (x_1, x_2)$ and $\Sigma$ is the desired (here pseudo-) covariance matrix. Of course, a noncentral bivariate Cauchy can be created by translating the central one. In each noncontaminated population $\Pi_y$, not only those associated with the bivariate Cauchy distributions, the amount of translation is such that [the condition defining the spacing did not survive extraction]. In the contamination cases, the main populations (contaminants deleted) were also spaced in this fashion.

The description of the bivariate log-normal (log) distribution is given in Johnson and Kotz (1972, pp. 17-19). The bivariate normal (nor) is of course used as one of the main distributions, but it is also the basis of the contaminated (con) situations. In each of the contaminated distributional situations (9, 10, 11, and 12), the second distribution listed is the ten percent contaminating bivariate normal distribution of the first. In all of the odd-numbered situations (1, 3, 5, 7, 9, and 11), the covariance matrices of $\Pi_x$ and $\Pi_y$ are equal, while in the even-numbered situations (2, 4, 6, 8, 10, and 12) they are unequal. Also recall that the correlation coefficient is equal to ½ in each noncontaminated case.
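A sketch of one way to generate such observations in Python, using the multivariate-t-with-one-degree-of-freedom representation of the elliptical Cauchy. The representation and the NumPy sampler are our choices, not the paper's (which predates them), but the scale calibration $\Pr(-1 < W < 1) = 0.683$ is the one described above.

```python
import numpy as np

rng = np.random.default_rng(0)   # stand-in; the paper used Super-Duper

# Calibrate the univariate scale s so that Pr(-1 < W < 1) = 0.683,
# i.e., (2/pi) * arctan(1/s) = 0.683.
s = 1.0 / np.tan(0.683 * np.pi / 2)

def bivariate_cauchy(mu, Sigma, n):
    """n draws from a bivariate Cauchy centered at mu with pseudo-covariance
    Sigma, via x = mu + s * L u / |g| with L L' = Sigma, u ~ N(0, I), g ~ N(0, 1)."""
    L = np.linalg.cholesky(Sigma)
    u = rng.standard_normal((n, 2))
    g = np.abs(rng.standard_normal((n, 1)))
    return np.asarray(mu) + s * (u / g) @ L.T
```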
[Table 2. Empirical Percentages Misclassified when $n_x = n_y = 30$. The table reports, for each of the 12 distributional situations, the percentage of $\Pi_x$ and of $\Pi_y$ observations misclassified by each of the procedures L, RL, RQ, RLH, RQH, RT, and RTH; the entries could not be recovered reliably from the scan. Footnote: the estimated standard errors of the empirical percentages in this table do not exceed 2.6; a typical value is 1.0.]
[Table 3. Empirical Percentages Misclassified when $n_x = 20$ and $n_y = 40$. Same layout as Table 2; the entries could not be recovered reliably from the scan. Footnote: the estimated standard errors of the empirical percentages in this table do not exceed 2.4; a typical value is 0.9.]
For the contaminated cases, all of the main distributions and contaminating distributions have separate correlation coefficients equal to ½.

Recall that for convenience in this study we have used rank cutoffs designed to asymptotically balance the two misclassification probabilities. We see in Tables 2 and 3 that the rank cutoff is very effective in controlling this relative emphasis of the decision rule. For example, in Table 2, in situation 8, Fisher's linear discriminant function had observed misclassification percentages of 14 and 40 with respective estimated standard errors of 0.6 and 0.9, while the rule RLH, for example, had observed misclassification percentages of 25 and 26 with estimated standard errors of 0.9 and 0.8.

Overall, we were pleased with the performance of the T-procedure (RT) as compared to L or RL. There are definite gains in the heavy-tailed situations 3, 4, 5, and 6, and the contaminated situations 9, 10, 11, and 12. However, it is quite interesting to note that substituting Huber's M-estimates into each of the rules improves performance; that is, RLH is better than RL, RQH is better than RQ, and RTH is better than RT. In general, the RLH and RQH procedures are the ones that we would recommend for use, RLH when the covariance matrices seem to be equal and RQH otherwise (an adaptive selection is discussed in Randles, Broffitt, Ramberg, and Hogg 1978). In particular, consider distributional situation 12, comparing RQ to RQH. In this case, the distributions were contaminated not only by changing the standard deviations but also by changing the means. This created relatively bad estimates that were used in the usual QDF, and hence the ranking through RQ was poor. However, the Huber procedures gave very little weight to those outliers from the contaminating distribution and thus produced relatively good estimates; consequently, the ranking through the RQH procedure was good. Note that the two respective averages were dramatically different, namely 0.451 and 0.209. Not only did we observe this in distributional situation 12, but also in others (not reported here) in which the mean differed between the contaminating and the main distributions. Surely this is an important case in applications.

The Monte Carlo study was conducted using the Super-Duper random number generator (Marsaglia, Ananthanarayanan, and Paul 1973) on an IBM 360/65 computer. Two blocks of 30 (or 20 and 40) bivariate variables were generated to provide the training samples from the various $\Pi_x$ and $\Pi_y$ distributions. One additional block of 50 bivariate variables was generated from each of the respective populations to provide the z-values to be classified. This total operation was then repeated 100 times. The misclassification probabilities and their standard errors were computed from the 100 replications.

[Received December 1976. Revised November 1977.]
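For readers who wish to replicate the design, here is a skeleton of the replication loop in Python. The sampler and classifier arguments are caller-supplied stand-ins, and nothing here reproduces Super-Duper or the original 1978 stream of random numbers.

```python
import numpy as np

def mc_error_rates(sample_x, sample_y, classify, n_x=30, n_y=30, n_z=50, reps=100):
    """Per replication: draw training samples, draw 50 z's from each population,
    classify them, and record the two misclassification rates; report the means
    and standard errors over the 100 replications, as in Tables 2 and 3."""
    ex, ey = [], []
    for _ in range(reps):
        X, Y = sample_x(n_x), sample_y(n_y)
        ex.append(np.mean([classify(X, Y, z) != "Pi_x" for z in sample_x(n_z)]))
        ey.append(np.mean([classify(X, Y, z) != "Pi_y" for z in sample_y(n_z)]))
    ex, ey = np.asarray(ex), np.asarray(ey)
    return (ex.mean(), ex.std(ddof=1) / np.sqrt(reps),
            ey.mean(), ey.std(ddof=1) / np.sqrt(reps))
```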
REFERENCES

Broffitt, James D., Randles, Ronald H., and Hogg, Robert V. (1976), "Distribution-Free Partial Discriminant Analysis," Journal of the American Statistical Association, 71, 934-939.

Fletcher, R., and Powell, M.J.D. (1963), "A Rapidly Convergent Descent Method for Minimization," Computer Journal, 6, 163-168.

Glick, N. (1969), "Estimating Unconditional Probabilities of Correct Classification," Technical Report No. 19, Department of Statistics, Stanford University.

Huber, P.J. (1977), "Robust Covariances," Technical Report, Swiss Federal Institute of Technology, Zurich.

Johnson, Norman L., and Kotz, Samuel (1972), Distributions in Statistics: Continuous Multivariate Distributions, New York: John Wiley & Sons.

Marsaglia, G., Ananthanarayanan, K., and Paul, N. (1973), "How to Use the McGill Random Number Package 'SUPER-DUPER'," School of Computer Science, McGill University.

Randles, Ronald H., Broffitt, James D., Ramberg, John S., and Hogg, Robert V. (1978), "Discriminant Analysis Based on Ranks," Journal of the American Statistical Association, 73, 379-384.

Rao, C. Radhakrishna (1954), "A General Theory of Discrimination when the Information about Alternative Population Distributions Is Based on Samples," Annals of Mathematical Statistics.