Discriminant Procedures Based on Efficient Robust Discriminant Coordinates

Kimberly Crimin, Wyeth Research
Joseph W. McKean, Western Michigan University
Simon J. Sheather, Texas A & M University
Abstract

For multivariate data collected over groups, discriminant analysis is a two-stage procedure: separation and allocation. For the traditional least squares procedure, separation of the training data into groups is accomplished by maximizing the Lawley-Hotelling test for differences between the group means. This produces a set of discriminant coordinates which are used to visualize the data. Using the nearest center rule, the discriminant representation can then be used to allocate data of unknown group membership. In this paper, we propose an approach to discriminant analysis based on efficient robust discriminant coordinates. These coordinates are obtained by maximizing a Lawley-Hotelling test based on robust estimates. The design matrix used in the fitting is the usual one-way incidence matrix of zeros and ones; hence, our procedure can use highly efficient robust estimators to do the fitting. This produces efficient robust discriminant coordinates which allow the user to visually assess the differences among the groups. Further, the allocation is based on the robust discriminant representation of the data using the nearest robust center rule. We discuss our procedure in terms of an affine-equivariant estimating procedure, and the robustness of our procedure is verified in several examples. In a Monte Carlo study of misclassification probabilities over a variety of error distributions, the robust discriminant analysis performs practically as well as the traditional procedure for good data and is much more efficient than the traditional procedure in the presence of outliers and heavy-tailed error distributions. Further, our procedure is much more efficient than a high-breakdown procedure.
KEY WORDS: Affine-equivariant estimators; Least squares; Linear discriminant rule; Nearest center rule; Nonparametrics; Rank-based analysis; Wilcoxon analysis; Visualization.
1 Introduction
Consider a multivariate data set where items belong to one of $g$ groups. For such data, discriminant analysis can be thought of as a two-stage process: separation and allocation; see, for instance, Johnson and Wichern (1998) or Seber (1984). In the separation stage, the goal is to find a representation of the observations that clearly separates the groups. This stage is exploratory in nature, and statistical procedures at this stage are inherently graphical. The separation stage results in a kernel, and the associated graphical procedure (visualization) is based on the spectral decomposition of this kernel. In the allocation stage, the goal is to assign an unclassified object to one of the known groups using the rule that optimally separates the training data.

In Section 2 we review a discriminant analysis procedure based on discriminant coordinates and traditional least squares estimates. Discriminant coordinates are obtained by maximizing the Lawley-Hotelling test for differences between group means and can be used to graphically display the data. In the allocation stage, we use the discriminant representation of the data and the simple "nearest center" rule (in terms of the Mahalanobis distance) to assign an unclassified object to one of the known groups. Assuming a multivariate normal distribution for each group (with homogeneous covariance structure) and equal prior probabilities, this simple rule is equivalent to the traditional rule with the usual estimates substituted ("plugged in") for the parameters.

In Section 3, we propose a discriminant analysis procedure based on efficient robust discriminant coordinates. As with the traditional analysis, the robust discriminant analysis is a two-stage process (separation and allocation) based on the robust discriminant coordinate representation. This representation is obtained by maximizing a robust Lawley-Hotelling test for differences between group centers. Furthermore, the efficiency of our procedure is based on how well the procedure separates small differences among these centers (local alternatives). We show that this efficiency is the same as the efficiency of the robust estimators. Because the fitting is based on the usual one-way incidence design matrix of 0s and 1s, highly efficient robust estimates can be used, which results in a highly efficient discrimination procedure. These robust discriminant coordinates allow the user to visually (graphically) assess the differences among groups and to robustly explore the data. Most robust estimation schemes can be used in our procedure; all that is required is a $\sqrt{n}$-consistent equivariant estimator of location with an asymptotic linearity result and a consistent estimate of its asymptotic variance-covariance matrix. The allocation rule is the simple "nearest center" rule in terms of a Mahalanobis distance using the variance-covariance estimate found in the robust version of the Lawley-Hotelling test statistic.
In Section 4 we use the affine equivariant robust estimators of multivariate location and scatter proposed by Hettmansperger and Randles (2002) in the generic procedure discussed in the previous section. Their proposed estimator combines an $L_1$, or spatial, median with Tyler's (1987) M-estimate of scatter. The resulting estimates have a bounded influence function, a positive breakdown point, and high efficiency for heavy-tailed error structures. Furthermore, if multivariate normal errors are assumed, the "nearest center" allocation rule can be fine-tuned to be a consistent estimate of the optimal rule, similar to the traditional plug-in rule. In Section 5 the robustness of our procedure is illustrated with examples.

In Section 6 we present the results of a simulation study for the following three methods: the proposed method described in the last paragraph based on Hettmansperger and Randles' (2002) estimates (HR); the traditional least squares procedure (LS); and a high-breakdown but low-efficiency method proposed by Hawkins and McLachlan (1997) (HM). Besides the multivariate normal distribution, we generated data from elliptical contaminated normal and $t$ distributions. The theoretical robustness and efficiency properties of the HR procedure discussed in Section 4 are verified for the situations investigated. Except for the normally distributed data, the HR procedure was more efficient than the LS procedure in terms of empirical misclassification probabilities. Further, it was much more efficient than the high-breakdown but low-efficiency procedure, even at the elliptical Cauchy distribution.

There are other robust discriminant analysis procedures in the literature, some of which, such as the Hawkins and McLachlan (1997) procedure, substitute robust estimates for the traditional estimates in the linear discriminant rule. An example which illustrates the difference between such procedures and ours was proposed by Randles et al. (1978). Their procedure replaces the sample means by Huber-type location estimates and the sample variance-covariance matrix by a weighted estimate. Our procedure, though, maximizes a robust Lawley-Hotelling test for differences between group centers. If Huber estimates are used in our procedure, then the location estimates are similar to those of Randles et al., but the estimates of scatter differ: our estimates use the standardization required by the associated Lawley-Hotelling test statistic. Hence, in this case, the efficiency of our procedure is the same as the efficiency of the Huber estimator; thus our procedure is highly efficient. The Randles et al. weighted estimate of scatter is not estimating the same matrix, and its efficiency properties will differ; generally the weighting results in lower efficiency (see Chapter 5 of Hettmansperger and McKean, 1998).
2 Traditional Discriminant Analysis
2.1 Notation
Suppose there are $g$ distinct groups. Let $x_{ij}$ represent the $k \times 1$ random vector of the measured characteristics made on the $j$th object in the $i$th group, $j = 1, \ldots, n_i$, $i = 1, \ldots, g$, with $n = \sum_{i=1}^{g} n_i$. The $n \times k$ data matrix $X$ contains the $n$ row vectors $x_{ij}'$ of the multivariate observations. Let $\mu_i$, $i = 1, \ldots, g$, denote the mean of the $i$th group and let $\mu$ denote the $g \times k$ matrix whose $i$th row is $\mu_i'$. Let $\pi_i$, $i = 1, \ldots, g$, denote the prior probability that an observation $x$ belongs to the $i$th group. In this scenario, the model of interest is the one-way multivariate linear model

$$X = W\mu + e, \eqno(2.1)$$
where $W$ is the incidence matrix and $e$ is an $n \times k$ matrix of random errors with $E(e_{ij}) = 0$ and $\mathrm{Var}(e_i) = \Sigma$, where $e_i'$ is the $i$th row of the matrix $e$. Assume that $e_i$ has density function $f(x)$ and distribution function $F(x)$, and denote the $j$th marginal cdf and pdf of $e_i$ by $F_j(x_j)$ and $f_j(x_j)$, respectively. Let $\Omega$ denote the column space of $W$ and let $P_\Omega$ denote the projection matrix onto the subspace $\Omega$. We next briefly describe the traditional analysis; see, for instance, Chapters 5 and 6 of Seber (1984) for more details.
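To make the one-way setup concrete, the following minimal sketch (an illustration, not part of the original paper; the group sizes and data are hypothetical) builds the incidence matrix $W$ and the least squares fit $\hat{\mu}_{LS} = (W'W)^{-1}W'X$:

    import numpy as np

    def incidence_matrix(group_sizes):
        # One-way incidence matrix W (n x g) of 0s and 1s:
        # W[j, i] = 1 exactly when observation j belongs to group i.
        n, g = sum(group_sizes), len(group_sizes)
        W = np.zeros((n, g))
        row = 0
        for i, n_i in enumerate(group_sizes):
            W[row:row + n_i, i] = 1.0
            row += n_i
        return W

    # Hypothetical example: g = 3 groups of 50 observations, k = 4 variables.
    rng = np.random.default_rng(0)
    W = incidence_matrix([50, 50, 50])
    X = rng.normal(size=(W.shape[0], 4))         # stand-in for the n x k data matrix
    mu_LS = np.linalg.solve(W.T @ W, W.T @ X)    # g x k matrix of group means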
2.2 Separation
Discriminant coordinates were introduced as a dimension reduction technique useful for "examining clustering effects in the data"; see, e.g., Gnanadesikan (1977) or Seber (1984). The goal of discriminant coordinates is to find the linear combinations of the data that "best" separate the groups of observations. The amount of separation between the groups is proportional to the size of the test statistic for testing

$$H_0: A\mu = 0 \ \text{ versus } \ H_A: A\mu \neq 0, \eqno(2.2)$$
where $A$ is the usual contrast matrix for testing the equality of the $g$ group means. Since we are interested in the "maximum" amount of separation between the groups, this is equivalent to finding the vector $c$ that maximizes the Lawley-Hotelling type of test statistic for

$$H_0: A\mu c = 0 \ \text{ versus } \ H_A: A\mu c \neq 0. \eqno(2.3)$$
Let $\hat{\mu}_{LS}$ be the argument that minimizes $\mathrm{tr}\,(X - W\mu)'(X - W\mu)$; then $\hat{\mu}_{LS}$ is the traditional least squares estimate of $\mu$. The associated Lawley-Hotelling test statistic is

$$T_{LS} = \mathrm{tr}\,(A\hat{\mu}_{LS}c)'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_{LS}c)(c'\hat{\Sigma}c)^{-1}, \eqno(2.4)$$

where $\hat{\Sigma} = \frac{1}{n-g}X'(I - P_\Omega)X$ is the usual estimate of the variance-covariance matrix. Under the null hypothesis (2.3), $T_{LS}$ has an approximate $\chi^2_{g-1}$-distribution. By the generalized Cauchy-Schwarz inequality, the maximum value of $T_{LS}$, (2.4), is $\lambda_1$, the maximum eigenvalue of

$$\hat{\Sigma}^{-1}(A\hat{\mu}_{LS})'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_{LS}), \eqno(2.5)$$

and the direction of maximum separation is $c_1$, the corresponding orthonormal eigenvector. Then proceed as in principal components, obtaining $k$ orthogonal directions $c_1, c_2, \ldots, c_k$, which are the $k$ orthonormal eigenvectors corresponding to the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k \geq 0$ of the matrix in expression (2.5). The eigenvalues of the matrix in expression (2.5) are the same as the eigenvalues of

$$K_{LS} = \hat{\Sigma}^{-1/2}(A\hat{\mu}_{LS})'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_{LS})\hat{\Sigma}^{-1/2}, \eqno(2.6)$$

which is symmetric and, hence, easier to handle numerically. Let $a_i$, $i = 1, \ldots, k$, denote the corresponding eigenvectors of $K_{LS}$, (2.6). It can easily be shown that $c_i = \hat{\Sigma}^{-1/2}a_i$, $i = 1, \ldots, k$. The vector $c_i$ is called the $i$th discriminant direction. Let $C = [c_1, \ldots, c_k]$. Then the discriminant coordinate representation of the matrix $X$ is $Z = XC$, where the columns of $Z$ are the discriminant coordinates. The matrix $K_{LS}$, (2.6), is called the kernel of the traditional procedure, and the associated visualization procedures are graphical methods based on the discriminant coordinates.
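Computationally, the separation stage reduces to one symmetric eigen-decomposition. The sketch below (our code, continuing the previous snippet; the particular choice of contrast matrix $A$ is one common option, not prescribed by the paper) computes the kernel (2.6) and the discriminant coordinates $Z = XC$:

    def inv_sqrt(S):
        # Symmetric inverse square root S^{-1/2} via the spectral decomposition.
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    def discriminant_directions(W, mu_hat, Sigma_hat):
        # Eigen-decompose the symmetric kernel K and return the discriminant
        # directions C (columns c_i = Sigma^{-1/2} a_i) and the eigenvalues.
        g = W.shape[1]
        A = np.hstack([np.eye(g - 1), -np.ones((g - 1, 1))])  # (g-1) x g contrasts
        S_ih = inv_sqrt(Sigma_hat)
        M = A @ mu_hat                                        # contrasts of the centers
        G = np.linalg.inv(A @ np.linalg.inv(W.T @ W) @ A.T)   # (A (W'W)^{-1} A')^{-1}
        K = S_ih @ M.T @ G @ M @ S_ih                         # symmetric k x k kernel
        lam, a = np.linalg.eigh(K)                            # ascending eigenvalues
        order = np.argsort(lam)[::-1]
        return S_ih @ a[:, order], lam[order]

    n, g = W.shape
    resid = X - W @ mu_LS
    Sigma_hat = resid.T @ resid / (n - g)        # pooled estimate, as below (2.4)
    C, lam = discriminant_directions(W, mu_LS, Sigma_hat)
    Z = X @ C                                    # discriminant coordinates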
2.3 Allocation
The objective of allocation is to classify an unknown object into one of the $g$ known groups. Recall that the rule that minimizes the total probability of misclassification (TPM) is:

Assign $x$ to $G_i$ if $\pi_i f_i(x) \geq \pi_j f_j(x)$ for all $j = 1, \ldots, g$;  (2.7)

see, for instance, Seber (1984). If $f_i(x)$ is the pdf of a $N_k(\mu_i, \Sigma)$ distribution, then the optimal rule becomes:

Assign $x$ to $G_i$ if $L_i(x) \geq L_j(x)$ for all $j = 1, 2, \ldots, g$,  (2.8)

where, up to an additive term that is the same for all groups,

$$L_i(x) = \ln(\pi_i f_i(x)) = \ln \pi_i + \mu_i'\Sigma^{-1}\left(x - \tfrac{1}{2}\mu_i\right).$$

If the prior probabilities are assumed equal, then a short algebraic derivation shows that the rule in expression (2.8) is equivalent to the nearest center rule, where distance is measured by the Mahalanobis distance. That is, expression (2.8) is equivalent to:

Assign $x$ to $G_i$ if $(x - \mu_i)'\Sigma^{-1}(x - \mu_i) \leq (x - \mu_j)'\Sigma^{-1}(x - \mu_j)$ for all $j = 1, 2, \ldots, g$.  (2.9)
In practice, the traditional estimates are substituted for the parameters. The discriminant coordinate representation $Z = XC$ is the representation of the data which gives maximal separation, so it is the appropriate representation from which to work. In practice, we may use only the first several principal discriminant coordinates to do the allocation. The nearest center rule for discriminant coordinates is:

Assign $x$ to $G_i$ if $D_i(z) \leq D_j(z)$ for all $j = 1, 2, \ldots, g$,  (2.10)

where $\hat{\Sigma}_z = C'\hat{\Sigma}_x C$, $\hat{\mu}_{zi}'$ is the $i$th row of the $g \times k$ matrix

$$\hat{\mu}_{z,LS} = \hat{\mu}_{LS}\,C, \eqno(2.11)$$

and

$$D_i(z) = (z - \hat{\mu}_{zi})'\hat{\Sigma}_z^{-1}(z - \hat{\mu}_{zi}).$$
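Allocation by the nearest center rule (2.10) is then a handful of lines. This sketch (ours, continuing the earlier snippets) uses the first $d$ discriminant coordinates, with $d = 2$ as in the examples of Section 5:

    def allocate(x, C, mu_hat, Sigma_hat, d=2):
        # Nearest center rule (2.10) in the first d discriminant coordinates.
        Cd = C[:, :d]
        z = Cd.T @ x                               # coordinates of the new point
        mu_z = mu_hat @ Cd                         # g x d group centers, as in (2.11)
        Sz_inv = np.linalg.inv(Cd.T @ Sigma_hat @ Cd)
        dists = [(z - m) @ Sz_inv @ (z - m) for m in mu_z]
        return int(np.argmin(dists))               # index of the assigned group

    group = allocate(X[0], C, mu_LS, Sigma_hat)    # allocate the first observation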
3 Efficient Robust Discriminant Analysis
In this section, we outline a generic robust discrimination procedure, which is analogous to the traditional procedure. The separation stage is based on maximizing a robust Lawley-Hotelling test statistic. The efficiency of this stage is determined by how powerful this test is in detecting small differences in location.
3.1 Separation
To derive robust discriminant coordinates, begin with a robust estimate $\hat{\mu}_P$ of $\mu$ in model (2.1), where $P$ denotes a generic robust estimating procedure. Recall that $W$ is an incidence matrix, so a highly efficient robust estimator can be used. Assume, under regularity conditions, that

$$\hat{\mu}_P \text{ is asymptotically } N_{g,k}(\mu, (W'W)^{-1}, \Sigma_P), \eqno(3.12)$$

where $\Sigma_P$ is the asymptotic variance-covariance matrix of $\hat{\mu}_P$. Note that the square root of this matrix, $\Sigma_P^{1/2}$, is the multivariate analog of the standard error of the estimate. Let $\hat{\Sigma}_P$ be a consistent estimate of $\Sigma_P$. Then the Lawley-Hotelling type test statistic for the hypotheses (2.3) is

$$T_P = (A\hat{\mu}_P c)'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_P c)(c'\hat{\Sigma}_P c)^{-1}. \eqno(3.13)$$
Proceeding as in traditional discriminant coordinates, obtain the $k$ orthogonal directions $a_{P1}, \ldots, a_{Pk}$ corresponding to the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_k \geq 0$ of the matrix

$$K_P = \hat{\Sigma}_P^{-1/2}(A\hat{\mu}_P)'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_P)\hat{\Sigma}_P^{-1/2}. \eqno(3.14)$$

The robust discriminant coordinates are the columns of $Z_P = XC_P$, where $C_P = [c_{P1}, \ldots, c_{Pk}]$ and $c_{Pi} = \hat{\Sigma}_P^{-1/2}a_{Pi}$. In particular, the vector $c_{P1}$ gives the direction of maximal separation for the generic robust procedure $P$. The matrix $K_P$, (3.14), is the kernel of the robust procedure, and the associated visualization procedure is based on the robust discriminant coordinates, the columns of $Z_P$.
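Computationally, the robust separation stage reuses the least squares machinery verbatim; only the estimates change. A sketch (ours, reusing `discriminant_directions` from the snippet in Section 2.2; `robust_fit` is a hypothetical placeholder for any fitting routine satisfying the assumptions above):

    def robust_discriminant_coordinates(X, W, robust_fit):
        # Generic separation stage: robust_fit(X, W) must return the g x k
        # center estimate mu_P and a consistent estimate Sigma_P of its
        # asymptotic variance-covariance matrix (the standardization the
        # Lawley-Hotelling statistic requires).
        mu_P, Sigma_P = robust_fit(X, W)
        C_P, lam = discriminant_directions(W, mu_P, Sigma_P)
        return X @ C_P, C_P, lam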
3.2 Efficiency
The efficiency of the procedure depends on how well the test statistic $T_P$ detects small differences among the means. A way to measure this is to determine the asymptotic power of the test $T_P$ under the local alternatives

$$H_n: A\mu_n = \frac{1}{\sqrt{n}}\,A\mu_0, \eqno(3.15)$$
where $\mu_0$ is a $g \times k$ matrix not equal to the zero matrix and $A$ is the $(g-1) \times g$ contrast matrix given in the hypotheses (2.2). Assume a sequence of linear models of the form (2.1) indexed by $n$. Let $W_n$ denote the incidence matrix and assume that

$$\lim_{n \to \infty} n^{-1}W_n'W_n = \Sigma_W, \eqno(3.16)$$
where $\Sigma_W$ is positive definite. The asymptotic power of the test statistic $T_{P,n}$ can be determined from its asymptotic distribution under this sequence of alternatives. Assuming certain conditions, we can generally show that

$$T_{P,n} \text{ has an asymptotic noncentral } \chi^2_{g-1}(\theta_P)\text{-distribution}, \eqno(3.17)$$

with $g - 1$ degrees of freedom and noncentrality parameter

$$\theta_P = \mathrm{tr}\left[(A\mu_0)'(A\Sigma_W^{-1}A')^{-1}(A\mu_0)\Sigma_P^{-1}\right]. \eqno(3.18)$$
The conditions depend on the specific robust estimator chosen, but often a uniform linearity (quadraticity) result is required. The robust procedure discussed in Section 4 satisfies such a condition. Provided the variance-covariance matrix of the random vector $e_i$ is finite, under the sequence of local alternatives defined in expression (3.15) the LS test statistic $T_{LS,n}$, (2.4), has a noncentral $\chi^2$-distribution with $g - 1$ degrees of freedom and noncentrality parameter

$$\theta_{LS} = \mathrm{tr}\left[(A\mu_0)'(A\Sigma_W^{-1}A')^{-1}(A\mu_0)\Sigma^{-1}\right]. \eqno(3.19)$$

It follows that the asymptotic relative efficiency between the robust procedure and LS is the ratio of the noncentrality parameters,

$$\mathrm{ARE}(P, LS) = \frac{\theta_P}{\theta_{LS}}. \eqno(3.20)$$
This result generalizes the familiar univariate result. Suppose the components of the error random vector $e_i$ are iid with variance $\sigma^2$. Then $\Sigma = \sigma^2 I_k$ and $\Sigma_P = \tau_P^2 I_k$, for some scale parameter $\tau_P$ which depends on the robust procedure used. Thus in the iid case the ARE (3.20) simplifies to the univariate formula, $\mathrm{ARE}(P, LS) = \sigma^2/\tau_P^2$. For the general multivariate case, expression (3.20) does not simplify; however, by comparing the noncentrality parameter $\theta_P$, (3.18), with the asymptotic distribution of $\hat{\mu}_P$, (3.12), the efficiency properties of the separation phase of the robust procedure are essentially the same as the efficiency properties of the robust estimates. Because the fitting is based on the incidence matrix, there are no outliers in factor space; hence, we recommend highly efficient robust estimates.
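As a concrete univariate instance (our worked example, not from the paper): for a Wilcoxon rank-based fit, $\tau_P = \left(\sqrt{12}\int f^2(t)\,dt\right)^{-1}$, and at $N(0, \sigma^2)$ errors $\int f^2(t)\,dt = 1/(2\sigma\sqrt{\pi})$, so

$$\mathrm{ARE}(\text{Wilcoxon}, LS) = \frac{\sigma^2}{\tau_P^2}
  = 12\,\sigma^2\left(\frac{1}{2\sigma\sqrt{\pi}}\right)^{2}
  = \frac{3}{\pi} \approx 0.955,$$

the familiar high efficiency of the Wilcoxon analysis at the normal.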
3.3 Allocation
Let $z = C_P'x$, where $C_P$ is the matrix of robust discriminant directions based on procedure $P$. The robust Mahalanobis distance is

$$D_{P,i}(z) = (z - \hat{\mu}_{P,zi})'\hat{\Sigma}_{P,z}^{-1}(z - \hat{\mu}_{P,zi}), \eqno(3.21)$$

where the estimates of location and scatter are given by

$$\hat{\mu}_{P,zi} = C_P'\hat{\mu}_{P,xi} \ \text{ and } \ \hat{\Sigma}_{P,z} = C_P'\hat{\Sigma}_P C_P, \eqno(3.22)$$

respectively. Then a nearest center, robust linear discriminant rule is:

Assign $x$ to $G_i$ if $D_{P,i}(z) \leq D_{P,j}(z)$ for all $j = 1, 2, \ldots, g$.  (3.23)
As discussed in Section 2.3, under the assumption that the prior probabilities of group membership are all equal, the equivalence between the traditional rules (2.8) and (2.9) is purely algebraic. Hence, the robust rule

Assign $x$ to $G_i$ if $\hat{L}_i(z) \geq \hat{L}_j(z)$ for all $j = 1, 2, \ldots, g$,  (3.24)

where

$$\hat{L}_i(z) = -\ln g + \hat{\mu}_{P,zi}'\hat{\Sigma}_{P,z}^{-1}\left(z - \tfrac{1}{2}\hat{\mu}_{P,zi}\right), \quad i = 1, 2, \ldots, g,$$

is equivalent to rule (3.23). For the remainder of this article, we use the nearest center rule (3.23). The proposed robust discriminant rule (3.23) can be used with most robust estimators. All that is required is a $\sqrt{n}$-consistent estimate of location and a consistent estimate of its asymptotic variance-covariance matrix. The efficiency of the procedure is the same as the efficiency of the robust estimator $\hat{\mu}_P$. While the rule is based on asymptotic theory, the Monte Carlo study presented in Section 6 verifies, over the situations covered, the robustness and validity of the procedure based on the estimator discussed in Section 4. This empirical study involved estimates based on a sample size of 25.
4 Robust Affine Equivariant Estimate
Hettmansperger and Randles (2002) proposed an M-estimate of multivariate location which is affine equivariant and robust with positive breakdown. The Hettmansperger and Randles (HR) estimator combines the $L_1$ (spatial) median with the M-estimate of scatter proposed by Tyler (1987). The HR estimate minimizes the dispersion function

$$\sum_{i=1}^{n} \|\hat{A}_T(x_i - \mu)\|, \eqno(4.25)$$

where $\hat{A}_T$ is a $k \times k$ upper triangular, positive definite matrix (with a one in the upper left corner) chosen to satisfy

$$n^{-1}\sum_{i=1}^{n} \frac{\hat{A}_T(x_i - \mu)(x_i - \mu)'\hat{A}_T'}{\|\hat{A}_T(x_i - \mu)\|^2} = k^{-1}I, \eqno(4.26)$$

where $I$ is the $k \times k$ identity matrix and $\|\cdot\|$ denotes the Euclidean norm. Let $\hat{\mu}_{HR}$ be the value that minimizes (4.25). Under model (2.1),

$$\hat{\mu}_{HR} \text{ is asymptotically } N_{g,k}(\mu, (W'W)^{-1}, B^{-1}A^*B^{-1}), \eqno(4.27)$$
where

$$A^* = E\left[\frac{A_T(X - \mu)(X - \mu)'A_T'}{\|A_T(X - \mu)\|^2}\right] \ \text{ and } \ B = E\left[\frac{1}{\|A_T(X - \mu)\|}\left(I - \frac{A_T(X - \mu)(X - \mu)'A_T'}{\|A_T(X - \mu)\|^2}\right)\right]. \eqno(4.28)$$

Further, $\hat{A}_T$ is a consistent estimator of $A_T$; see Hettmansperger and Randles (2002) for discussion. Let $\hat{\Sigma}_{HR} = \hat{B}^{-1}\hat{A}^*\hat{B}^{-1}$, where $\hat{B}$ and $\hat{A}^*$ are the respective matrices $B$ and $A^*$ with $A_T$ replaced by $\hat{A}_T$. Then $\hat{\Sigma}_{HR}$ is a consistent estimate of $\Sigma_{HR}$.
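As a computational note, the HR estimate can be approximated by alternating a Weiszfeld step for the transformed spatial median with Tyler's fixed-point step for the scatter. The sketch below is our simplification for a single sample (Hettmansperger and Randles (2002) discuss computation; in model (2.1) each group gets its own center with a common $A_T$); the starting values and stopping rule are ad hoc:

    import numpy as np

    def hr_estimate(X, n_iter=200, tol=1e-8):
        # Fixed-point sketch of the HR estimate for one sample: the spatial
        # median of the A_T-standardized data plus Tyler's scatter matrix.
        n, k = X.shape
        mu = np.median(X, axis=0)                       # starting location
        V = np.cov(X, rowvar=False)                     # starting scatter
        for _ in range(n_iter):
            L = np.linalg.cholesky(np.linalg.inv(V))
            A = L.T / L.T[0, 0]                         # upper triangular, one in upper left
            d = np.linalg.norm((X - mu) @ A.T, axis=1)  # ||A_T (x_i - mu)||
            w = 1.0 / np.maximum(d, 1e-12)
            mu_new = (w[:, None] * X).sum(axis=0) / w.sum()        # Weiszfeld step
            R = X - mu_new
            q = np.einsum('ij,jk,ik->i', R, np.linalg.inv(V), R)   # r_i' V^{-1} r_i
            V_new = (k / n) * (R / np.maximum(q, 1e-12)[:, None]).T @ R
            V_new *= k / np.trace(V_new)                # Tyler scatter: fix the scale
            done = (np.linalg.norm(mu_new - mu) < tol and
                    np.linalg.norm(V_new - V) < tol)
            mu, V = mu_new, V_new
            if done:
                break
        return mu, V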
4.1 Separation
Using the HR estimators $\hat{\mu}_{HR}$ and $\hat{\Sigma}_{HR}$, the Lawley-Hotelling test statistic for hypothesis (2.3), under model (2.1), is the statistic $T_{HR}$ defined in the following theorem.

Theorem 4.1. Assume the regularity conditions in Hettmansperger and Randles (2002) hold. Let

$$T_{HR} = \mathrm{tr}\,(A\hat{\mu}_{HR}c)'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_{HR}c)(c'\hat{\Sigma}_{HR}c)^{-1}.$$

Then under the null hypothesis $T_{HR}$ is asymptotically $\chi^2_{g-1}$.

Proof. From the asymptotic distribution of $\hat{\mu}_{HR}$ given in equation (4.27), we have that

$$A\hat{\mu}_{HR}c \text{ is asymptotically } N_{g-1}\bigl(A\mu c,\ (c'\Sigma_{HR}c)\,A(W'W)^{-1}A'\bigr).$$

Further, under the null hypothesis $A\mu c = 0$. From these two results, the theorem follows immediately.

Based on this theorem, the kernel of the HR discriminant coordinate procedure is

$$K_{HR} = \hat{\Sigma}_{HR}^{-1/2}(A\hat{\mu}_{HR})'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_{HR})\hat{\Sigma}_{HR}^{-1/2}. \eqno(4.29)$$
Let $a_{HR1}, \ldots, a_{HRk}$ denote the eigenvectors corresponding to the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_k \geq 0$ of the matrix $K_{HR}$. Then the HR robust discriminant coordinates are the columns of $Z_{HR} = XC_{HR}$, where $C_{HR} = [c_{HR1}, \ldots, c_{HRk}]$ and $c_{HRi} = \hat{\Sigma}_{HR}^{-1/2}a_{HRi}$. The associated HR visualization procedure is based on these discriminant coordinates.
4.2 Efficiency
For efficiency results, consider the setup of Section 3.2 with the sequence of local alternatives (3.15). Based on the linearization result given in Hettmansperger and Randles (2002), it follows that under this sequence of local alternatives

$$T_{HR,n} \text{ has an asymptotic noncentral } \chi^2_{g-1}(\theta_{HR})\text{-distribution}, \eqno(4.30)$$

with $g - 1$ degrees of freedom and noncentrality parameter

$$\theta_{HR} = \mathrm{tr}\left[(A\mu_0)'(A\Sigma_W^{-1}A')^{-1}(A\mu_0)\Sigma_{HR}^{-1}\right]. \eqno(4.31)$$

The efficiency of the separation procedure is the same as the efficiency of the HR estimator, which is discussed in Section 3 of Hettmansperger and Randles (2002). In particular, it appears to be highly efficient relative to the LS procedure for heavy-tailed error distributions.
4.3 Allocation
The nearest center rule is:

Assign $x$ to $G_i$ if $D_{HR,i}(z) \leq D_{HR,j}(z)$, $z = C_{HR}'x$, for all $j = 1, 2, \ldots, g$,  (4.32)

where the robust HR Mahalanobis distance is

$$D_{HR,i}(z) = (z - \hat{\mu}_{HR,zi})'\hat{\Sigma}_{HR,z}^{-1}(z - \hat{\mu}_{HR,zi}), \eqno(4.33)$$

and the estimates of location and scatter are given by

$$\hat{\mu}_{HR,zi} = C_{HR}'\hat{\mu}_{HR,xi} \ \text{ and } \ \hat{\Sigma}_{HR,z} = C_{HR}'\hat{\Sigma}_{HR}C_{HR}, \eqno(4.34)$$

respectively.
Because of the affine equivariance of the HR estimator, the same estimates would be obtained from the transformed data $Z$.
4.4 Equivalence to the Traditional Rule
As with most robust procedures, interest centers on how efficient the robust estimate is relative to the traditional estimate under the multivariate normal distribution. Suppose the rows of $e$ in model (2.1) have a symmetric elliptical distribution with density proportional to $c_k h(t't)$. As discussed in Hettmansperger and Randles (2002), $r^2 = \|e\|^2$ has density

$$f_{r^2}(y) = \frac{c_k \pi^{k/2}}{\Gamma(k/2)}\,y^{k/2-1}h(y).$$

Then the asymptotic relative efficiency of $\hat{\mu}_{HR}$ relative to $\hat{\mu}_{LS}$ is

$$\mathrm{ARE}(\hat{\mu}_{HR}, \hat{\mu}_{LS}) = k^{-2}(k-1)^2\,E(r^2)\left[E(r^{-1})\right]^2.$$

At the multivariate normal, the asymptotic relative efficiency of $\hat{\mu}_{HR}$ to the least squares $\hat{\mu}_{LS}$ is

$$\mathrm{ARE}(\hat{\mu}_{HR}, \hat{\mu}_{LS}) = \frac{1}{k}\left(\frac{(k-1)\,\Gamma\!\left(\frac{k-1}{2}\right)}{2^{1/2}\,\Gamma\!\left(\frac{k}{2}\right)}\right)^{2}. \eqno(4.35)$$

If $\hat{\Sigma}_{HR}$ is rescaled by the constant in equation (4.35), then the resulting estimate is consistent for $\Sigma$, and rule (4.32) is asymptotically equivalent to the traditional nearest center rule (2.9).
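For reference, evaluating our reconstruction of (4.35) numerically ($k = 2$ recovers the familiar spatial-median value $\pi/4$):

    from math import lgamma, exp, sqrt

    def are_hr_ls_normal(k):
        # ARE of the HR estimate to LS at the k-variate normal, equation (4.35).
        log_ratio = lgamma((k - 1) / 2.0) - lgamma(k / 2.0)
        return ((k - 1) * exp(log_ratio) / sqrt(2.0)) ** 2 / k

    for k in (2, 4, 10):
        print(k, round(are_hr_ls_normal(k), 4))   # 0.7854, 0.8836, 0.9513

The price paid at the normal is thus modest and shrinks as the dimension grows.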
5 Examples
To investigate the robustness of the procedures, we used Fisher's (1936) classic Iris data set and four contaminated versions of it. Recall that the Iris data set consists of three species of iris with 50 observations on each species; the four variables are sepal length, sepal width, petal length, and petal width. In the plots below, group one is denoted by red circles, group two by green upward-pointing triangles, and group three by blue downward-pointing triangles. We contaminated the Iris data set in four ways: with a single outlier in group one; with 5 outliers in group one; with 5 clustered outliers in group one; and with 3 outliers in group one and 2 outliers in group two. For each data set, the visualizations were constructed using the first two discriminant coordinates of the kernels $K_{LS}$ and $K_{HR}$. We also calculated the probabilities of misclassification (PMC) using leave-one-out cross-validation. The allocations were based on the nearest center rule using the first two discriminant coordinates.
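A sketch (ours, assuming numpy imported as np as in the earlier snippets) of the leave-one-out PMC estimate used here; `fit_and_allocate` is a hypothetical wrapper that refits the chosen procedure's discriminant coordinates on the retained data and allocates the held-out point by the nearest center rule in the first $d$ coordinates:

    def loo_pmc(X, labels, fit_and_allocate, d=2):
        # Leave-one-out estimate of the probability of misclassification.
        n = len(labels)
        errors = 0
        for i in range(n):
            keep = np.arange(n) != i
            pred = fit_and_allocate(X[keep], labels[keep], X[i], d)
            errors += int(pred != labels[i])
        return errors / n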
5.1 Visualization
Figure 1 displays the plots of the first two traditional discriminant coordinates and of the first two HR robust discriminant coordinates for the original Iris data set. In each of the plots, the first coordinate shows a difference in location among the three groups.

[Figure 1: Iris Data (original data). Two panels, Least Squares and HR, each plotting Coordinate 2 versus Coordinate 1.]

Figure 2 displays the plots of the first two traditional discriminant coordinates and the HR robust discriminant coordinates from the Iris data set with one outlier. From these plots, the traditional discriminant coordinates fail to separate the groups or to identify the outlier (first plot, Figure 2), whereas the robust discriminant coordinates identify the outlier (second plot, Figure 2) and separate the groups (third plot, Figure 2).
[Figure 2: Iris Data (1 Outlier). Three panels, Least Squares, HR, and HR zoomed in, each plotting Coordinate 2 versus Coordinate 1.]
Figure 3 displays the plots of the first two traditional discriminant coordinates and the HR robust discriminant coordinates from the Iris data set with 5 outliers. From these plots, the traditional discriminant coordinates separate the groups, but only 3 of the 5 outliers are clearly identified in the plot (first plot, Figure 3). The robust discriminant coordinates identify all 5 outliers and separate the groups (second and third plots, Figure 3).

[Figure 3: Iris Data (5 Outliers). Three panels, Least Squares, HR, and HR zoomed in.]

Figure 4 displays the plots of the first two traditional discriminant coordinates and the HR robust discriminant coordinates from the Iris data set with 5 outliers in a cluster. From these plots, the traditional discriminant coordinates identify the 5 outliers but do not separate the groups (first plot, Figure 4). The robust discriminant coordinates identify the 5 outliers and separate the groups (second and third plots, Figure 4).

[Figure 4: Iris Data (5 Outliers in a Cluster). Three panels, Least Squares, HR, and HR zoomed in.]

Figure 5 displays the plots of the first two traditional discriminant coordinates and the HR robust discriminant coordinates from the Iris data set with 3 outliers in group one and 2 outliers in group two. From these plots, the traditional discriminant coordinates do not separate the groups or identify the outliers (first plot, Figure 5), whereas the robust discriminant coordinates identify the 5 outliers and separate the groups (second and third plots, Figure 5).

[Figure 5: Iris Data (3 Outliers Grp 1 & 2 Outliers Grp 2). Three panels, Least Squares, HR, and HR zoomed in.]
Thus, in terms of separation, the HR discriminant procedure agrees with the traditional LS procedure on the original data. For the contaminated data, in all cases the HR procedure separates the groups and identifies all the outliers, whereas the LS procedure either fails to separate the groups or fails to identify all the outliers. In terms of separation, then, the HR procedure is robust.
5.2 Allocation
Table 1 displays the estimated probability of misclassification (PMC) for each variation of the Iris data set.

Table 1: Estimated PMCs for each variation of the Iris Data Set

    Data Set                               Least Squares    HR
    Original                               0.0267           0.0267
    1 Outlier                              0.4733           0.0333
    5 Outliers                             0.38             0.06
    5 Clustered Outliers                   0.52             0.06
    3 Outliers Grp 1 & 2 Outliers Grp 2    0.55             0.06

From the results presented in the table, when the Iris data set is not contaminated the PMC is the same
for the LS and HR procedures, but when contamination is added to the data set, the HR procedure has a much lower PMC. In terms of allocation, the HR procedure is robust. The outliers severely hampered the allocation ability of the LS procedure.
6 Simulation Results
In this section, we present the results of a Monte Carlo study which investigates the behavior of the nearest center rules of three procedures in terms of their TPMs over various error distributions. For our procedure, we chose the highly efficient robust procedure described in Section 4 based on the HR estimator. For comparison, we included the traditional procedure (LS) described in Section 2. In order to investigate our efficiency claims, as our third procedure we selected the high-breakdown procedure proposed by Hawkins and McLachlan (1997), which uses minimum covariance determinants (MCD) as an estimator of scatter. This procedure has high breakdown but low efficiency. Their procedure accommodates a certain percentage of outliers, and the estimates are based on the "inliers"; the outliers are the set of points which, when removed, minimize the within-group covariance determinant. For the simulation, we used 50% coverage, because this choice has the highest breakdown and the lowest efficiency.

For the simulation results presented in this section, we consider situations where there are two groups and four dimensions, i.e., $g = 2$ and $k = 4$. For the mean and variance matrices, we chose the sample mean and variance matrices of the beetle data on page 295 of Seber (1984). One thousand data sets were randomly generated from each of a variety of error distributions: the multivariate normal (MVN), the contaminated multivariate normal (CN), and the multivariate $t$. For each error distribution, fifty observations were generated for both the training and test data sets, with twenty-five observations randomly assigned to each group. The training data set was used to develop the linear classification rule, this rule was then used to classify the test data set, and the probability of misclassification was recorded.

The empirical TPM was used as a benchmark for the performance of each procedure. To calculate the empirical TPM, we used the rule

Assign $x_i$ to group 1 if $\dfrac{f(x_i \mid G_1)}{f(x_i \mid G_2)} > 1$, $i = 1, \ldots, n$,

to classify the data; the empirical TPM is then the proportion misclassified. At the multivariate normal, the true TPM is $\mathrm{TPM} = \Phi(-\Delta/2)$, where $\Delta^2 = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)$ is the squared Mahalanobis distance between the group means. Table 2 displays the empirical TPM for each of the distributions used in the simulation; for the multivariate normal, the true TPM is presented in parentheses.
Table 2: Empirical TPM

    Distribution               Empirical TPM
    MVN                        0.1251 (true TPM 0.1142)
    CN ε = 0.10, σ² = 9        0.1560
    CN ε = 0.20, σ² = 9        0.1791
    CN ε = 0.10, σ² = 25       0.1487
    CN ε = 0.20, σ² = 25       0.1904
    CN ε = 0.10, σ² = 100      0.1665
    CN ε = 0.20, σ² = 100      0.1996
    t, df = 1                  0.2278
    t, df = 2                  0.1854
    t, df = 3                  0.1546
    t, df = 4                  0.1545
    t, df = 5                  0.1530
Table 3 displays 84% confidence intervals for the probabilities of misclassification (PMC) of the three procedures. The 84% confidence intervals were chosen because, for a two-sample analysis based on one-sample confidence intervals, 84% one-sample confidence intervals yield roughly a 95% two-sample confidence interval; see Section 1.12 of Hettmansperger and McKean (1998).

Table 3: Simulated PMCs for the Procedures

    Distribution              LS                 HR                 Hawkins (HM)
    MVN                       (0.1365, 0.1410)   (0.1426, 0.1473)   (0.2323, 0.2411)
    CN ε = 0.10, σ² = 9       (0.1684, 0.1737)   (0.1643, 0.1692)   (0.2436, 0.2520)
    CN ε = 0.20, σ² = 9       (0.1970, 0.2027)   (0.1875, 0.1926)   (0.2477, 0.2552)
    CN ε = 0.10, σ² = 25      (0.1919, 0.1978)   (0.1697, 0.1747)   (0.2449, 0.2530)
    CN ε = 0.20, σ² = 25      (0.2310, 0.2377)   (0.1986, 0.2038)   (0.2538, 0.2612)
    CN ε = 0.10, σ² = 100     (0.2356, 0.2434)   (0.1738, 0.1788)   (0.2484, 0.2564)
    CN ε = 0.20, σ² = 100     (0.2953, 0.3044)   (0.2074, 0.2126)   (0.2605, 0.2679)
    t, df = 1                 (0.3370, 0.3466)   (0.2411, 0.2467)   (0.2611, 0.2676)
    t, df = 2                 (0.2282, 0.2345)   (0.2008, 0.2059)   (0.2373, 0.2440)
    t, df = 3                 (0.1943, 0.1997)   (0.1853, 0.1902)   (0.2315, 0.2387)
    t, df = 4                 (0.1784, 0.1835)   (0.1758, 0.1808)   (0.2288, 0.2359)
    t, df = 5                 (0.1670, 0.1752)   (0.1665, 0.1717)   (0.2318, 0.2399)

From the results displayed in the table, the traditional procedure has the smallest PMC at the multivariate normal. In all the other cases, the HR procedure has lower PMCs than the LS procedure; in fact, in seven of these situations the confidence intervals do not overlap. Thus the HR procedure is more robust than the LS procedure for moderate to heavy contamination. For the comparison between the two robust procedures, the HR procedure always has a lower PMC than the HM procedure, and for the situations considered the confidence intervals of the two procedures never overlap. The HR procedure is more efficient than the HM procedure over all situations in this study, including the elliptical Cauchy distribution. In comparing the HM and LS procedures, the HM procedure has lower PMCs than the LS procedure for the heavy-tailed distributions.

Next, consider the comparison between the empirical TPMs of Table 2 and the simulated PMCs of Table 3. The HR values are generally much closer to the TPM values than are the values of the other procedures.
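For concreteness, here is a condensed sketch (ours, with hypothetical parameter values; the study itself used the beetle-data mean and covariance) of one replication of a contaminated-normal cell: generate independent training and test sets of 25 + 25 observations and record the test misclassification rate of a supplied rule. It assumes numpy imported as np, and `classify` is a hypothetical placeholder for any of the three procedures.

    def contaminated_normal(rng, n, mu, Sigma, eps=0.10, sigma2=9.0):
        # Elliptical contaminated normal: with probability eps the error is
        # drawn with covariance sigma2 * Sigma instead of Sigma.
        errs = rng.multivariate_normal(np.zeros(len(mu)), Sigma, size=n)
        scale = np.where(rng.random(n) < eps, np.sqrt(sigma2), 1.0)
        return mu + scale[:, None] * errs

    def one_replication(rng, mu1, mu2, Sigma, classify, n_per=25):
        # classify(X_train, y_train, X_test) -> predicted labels.
        def sample():
            X = np.vstack([contaminated_normal(rng, n_per, mu1, Sigma),
                           contaminated_normal(rng, n_per, mu2, Sigma)])
            return X, np.repeat([0, 1], n_per)
        (Xtr, ytr), (Xte, yte) = sample(), sample()
        return np.mean(classify(Xtr, ytr, Xte) != yte)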
7 Conclusion
In this paper, we have proposed a discriminant analysis based on efficient robust discriminant coordinates. Like the traditional analysis, this robust analysis is a two-stage process: separation and allocation. Maximizing a robust Lawley-Hotelling test based on robust estimates of group centers achieves the separation and produces the discriminant coordinates. The efficiency of the procedure follows from the power of the procedure to detect small differences in group centers; further, it has the same efficiency as the robust estimates used in the Lawley-Hotelling test. The robust discriminant coordinates can be used to visualize the data, as we demonstrated with examples. This visualization is much less sensitive to outliers than the visualization obtained from traditional discriminant coordinates. The robust discriminant coordinates can be further used to form nearest center rules for the allocation of new data to the groups.
Our procedure is generic in the sense that any robust fitting procedure can be used, provided its estimates are $\sqrt{n}$-consistent with an asymptotic linearity result and a consistent estimate of their variance-covariance matrix. The design matrix for the fitting is an incidence matrix, so highly efficient robust estimators are recommended, which results in the associated discriminant procedure being highly efficient. In this paper we used the affine equivariant estimator proposed by Hettmansperger and Randles (2002), but any highly efficient robust estimator could be used.

The examples that we presented show the robustness of the procedures on real data. On the original data, the HR robust procedure behaved similarly to the traditional LS procedure in terms of visualization and classification (PMC). However, when outliers were introduced into the Iris data, the results were quite different: the behavior of the robust procedure was quite similar to its behavior on the original data, but the traditional procedure's PMC rate changed from 3% to 48% (on average) and its visualization was quite poor.

In our Monte Carlo study, we investigated the behavior of the nearest center rules of three procedures in terms of misclassifications. The data were split into two sets, training and test, and empirical PMCs for the procedures were obtained for families of multivariate $t$- and contaminated multivariate normal distributions. We compared the highly efficient procedure (HR) described in Section 4 with the LS procedure and with the high-breakdown but low-efficiency procedure proposed by Hawkins and McLachlan (1997). The HR procedure was comparable to the LS procedure when the errors had a multivariate normal distribution, and it generally performed much better than the LS procedure for the heavier-tailed error distributions. Further, over all situations simulated, the HR procedure had lower empirical PMCs than the high-breakdown HM procedure.

In summary, the discriminant procedures that we have proposed form an attractive robust alternative to the traditional procedure. The procedures are highly efficient relative to the traditional procedure and they are quick to compute. Further, they produce robust discriminant coordinates which allow the user to visually explore the data and assess the differences among groups.
Acknowledgment

The authors thank the associate editor and a referee whose comments led to an improvement of this paper.
References

Davis, J. B. and McKean, J. W. (1993), Rank-based methods for multivariate linear models, Journal of the American Statistical Association, 88, 245-251.

Fisher, R. A. (1936), The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7, 179-188.

Flury, B. and Riedwyl, H. (1988), Multivariate Statistics: A Practical Approach, London: Chapman and Hall.

Gnanadesikan, R. (1977), Methods for Statistical Analysis of Multivariate Observations, New York: John Wiley & Sons.

Hawkins, D. M. and McLachlan, G. J. (1997), High-breakdown linear discriminant analysis, Journal of the American Statistical Association, 92, 136-143.

Hettmansperger, T. P. and McKean, J. W. (1998), Robust Nonparametric Statistical Methods, London: Arnold.

Hettmansperger, T. P. and Randles, R. H. (2002), A practical affine equivariant multivariate median, Biometrika, 89, 851-860.

Jaeckel, L. A. (1972), Estimating regression coefficients by minimizing the dispersion of the residuals, Annals of Mathematical Statistics, 43, 1449-1458.

Johnson, R. A. and Wichern, D. W. (1998), Applied Multivariate Statistical Analysis, 4th ed., Upper Saddle River, NJ: Prentice Hall.

Randles, R. H., Broffitt, J. D., Ramberg, J. S., and Hogg, R. V. (1978), Generalized linear and quadratic discriminant functions using robust estimates, Journal of the American Statistical Association, 73, 564-568.

Rousseeuw, P. J. and Leroy, A. M. (1987), Robust Regression and Outlier Detection, New York: John Wiley & Sons.

Seber, G. A. F. (1984), Multivariate Observations, New York: John Wiley & Sons.

Tyler, D. E. (1987), A distribution-free M-estimator of scatter, Annals of Statistics, 15, 234-251.