TESTS FOR MEAN EQUALITY THAT DO NOT REQUIRE HOMOGENEITY OF VARIANCES: DO THEY REALLY WORK?

H. J. Keselman                        Rand R. Wilcox
University of Manitoba                University of Southern California
Winnipeg, Manitoba                    Los Angeles, California
Canada R3T 2N2                        90089-1061

Jason Taylor                          Rhonda K. Kowalchuk
University of Manitoba                University of Manitoba
Winnipeg, Manitoba                    Winnipeg, Manitoba
Canada R3T 2N2                        Canada R3T 2N2

Key Words: Tests for Mean Equality; Variance Heterogeneity; Nonnormality; Monte Carlo; Robust Estimators

ABSTRACT

Tests for mean equality proposed by Weerahandi (1995) and Chen and Chen (1998), tests that do not require equality of population variances, were examined when data were not only heterogeneous but also nonnormal, in unbalanced completely randomized designs. Furthermore, these tests were compared to a test examined by Lix and Keselman (1998), a test that uses a heteroscedastic statistic (i.e., Welch, 1951) with robust estimators (20% trimmed means and Winsorized variances). Our findings confirmed previously published results that the tests are indeed robust to variance heterogeneity when the data are obtained from normal populations. However, the Weerahandi (1995) and Chen and Chen (1998) tests were not found to be robust when data were obtained from nonnormal populations. Indeed, rates of Type I error were typically in excess of 10% and, at times, exceeded 50%. On the other hand, the statistic presented by Lix and Keselman (1998) was generally robust to variance heterogeneity and nonnormality.

1. INTRODUCTION

The Behrens-Fisher problem (see Fisher, 1935) refers to the problem of testing for mean equality in the presence of variance heterogeneity. This problem was originally discussed within the context of a two-group layout but has also been extended to the many-group layout. For example, Welch (1951), James (1951, 1954), and Brown and Forsythe (1974), among others (see Gamage & Weerahandi, 1998; Lix & Keselman, 1995), have presented approximate test statistics for testing for mean equality when there are more than two groups and when population variances are not presumed to be equal. Popular methods (e.g., Welch's) have been found to be generally robust to variance heterogeneity under normality, but not when data are nonnormal (see Lix & Keselman, 1995). For completeness we note that in addition to the popular approximate methods, other solutions to the problem have been presented. For example, transformations of the data, nonparametric tests, as well as tests based on robust estimators (e.g., trimmed means and Winsorized variances) have also been proposed. Unfortunately, these procedures have not proven to be uniformly successful in controlling test size (α) when data are heterogeneous as well as nonnormal, particularly in unbalanced designs.

Two recent solutions to this problem have been presented by Weerahandi (1995) and Chen and Chen (1998). These authors have derived test statistics that test for mean equality without requiring that population variances be equal. Thus, researchers may be able to use either of these procedures to test for mean equality and be confident that the test size will not be distorted by heterogeneous variances, a condition believed to characterize applied data (see Wilcox, 1997). Unfortunately, the data presented regarding the operating characteristics of these two test statistics are extremely limited. Gamage and Weerahandi (1998) presented Type I error results indicating that the Weerahandi (1995) procedure is robust to nonnormality and variance heterogeneity in unbalanced designs. However, they investigated only a one-way design containing three treatment groups, with a limited number of unequal sample sizes and variances, for one type of nonnormal distribution (gamma). Chen and Chen (1998) in their investigation reported only power data for their test. Accordingly, the purpose of our investigation was to examine in detail the test statistics presented by these authors. In addition, we compared these procedures to a test examined by Lix and Keselman (1998); namely, a heteroscedastic statistic (i.e., Welch, 1951) that uses trimmed means and Winsorized variances, as suggested by Yuen (1974).

2. DEFINITION OF THE TEST STATISTICS

Suppose nj independent random observations X1j, X2j, …, Xnjj are sampled from population j (j = 1, …, J). We assume that the Xij (i = 1, …, nj; Σj nj = N) are obtained from a normal population with mean μj and unknown variance σj², with σj² ≠ σj′² (j ≠ j′). Then let X̄j = Σi Xij/nj and sj² = Σi (Xij − X̄j)²/nj. [Gamage and Weerahandi (1998) defined the sample variance with nj − 1 in the denominator, while Weerahandi (1995) used nj; to replicate the Gamage and Weerahandi findings, however, the denominator needed to be nj.] The usual less-than-full-rank model Xij = μ + αj + εij can be applied to the problem at hand, where the εij are assumed to be independent random variables with εij ~ N(0, σj²) and Σ_{j=1}^{J} αj = 0. Thus the null hypothesis can be expressed as either H0: α1 = α2 = ⋯ = αJ or H0: μ1 = μ2 = ⋯ = μJ.

2.1 Generalized F-Test (Weerahandi, 1995). According to Weerahandi, his generalized F-test is carried out by determining a generalized p-value, which is then compared to the nominal significance level to determine whether the null hypothesis of mean equality can be rejected. To determine the generalized p-value one first computes a standardized between-group sum of squares, S̃b, where

    S̃b = S̃b(σ1², …, σJ²) = Σ_{j=1}^{J} nj X̄j²/σj² − ( Σ_{j=1}^{J} nj X̄j/σj² )² / Σ_{j=1}^{J} nj/σj².   (2.1.1)

The generalized p-value is calculated as p = 1 − q, where

    q = E{ H_{J−1, N−J}[ ((N − J)/(J − 1)) S̃b( n1 s1²/(B1 B2 ⋯ B_{J−1}), n2 s2²/((1 − B1) B2 ⋯ B_{J−1}), n3 s3²/((1 − B2) B3 ⋯ B_{J−1}), …, nJ sJ²/(1 − B_{J−1}) ) ] },   (2.1.2)

H_{J−1, N−J} is the cdf of the F distribution with J − 1 and N − J degrees of freedom, and the expectation is taken with respect to the independent Beta random variables

    Bk ~ Beta( Σ_{j=1}^{k} (nj − 1)/2, (n_{k+1} − 1)/2 ),  k = 1, 2, …, J − 1.   (2.1.3)
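As a concrete illustration, the generalized p-value defined by Equations (2.1.1)-(2.1.3) can be estimated by Monte Carlo simulation over the Beta variables. The sketch below is our own illustrative Python translation (the function name is ours, and NumPy and SciPy are assumed available), not code from the original authors:

```python
import numpy as np
from scipy.stats import f as f_dist

def generalized_p_value(groups, n_sims=5000, seed=0):
    """Monte Carlo estimate of the generalized p-value, Eqs (2.1.1)-(2.1.3).

    `groups` is a list of 1-D arrays.  Sample variances use the n_j
    denominator, the choice needed to replicate Gamage and Weerahandi (1998).
    """
    rng = np.random.default_rng(seed)
    n = np.array([len(g) for g in groups])
    xbar = np.array([g.mean() for g in groups])
    s2 = np.array([g.var(ddof=0) for g in groups])   # divisor n_j
    J, N = len(groups), int(n.sum())

    def S_b(v):
        # Standardized between-group sum of squares, Eq (2.1.1),
        # evaluated at the variance surrogates v_1, ..., v_J.
        return (np.sum(n * xbar ** 2 / v)
                - np.sum(n * xbar / v) ** 2 / np.sum(n / v))

    # B_k ~ Beta(sum_{j<=k}(n_j - 1)/2, (n_{k+1} - 1)/2), Eq (2.1.3)
    a_par = np.cumsum(n - 1)[:-1] / 2.0
    b_par = (n[1:] - 1) / 2.0
    q = 0.0
    for _ in range(n_sims):
        B = rng.beta(a_par, b_par)
        b = np.empty(J)
        b[0] = np.prod(B)                      # B_1 B_2 ... B_{J-1}
        for j in range(1, J - 1):
            b[j] = (1 - B[j - 1]) * np.prod(B[j:])
        b[J - 1] = 1 - B[J - 2]                # 1 - B_{J-1}
        stat = (N - J) / (J - 1) * S_b(n * s2 / b)
        q += f_dist.cdf(stat, J - 1, N - J)    # H_{J-1, N-J}, Eq (2.1.2)
    return 1.0 - q / n_sims                    # p = 1 - q
```

When all group means coincide, S̃b is zero and the estimated p-value is one; widely separated means drive it toward zero.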

According to Weerahandi (1995) the p-value can be computed by numerical integration with respect to the Beta random variables or through Monte Carlo methods. He points out that when the number of simulations is large, the mean of the simulated probabilities will closely approximate the expected value. Interested readers can find a derivation of the method in Weerahandi (1995).

2.2 The Chen and Chen (1998) Method. The statistic presented by Chen and Chen (1998) is an exact single-stage analysis of variance type procedure (as opposed to two-stage procedures; see Bishop and Dudewicz, 1978) whose null distribution "is completely free of the unknown variances" (Chen and Chen, p. 644). Again assuming the previously defined model, this procedure uses the first nj − 1 (where nj ≥ 3) observations to define the sample mean (X̄j) and variance (s̃j²), i.e.,

    X̄j = Σ_{i=1}^{nj−1} Xij/(nj − 1), and  s̃j² = Σ_{i=1}^{nj−1} (Xij − X̄j)²/(nj − 2).

Weights for the observations are defined as

    Uj = 1/nj − (1/nj) √{ (nj − 1)^(−1) [ s̃(m)²/s̃j² − 1 ] },
    Vj = 1/nj + (1/nj) √{ (nj − 1) [ s̃(m)²/s̃j² − 1 ] },   (2.2.1)

where s̃(m)² is the maximum of s̃1², …, s̃J². Finally, a weighted sample mean is calculated as

    X̃.j = Σ_{i=1}^{nj} Wij Xij,

where Wij = Uj for 1 ≤ i ≤ nj − 1 and Wij = Vj for i = nj, and where Uj and Vj satisfy the following equations:

    (nj − 1) Uj + Vj = 1,   (2.2.2)

    (nj − 1) Uj² + Vj² = s̃(m)²/(nj s̃j²).   (2.2.3)
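To make the weighting scheme concrete, the following Python sketch (our own illustration; the function name is ours) evaluates the closed-form solution of Equations (2.2.2) and (2.2.3) for one group and forms its weighted mean:

```python
import numpy as np

def single_stage_weights(x, s2_max):
    """Weights U_j, V_j of Eq (2.2.1) and the weighted mean of one group.

    `x` holds the n_j >= 3 observations of group j; `s2_max` is the largest
    first-stage variance s~_(m)^2 across all J groups (so s2_max >= s~_j^2).
    """
    n = len(x)
    xbar = x[:n - 1].mean()                          # mean of first n_j - 1
    s2 = ((x[:n - 1] - xbar) ** 2).sum() / (n - 2)   # s~_j^2
    r = s2_max / s2 - 1.0                            # s~_(m)^2/s~_j^2 - 1
    U = (1.0 - np.sqrt(r / (n - 1))) / n             # Eq (2.2.1)
    V = (1.0 + np.sqrt(r * (n - 1))) / n
    w = np.full(n, U)
    w[n - 1] = V                                     # W_ij = V_j for i = n_j
    x_wtd = float(np.sum(w * x))                     # weighted mean X~_.j
    return U, V, x_wtd
```

The returned weights satisfy constraints (2.2.2) and (2.2.3) exactly, which is easy to verify numerically.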

Chen and Chen (1998) indicate that the transformation

    tj = (X̃.j − μj) / √( s̃j² Σ_{i=1}^{nj} Wij² )

has a conditional normal distribution, given s̃j², with mean zero and variance σj²/s̃j². They also show (p. 646) that, unconditionally, the tj are independent Student t variables with nj − 2 degrees of freedom. An equivalent version of tj (their Equation 3) is

    tj = (X̃.j − μj) / ( s̃(m)/√nj ).

To test H0, Chen and Chen (1998) suggest the statistic

    F̃¹ = Σ_{j=1}^{J} [ (X̃.j − X̃..) / ( s̃(m)/√nj ) ]²,   (2.2.4)

where X̃.. = Σ_{j=1}^{J} X̃.j / J. According to Chen and Chen, one would reject H0 when F̃¹ > F̃¹_{α; J, n−2}, the upper α percentage point of the null distribution of F̃¹ (based on a balanced design with common group size n). A SAS (Version 6.12) computer program that provides critical values for both balanced and unbalanced designs can be obtained from the authors.

2.3 Lix and Keselman's (1998) Procedure. Lix and Keselman (1998) and Wilcox, Keselman and Kowalchuk (1998) have shown how to obtain a robust test of location equality in unbalanced one-way layouts when the underlying data are neither normal in form nor equal in variability. The heteroscedastic statistic used by Lix and Keselman (1998) and Wilcox, Keselman, and Kowalchuk (1998) is due to Welch (1951). The statistic can be defined as

    [ Σ_{j=1}^{J} wj (X̄j − X̃)² / (J − 1) ] / [ 1 + (2(J − 2)/(J² − 1)) Σ_{j=1}^{J} (1 − wj/W)²/(nj − 1) ],   (2.3.1)

where wj = nj/sj², X̃ = Σ_{j=1}^{J} wj X̄j / W, W = Σ_{j=1}^{J} wj, X̄j = Σi Xij/nj, and sj² = Σi (Xij − X̄j)²/(nj − 1); here X̄j is the estimate of μj and sj² is the usual unbiased estimate of the variance for population j. The test statistic is approximately distributed as an F variate and is referred to the critical value F[(1 − α); (J − 1), ν], the (1 − α)-centile of the F distribution, where the error degrees of freedom are obtained from

    ν = (J² − 1) / [ 3 Σ_{j=1}^{J} (1 − wj/W)²/(nj − 1) ].   (2.3.2)

Yuen (1974) initially suggested that trimmed means and variances based on Winsorized sums of squares be used in conjunction with Welch's (1938) two-sample statistic. For heavy-tailed symmetric distributions, Yuen showed that the statistic based on these robust estimators could adequately control the rate of Type I errors and resulted in greater power than a statistic based on the usual mean and variance. While a wide range of robust estimators have been proposed in the literature (see Gross, 1976), the trimmed mean and Winsorized variance are intuitively appealing because of their computational simplicity and good theoretical properties (Wilcox, 1995a). In particular, while the standard error of the usual mean can become seriously inflated when the underlying distribution has heavy tails (Tukey, 1960), the standard error of the trimmed mean is less affected by departures from normality because extreme observations, that is, observations in the tails of a distribution, are removed. Furthermore, as Gross (1976) notes, "the Winsorized variance is a consistent estimator of the variance of the corresponding trimmed mean" (p. 410). In computing the Winsorized variance, the most extreme observations are replaced with less extreme values in the distribution of scores.

While the trimmed mean has been shown to be highly effective, we caution the reader that this measure should be adopted only if one is interested in testing for treatment effects across groups using a measure of location that more accurately reflects the typical score within a group when working with heavy-tailed distributions. As an illustration of how a trimmed mean may provide a better estimate of the typical score than the usual mean, consider the example given by Wilcox (1995a, p. 57), in which a single score from a chi-square distribution with four df (hence μ = 4) is multiplied by 10 with probability .1. This contaminated chi-square distribution has a population mean of 7.6, a value close to the upper tail of the distribution. However, its 20% population trimmed mean is 4.2, a value closer to the bulk of scores, hence closer to the typical score in the distribution.

Lix and Keselman (1998) and Wilcox, Keselman, and Kowalchuk (1998) replace the hypothesis of equal means with H0: μt1 = μt2 = ⋯ = μtJ, the hypothesis of equal trimmed means. Let X(1)j ≤ X(2)j ≤ ⋯ ≤ X(nj)j represent the ordered observations associated with the jth group, and let gj = [γ nj], where γ represents the proportion of observations to be trimmed in each tail of the distribution. For reasons summarized by Wilcox (1995a,b), 20% trimming (γ = .2) is used here. The effective sample size for the jth group becomes hj = nj − 2gj.
The jth sample trimmed mean is

    X̄tj = (1/hj) Σ_{i=gj+1}^{nj−gj} X(i)j,   (2.3.3)

and the jth sample Winsorized mean is

    X̄wj = (1/nj) Σ_{i=1}^{nj} Yij,

where

    Yij = X(gj+1)j   if Xij ≤ X(gj+1)j,
        = Xij        if X(gj+1)j < Xij < X(nj−gj)j,
        = X(nj−gj)j  if Xij ≥ X(nj−gj)j.

The sample Winsorized variance is

    swj² = (1/(nj − 1)) Σ_{i=1}^{nj} (Yij − X̄wj)²,   (2.3.4)

and

    s̃wj² = (nj − 1) swj² / [ hj (hj − 1) ]   (2.3.5)

estimates the squared standard error of the sample trimmed mean (see Wilcox, 1996). Thus, with robust estimation, the trimmed group means (X̄tj) replace the least squares group means (X̄j), the Winsorized variance estimators (s̃wj²) replace the least squares variances (sj²), and Σj hj replaces N, in the statistics and their df. That is, Equations (2.3.1) and (2.3.2) become

    Ft = [ Σ_{j=1}^{J} wtj (X̄tj − X̃t)² / (J − 1) ] / [ 1 + (2(J − 2)/(J² − 1)) Σ_{j=1}^{J} (1 − wtj/Wt)²/(hj − 1) ],   (2.3.6)

where wtj = hj/s̃wj², X̃t = Σ_{j=1}^{J} wtj X̄tj / Wt, and Wt = Σ_{j=1}^{J} wtj, and where νt is estimated by

    νt = (J² − 1) / [ 3 Σ_{j=1}^{J} (1 − wtj/Wt)²/(hj − 1) ].   (2.3.7)

3. METHOD

Four variables were manipulated in the study: (a) number of groups (4 and 6), (b) sample size (two cases), (c) population distribution (five distributions: one normal and four nonnormal), and (d) degree and pattern of variance heterogeneity (moderate and large; all or mostly unequal versus all but one equal). Variances and group sizes were both positively and negatively paired. Table I contains the numerical values of the sample sizes and variances investigated in this study.

Table I
Sample Size and Variance Conditions

CON   Sample Sizes (Two Cases)                        Population Variances
A     10, 15, 20, 25;  15, 20, 25, 30                1, 4, 9, 16
B     10, 15, 20, 25;  15, 20, 25, 30                1, 1, 1, 36
C     10, 15, 20, 25;  15, 20, 25, 30                16, 9, 4, 1
D     10, 15, 20, 25;  15, 20, 25, 30                36, 1, 1, 1
E     10, 15(2), 20(2), 25;  15, 20(2), 25(2), 30    1(2), 4, 9(2), 16
F     10, 15(2), 20(2), 25;  15, 20(2), 25(2), 30    1(5), 36
G     10, 15(2), 20(2), 25;  15, 20(2), 25(2), 30    16, 9(2), 4, 1(2)
H     10, 15(2), 20(2), 25;  15, 20(2), 25(2), 30    36, 1(5)

Note. A value in parentheses indicates the number of groups taking that value.

As indicated, we investigated one-way designs having four and six groups. For each design size, two sample size cases were investigated. In our unbalanced designs, the smaller of the two cases investigated for each design had an average group size of less than 20, while the larger case in each design had an average group size of at least 20.

With respect to the effects of distributional shape on Type I error, we chose to investigate conditions in which the statistics were likely to be prone to an excessive number of Type I errors, as well as a normally distributed case. Thus, we generated data from four skewed distributions. Specifically, we sampled from χ²₆ and χ²₃ distributions, and we also used the method described in Hoaglin (1985) to generate distributions with more extreme degrees of skewness and kurtosis. These particular types of nonnormal distributions were selected since data obtained in applied settings (e.g., behavioral science data) typically have skewed distributions (Micceri, 1989; Wilcox, 1994a, 1994b, 1995a,b). Furthermore, Sawilowsky and Blair (1992) investigated the effects of eight nonnormal distributions identified by Micceri on the robustness of Student's t test and found that only the distributions with the most extreme degrees of skewness investigated (e.g., γ1 = 1.64) affected the Type I error control of the independent sample t statistic. Thus, since the statistics we investigated have operating characteristics similar to those reported for the t statistic, we felt that our approach to modeling skewed data would adequately reflect conditions in which those statistics might not perform optimally. For the χ²₃ distribution, skewness and kurtosis values are γ1 = 1.63 and γ2 = 4.00, respectively (the corresponding values for the χ²₆ data are γ1 = 1.15 and γ2 = 2.0) (see Table II). Accordingly, our simulated χ²₃ distribution mirrors data found in behavioral science experiments with regard to skewness.

The other types of nonnormal distributions were generated from the g- and h-distribution (Hoaglin, 1985). Specifically, we chose to investigate two g- and h-distributions: (a) a g = 1 and h = 0 distribution and (b) a g = 1 and h = .5 distribution. To give meaning to these values it should be noted that for the standard normal distribution g = h = 0. When g = 0 a distribution is symmetric, and the tails of the distribution become heavier as h increases.
Values of skewness and kurtosis corresponding to the investigated g- and h-distributions are (a) γ1 = 6.2 and γ2 = 114, respectively, and (b) γ1 = γ2 = undefined (see Table II). Finally, it should be noted that though the selected combinations of g and h result in extremely skewed distributions, these values, according to Wilcox (1994a, 1994b, 1995a,b), are representative of measurements obtained in applied settings (e.g., psychometric measures). Moreover, as Wilcox (1995a) notes, if a procedure performs well over a wide range of simulation conditions, including extreme conditions, this suggests that its positive operating characteristics might hold over conditions not considered in the simulation, and thus reflects positively on the procedure's versatility.

Table II
Distributions Investigated and Their Properties

Distribution     Skewness    Kurtosis
Chi-square (6)   1.15        2.00
Chi-square (3)   1.63        4.00
g = 1 & h = 0    6.20        114
g = 1 & h = .5   Undefined   Undefined

As indicated, we both positively and negatively paired the group sizes and variances. For positive (negative) pairings, the group having the smallest number of observations was associated with the population having the smallest (largest) variance, while the group having the greatest number of observations was associated with the population having the greatest (smallest) variance. These conditions were chosen since they typically produce distorted Type I error rates.

To generate pseudo-random normal variates, we used the SAS generator RANNOR (SAS Institute, 1989). If Zij is a standard normal variate, then Xij = μj + (σj × Zij) is a normal variate with mean equal to μj and variance equal to σj². To generate pseudo-random variates having a χ² distribution with six (three) degrees of freedom, six (three) standard normal variates were squared and summed. The variates were standardized, and then transformed to χ²₆ or χ²₃ variates having mean μtj and variance σj² [see Hastings & Peacock (1975, pp. 46-51) for further details on the generation of data from this distribution]. To generate data from a g- and h-distribution, standard unit normal variables (Zij) were converted to the random variable

    Xij = { [exp(g Zij) − 1] / g } exp( h Zij²/2 ),

according to the values of g and h selected for investigation. To obtain a distribution with standard deviation σj, each Xij (j = 1, …, J) was multiplied by a value of σj obtainable from Table I. It is important to note that this does not affect the value of the null hypothesis when g = 0 (see Wilcox, 1994a, p. 297). However, when g > 0, the population mean for a g- and h-distributed variable is

    μgh = ( exp{ g²/[2(1 − h)] } − 1 ) / [ g (1 − h)^(1/2) ]

(see Hoaglin, 1985, p. 503). Thus, for those conditions where g > 0, μgh was first subtracted from Xij before multiplying by σj. Lastly, it should be noted that the standard deviation of a g- and h-distribution is not equal to one, and thus the values enumerated in Table I reflect only the amount by which each random variable was multiplied and not the actual values of the standard deviations (see Wilcox, 1994a, p. 298). As Wilcox notes, the values for the variances (standard deviations) in Table I more aptly reflect the ratio of the variances (standard deviations) between the groups.

Our simulation program was written in SAS/IML (SAS, 1989). One thousand replications of each condition were performed using a .05 significance level; Beta values within each simulation were based on 5,000 replications (simulations).
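The g- and h-transformation just described is easy to reproduce. The Python sketch below (our own illustration of Hoaglin's transformation, with the mean correction applied for g > 0; the function name is ours) converts standard normal deviates to g- and h-distributed deviates:

```python
import numpy as np

def g_and_h(z, g, h):
    """Transform standard normal deviates z to g- and h-deviates (Hoaglin, 1985).

    For g > 0 the population mean mu_gh (Hoaglin, 1985, p. 503; valid for
    h < 1) is subtracted so the variates are centered at zero, as in the
    simulation; any scale factor sigma_j would then be applied afterwards.
    """
    z = np.asarray(z, dtype=float)
    if g == 0:
        x = z * np.exp(h * z ** 2 / 2.0)            # symmetric case
    else:
        x = (np.exp(g * z) - 1.0) / g * np.exp(h * z ** 2 / 2.0)
        mu_gh = (np.exp(g ** 2 / (2.0 * (1.0 - h))) - 1.0) / (
            g * np.sqrt(1.0 - h))                   # population mean
        x = x - mu_gh
    return x
```

Setting g = h = 0 recovers the standard normal; g = 1, h = 0 yields a heavily skewed (shifted lognormal-type) distribution, and g = 1, h = .5 thickens the tails further.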

4. RESULTS

To evaluate the particular conditions under which a test was insensitive to assumption violations, Bradley's (1978) liberal criterion of robustness was employed. According to this criterion, in order for a test to be considered robust, its empirical rate of Type I error (α̂) must be contained in the interval 0.5α ≤ α̂ ≤ 1.5α. Therefore, for the five percent level of significance used in this study, a test was considered robust in a particular condition if its empirical rate of Type I error fell within the interval .025 ≤ α̂ ≤ .075. Correspondingly, a test was considered nonrobust if, for a particular condition, its Type I error rate was not contained in this interval. In the tables, these latter (nonrobust) values are marked. We chose this criterion since we feel that it provides a reasonable standard by which to judge robustness. That is, in our opinion, applied researchers should be comfortable working with a procedure that controls the rate of Type I error within these bounds across a wide range of assumption-violation conditions. Nonetheless, there is no single universal standard by which tests are judged to be robust, so different interpretations of the results are possible.

Tables III and IV contain empirical rates of Type I error for completely randomized designs containing four and six groups, respectively. The tabled data indicate that when the observations were obtained from normal distributions, rates of Type I error were controlled, as was reported by Gamage and Weerahandi (1998), Chen and Chen (1998), and Lix and Keselman (1998). However, our results also very clearly indicate that the generalized F-test examined by Gamage and Weerahandi (1998) (GF) and the Chen and Chen (1998) procedure (CC) cannot limit their rates of Type I error within Bradley's (1978) liberal limit when data are nonnormal. Indeed, even for our mildly skewed chi-square (6) distribution, rates were typically liberal, approaching values of approximately 10%. As the nonnormality of the sampled distribution increased, rates of error became progressively larger, attaining values in excess of 50%.


TABLE III
Empirical Rates of Type I Error (J = 4)

                  Normal             χ²(6)              χ²(3)              g = 1 & h = 0         g = 1 & h = .5
CON  nj    GF    CC    LK     GF    CC    LK     GF    CC    LK     GF     CC     LK      GF     CC     LK
A    10   .046  .050  .058   .057  .067  .051   .052  .100* .056   .087*  .140*  .061   .465*  .209*  .027
A    15   .042  .049  .052   .061  .074  .059   .067  .087* .056   .091*  .135*  .061   .472*  .212*  .026
B    10   .055  .049  .055   .079* .080* .059   .079* .094* .055   .113*  .173*  .051   .331*  .334*  .027
B    15   .057  .047  .060   .059  .067  .056   .065  .098* .057   .102*  .155*  .053   .361*  .344*  .027
C    10   .057  .055  .065   .101* .072  .063   .113* .097* .071   .217*  .171*  .086*  .601*  .284*  .026
C    15   .043  .046  .060   .074  .071  .060   .100* .101* .067   .172*  .132*  .079*  .575*  .227*  .028
D    10   .059  .047  .066   .098* .110* .059   .101* .087* .072   .154*  .188*  .075   .417*  .405*  .025
D    15   .046  .045  .057   .080* .090* .056   .078* .090* .063   .174*  .187*  .065   .413*  .366*  .029

Note. nj denotes the smallest group size, identifying the sample-size case (see Table I); GF = generalized F-test, CC = Chen and Chen test, LK = Lix and Keselman test. Entries marked * fall outside the interval .025 ≤ α̂ ≤ .075.

TABLE IV
Empirical Rates of Type I Error (J = 6)

                  Normal             χ²(6)              χ²(3)              g = 1 & h = 0         g = 1 & h = .5
CON  nj    GF    CC    LK     GF    CC    LK     GF    CC    LK     GF     CC     LK      GF     CC     LK
E    10   .061  .055  .070   .059  .076* .063   .076* .104* .061   .145*  .191*  .076*  .624*  .272*  .027
E    15   .056  .044  .057   .085* .067  .053   .072  .103* .059   .156*  .162*  .067   .638*  .219*  .026
F    10   .050  .053  .060   .064  .074  .055   .096* .122* .069   .123*  .208*  .058   .361*  .375*  .025
F    15   .054  .043  .061   .064  .082* .053   .088* .111* .067   .116*  .164*  .067   .315*  .331*  .030
G    10   .066  .043  .080*  .085* .083* .069   .107* .106* .082*  .208*  .179*  .098*  .696*  .254*  .024*
G    15   .056  .040  .069   .075  .076* .057   .104* .106* .070   .191*  .176*  .093*  .689*  .266*  .031
H    10   .044  .051  .074   .083* .097* .060   .082* .144* .067   .146*  .217*  .074   .386*  .393*  .025
H    15   .047  .041  .063   .076* .085* .055   .080* .117* .073   .136*  .213*  .070   .406*  .364*  .028

Note. nj denotes the smallest group size, identifying the sample-size case (see Table I); GF = generalized F-test, CC = Chen and Chen test, LK = Lix and Keselman test. Entries marked * fall outside the interval .025 ≤ α̂ ≤ .075.

The procedure presented by Lix and Keselman (1998) (LK), however, was, in most instances, able to limit its rate of Type I error to Bradley's (1978) interval. Indeed, out of the 80 investigated conditions, the test was liberal in just 7 cases (there was also one conservative value).

5. DISCUSSION

We were not surprised to find that the Weerahandi (1995) and Chen and Chen (1998) procedures were not robust to heterogeneity of variances when data were also nonnormal in unbalanced designs. To date, most test statistics intended to cope with the effects of variance heterogeneity have been found to lack robustness when heterogeneity of variances occurs with data that are also nonnormal, particularly when group sizes are unequal. As we indicated in our introduction, no procedure for testing mean equality has been found to be uniformly robust to assumption violations when they occur simultaneously. However, this unfortunate state of affairs relates only to test statistics that use least squares measures of central tendency and variability. On the other hand, our results, and those presented by others, indicate that researchers can generally, though not uniformly, obtain a robust test of treatment performance equality by substituting robust measures of central tendency and variability into heteroscedastic test statistics (see, e.g., Keselman, Kowalchuk & Lix, 1998; Keselman, Lix & Kowalchuk, 1998; Keselman & Wilcox, 1999; Lix & Keselman, 1998). That is, by substituting 20% trimmed means and Winsorized variances into, say, the Welch (1951) test, one typically can achieve robustness to both nonnormality and variance heterogeneity, even in unbalanced designs.
The benefits of using robust estimators, that is, 20% trimmed means and Winsorized variances, instead of least squares estimators to combat the effects of nonnormality have been discussed extensively (see, e.g., Keselman, Kowalchuk & Lix, 1998; Keselman, Lix & Kowalchuk, 1998; Keselman & Wilcox, 1999; Lix & Keselman, 1998; Wilcox, 1995a,b, 1997).

Finally, we note that we did not compare the power of the procedure presented by Lix and Keselman (1998) to those presented by Weerahandi (1995) and Chen and Chen (1998) because the latter procedures were not able to control their rates of Type I error; comparisons of power are meaningful only when the procedures being compared are capable of controlling their rates of Type I error. We should point out, however, that the power characteristics of statistics based on robust estimators can be predicted from theory and prior work (Lix and Keselman, 1998). That is, as previously indicated, theory tells us that procedures based on sample means can have poor power because the standard error of the mean is inflated when distributions have heavy tails; this is less of a problem when working with trimmed means (see Tukey, 1960; Wilcox, 1995b). This phenomenon is illustrated in a number of sources. For example, Wilcox (1994b, 1995b) has presented results indicating that in the two-sample and one-way problems, tests (i.e., t and F) based on the usual least squares estimators lose power when data contain outliers and/or are heavy tailed. Specifically, in the two-sample problem, Wilcox (1994b) compared the Welch (1938) and Yuen (1974) procedures and found that when data were obtained from contaminated normal distributions (distributions that have thicker tails than the normal), the power of Welch's test was considerably diminished relative to its sensitivity when data were normally distributed, and was also lower than that of Yuen's test. Indeed, the power of Welch's test to detect nonnull effects went from .931 when distributions were normal to .278 and .162 for the two contaminated normal distributions investigated; the corresponding power values for Yuen's test were .890, .784, and .602, respectively. Wilcox (1995b) presented similar results for four independent groups.
Readers should also refer to the data presented by Lix and Keselman (1998), who compared the power values of other independent-group statistics based on robust estimators.

ACKNOWLEDGEMENTS

This research was supported by a Natural Sciences and Engineering Research Council (Canada) grant.


BIBLIOGRAPHY

Bishop, T. A., and Dudewicz, E. J. (1978). "Exact analysis of variance with unequal variances: Test procedures and tables," Technometrics, 20, 419-430.
Bradley, J. V. (1978). "Robustness?" British Journal of Mathematical and Statistical Psychology, 31, 144-152.
Brown, M. B., and Forsythe, A. B. (1974). "The small sample behavior of some statistics which test the equality of several means," Technometrics, 16, 129-132.
Chen, S., and Chen, H. J. (1998). "Single-stage analysis of variance under heteroscedasticity," Communications in Statistics-Simulation and Computation, XX, 641-666.
Fisher, R. A. (1935). "The fiducial argument in statistical inference," Annals of Eugenics, 6, 391-398.
Gamage, J., and Weerahandi, S. (1998). "Size performance of some tests in one-way ANOVA," Communications in Statistics-Simulation and Computation, XX, 625-639.
Gross, A. M. (1976). "Confidence interval robustness with long-tailed symmetric distributions," Journal of the American Statistical Association, 71, 409-416.
Hastings, N. A. J., and Peacock, J. B. (1975). Statistical distributions: A handbook for students and practitioners. New York: Wiley.
Hoaglin, D. C. (1985). "Summarizing shape numerically: The g- and h-distributions," In D. Hoaglin, F. Mosteller, & J. Tukey (Eds.), Exploring data tables, trends, and shapes (pp. 461-513). New York: Wiley.
James, G. S. (1951). "The comparison of several groups of observations when the ratios of the population variances are unknown," Biometrika, 38, 324-329.
James, G. S. (1954). "Tests of linear hypotheses in univariate and multivariate analysis when the ratios of the population variances are unknown," Biometrika, 41, 19-43.
Keselman, H. J., Kowalchuk, R. K., and Lix, L. M. (1998). "Robust nonorthogonal analyses revisited: An update based on trimmed means," Psychometrika, 63, 145-163.
Keselman, H. J., Lix, L. M., and Kowalchuk, R. K. (1998). "Multiple comparison procedures for trimmed means," Psychological Methods, 3, 123-141.
Keselman, H. J., and Wilcox, R. R. (1999). "The 'improved' Brown and Forsythe test for mean equality: Some things can't be fixed," Communications in Statistics-Simulation and Computation, 28(3), 687-698.
Lix, L. M., and Keselman, H. J. (1995). "Approximate degrees of freedom tests: A unified perspective on testing for mean equality," Psychological Bulletin, 117, 547-560.
Lix, L. M., and Keselman, H. J. (1998). "To trim or not to trim: Tests of mean equality under heteroscedasticity and nonnormality," Educational and Psychological Measurement, 58, 409-429 (Errata: 58, 853).
Micceri, T. (1989). "The unicorn, the normal curve, and other improbable creatures," Psychological Bulletin, 105, 156-166.
SAS Institute Inc. (1989). SAS/IML software: Usage and reference, version 6 (1st ed.). Cary, NC: Author.
Sawilowsky, S. S., and Blair, R. C. (1992). "A more realistic look at the robustness and Type II error probabilities of the t test to departures from population normality," Psychological Bulletin, 111, 352-360.
Tukey, J. W. (1960). "A survey of sampling from contaminated normal distributions," In I. Olkin et al. (Eds.), Contributions to probability and statistics. Stanford, CA: Stanford University Press.
Weerahandi, S. (1995). "ANOVA under unequal error variances," Biometrics, 51, 589-599.
Welch, B. L. (1938). "The significance of the difference between two means when the population variances are unequal," Biometrika, 29, 350-362.
Welch, B. L. (1951). "On the comparison of several mean values: An alternative approach," Biometrika, 38, 330-336.
Wilcox, R. R. (1994a). "A one-way random effects model for trimmed means," Psychometrika, 59, 289-306.
Wilcox, R. R. (1994b). "Some results on the Tukey-McLaughlin and Yuen methods for trimmed means when distributions are skewed," Biometrical Journal, 36, 259-273.
Wilcox, R. R. (1995a). "ANOVA: A paradigm for low power and misleading measures of effect size?," Review of Educational Research, 65(1), 51-77.
Wilcox, R. R. (1995b). "ANOVA: The practical importance of heteroscedastic methods, using trimmed means versus means, and designing simulation studies," British Journal of Mathematical and Statistical Psychology, 48, 99-114.
Wilcox, R. R. (1996). Statistics for the social sciences. New York: Academic Press.
Wilcox, R. R. (1997). Introduction to robust estimation and hypothesis testing. New York: Academic Press.
Wilcox, R. R., Keselman, H. J., and Kowalchuk, R. K. (1998). "Can tests for treatment group equality be improved?: The bootstrap and trimmed means conjecture," British Journal of Mathematical and Statistical Psychology, 51, 123-134.
Yuen, K. K. (1974). "The two-sample trimmed t for unequal population variances," Biometrika, 61, 165-170.
