Computational Statistics & Data Analysis 8 (1989) 325-332
North-Holland

Assessing the accuracy of ANOVA calculations in statistical software

Stephen D. SIMON *
Bowling Green State University, Bowling Green, OH 43403, USA

James P. LESAGE *
The University of Toledo, Toledo, OH 43606, USA

Received August 1988
Revised April 1989

Abstract: In this paper, we propose a flexible benchmark for measuring the accuracy of ANOVA calculations. The benchmark allows us to control the number of factors, the number of levels within each factor, and the number of observations within each cell. An additional parameter controls how close to constant the data within a cell is. The findings from using this benchmark to test three major mainframe statistical packages indicate that all three packages ignore ill-conditioning that occurs when the data grows more nearly constant. The packages print out highly inaccurate results without a warning to the user. We propose a simple diagnostic, CD, which measures the number of constant digits in a data set and which would detect highly ill-conditioned data sets before they are analyzed.

Keywords: Accumulation error, Cancellation error, Ill-conditioning, Benchmark data set.

1. Introduction

Most users of statistical packages assume that the packages are designed to analyze most data sets accurately and to provide warnings when the data set is so ill-conditioned that accurate results cannot be computed. Application of a benchmark procedure proposed here indicates that this is not true for ANOVA calculations in three widely used mainframe statistical packages: SAS, SPSSX, and BMDP. These packages were found to be highly inaccurate when faced with ill-conditioning arising from situations where the data in each cell is nearly constant. In this situation, all three of the packages produce intolerably inaccurate results and provide no warning to the user. This result parallels results in Simon and Lesage (1988), where the regression procedures in SAS and SPSSX were found to ignore the same type of ill-conditioning, producing highly inaccurate results with no warning for the user.

In this paper, we present a flexible benchmark data set for measuring the accuracy of ANOVA calculations. This benchmark is useful for one, two, or multifactor ANOVA procedures. The benchmark allows control over the number of levels within a factor and the number of observations per cell. In addition, we can control the extent to which the data in each cell becomes more nearly constant, increasing the ill-conditioning of the benchmark data set. We apply the benchmark to one factor ANOVA procedures in SAS, SPSSX, and BMDP. For highly ill-conditioned data sets, all three packages produce grossly inaccurate results with no warning provided to the user. We suggest a simple diagnostic check to screen out data sets that tend to produce inaccurate results. The proposed diagnostic, related to the coefficient of variation, measures the number of digits that are constant for values in a data set.

The paper proceeds as follows. Section 2 discusses the ANOVA benchmark; section 3 presents accuracy results for the three major packages and proposes a simple diagnostic check. Section 4 draws some conclusions.

* The authors would like to thank two anonymous referees for helpful comments on an earlier draft of this paper.
0167-9473/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)

2. The ANOVA benchmark

In this section, we present a flexible benchmark for ANOVA calculations. We present versions of the benchmark for one and two factor ANOVA with balanced data sets. Modifications for multifactor ANOVA and for unbalanced ANOVA are also suggested. The simplest version of this benchmark is for one factor ANOVA, balanced data. Let

    X_ij = γ + φ(i) + φ(j),    i = 1, ..., I,   j = 1, ..., J,    (1)

where I and J are odd integers greater than or equal to three and where

    φ(k) = 0.2   if k = 1,
           0.1   if k = 2, 4, 6, ...,
           0.3   if k = 3, 5, 7, ....    (2)

In this benchmark, the first subscript indicates which group the data value belongs to. The ith group has mean γ + 0.2 + φ(i) and variance 0.01. Some calculation will show that the Mean Square Between Treatments (MSTR) is 0.01J and that the Mean Square Within Treatments (MSE) is 0.01, making the F-ratio equal to J.

The benchmark for two factor ANOVA, balanced data, is similar. Let

    X_ijk = γ + φ(i) + φ(j) + φ(k),    i = 1, ..., I,   j = 1, ..., J,   k = 1, ..., K,    (3)

where I denotes the number of levels of the first factor, J is the number of levels of the second factor, and K represents the number of observations per cell.
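The claimed properties of the one factor benchmark (MSTR = 0.01J, MSE = 0.01, F = J) can be checked in exact rational arithmetic. The following sketch is ours, not part of the original paper, and the function names are ours; it generates expression (1) and applies the textbook sums of squares:

```python
from fractions import Fraction

def phi(k):
    # The function phi(k) of expression (2)
    if k == 1:
        return Fraction(2, 10)
    return Fraction(1, 10) if k % 2 == 0 else Fraction(3, 10)

def one_factor_benchmark(I, J, gamma):
    # X_ij = gamma + phi(i) + phi(j), expression (1), as exact rationals
    return [[gamma + phi(i) + phi(j) for j in range(1, J + 1)]
            for i in range(1, I + 1)]

def one_factor_anova(x):
    # Textbook sums of squares; exact arithmetic, so no rounding error
    I, J = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / Fraction(I * J)
    means = [sum(row) / Fraction(J) for row in x]
    mstr = Fraction(J) * sum((m - grand) ** 2 for m in means) / (I - 1)
    mse = sum((v - m) ** 2 for row, m in zip(x, means)
              for v in row) / (I * (J - 1))
    return mstr, mse

mstr, mse = one_factor_anova(one_factor_benchmark(5, 7, Fraction(1000)))
print(mstr, mse, mstr / mse)   # 7/100 1/100 7, i.e. MSTR = 0.01J, MSE = 0.01, F = J
```

Because the computation is exact, any discrepancy a package shows against these values is attributable to its own floating-point arithmetic, not to the benchmark.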


Table 1
Example benchmark data sets.

Example 1 - One factor ANOVA, balanced data, I = 5, J = 7, γ = 1000

  i=1       i=2       i=3       i=4       i=5
  1000.4    1000.3    1000.5    1000.3    1000.5
  1000.3    1000.2    1000.4    1000.2    1000.4
  1000.5    1000.4    1000.6    1000.4    1000.6
  1000.3    1000.2    1000.4    1000.2    1000.4
  1000.5    1000.4    1000.6    1000.4    1000.6
  1000.3    1000.2    1000.4    1000.2    1000.4
  1000.5    1000.4    1000.6    1000.4    1000.6

Example 2 - Two factor ANOVA, balanced data, I = 3, J = 3, K = 5, γ = 10

  i=1    i=1    i=1    i=2    i=2    i=2    i=3    i=3    i=3
  j=1    j=2    j=3    j=1    j=2    j=3    j=1    j=2    j=3
  10.6   10.5   10.7   10.5   10.4   10.6   10.7   10.6   10.8
  10.5   10.4   10.6   10.4   10.3   10.5   10.6   10.5   10.7
  10.7   10.6   10.8   10.6   10.5   10.7   10.8   10.7   10.9
  10.5   10.4   10.6   10.4   10.3   10.5   10.6   10.5   10.7
  10.7   10.6   10.8   10.6   10.5   10.7   10.8   10.7   10.9

Example 3 - One factor ANOVA, unbalanced data, I = 5, J = 7, γ = 100

  i=1      i=2      i=3      i=4      i=5
  100.4    100.3    100.5    100.3    100.5
  100.3    100.2    100.4    100.4    100.4
  100.5    100.4    100.6    100.2    100.6
  100.3                      100.4    100.4
  100.5                               100.6
  100.3
  100.5

I, J, and K are all odd integers greater than or equal to three. The function φ is defined as in expression (2). In this benchmark, the Mean Squares for the first and second factors are 0.01JK and 0.01IK respectively, while MSE equals 0.01. The mean square for interaction is zero. Examples of the benchmark for one and two factor ANOVA appear in Table 1.

The extension to multifactor ANOVA is straightforward. The Mean Square for each main effect is 0.01 times the number of observations at each level of that effect. The Mean Square for all two-way and higher interactions is zero, while MSE is 0.01. A third example in Table 1 illustrates how this benchmark might be adapted to unbalanced data.

This benchmark has the flexibility to explore inaccuracies from two distinct sources: cancellation error and accumulation error. The parameter γ controls cancellation error. As γ increases, the data grows more nearly constant. Accurate computation of sums of squares becomes more difficult, because subtracting


treatment means or the overall mean from the data produces large cancellation error. The remaining parameters control accumulation error. As we increase the number of levels of a factor or the number of observations within a cell, we also increase the total number of required arithmetic computations. This increases the accumulation of small errors, making accurate computations difficult.
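The cancellation mechanism is easy to reproduce outside any package. The sketch below (ours, not from the paper) computes MSE for the one factor benchmark in double precision two ways: with the one-pass "calculator" formula SS = sum(X^2) - (sum X)^2 / n, which subtracts two nearly equal large numbers when the data is nearly constant, and with the two-pass deviations-from-the-mean formula:

```python
def phi(k):
    # The function phi(k) of expression (2)
    if k == 1:
        return 0.2
    return 0.1 if k % 2 == 0 else 0.3

def benchmark(I, J, gamma):
    # One factor benchmark of expression (1) in ordinary floating point
    return [[gamma + phi(i) + phi(j) for j in range(1, J + 1)]
            for i in range(1, I + 1)]

def mse_one_pass(x):
    # Calculator formula within each group: catastrophic cancellation
    # once gamma dwarfs the 0.1-sized variation in the data
    I, J = len(x), len(x[0])
    ss = sum(sum(v * v for v in row) - sum(row) ** 2 / J for row in x)
    return ss / (I * (J - 1))

def mse_two_pass(x):
    # Deviations from the group mean: the large constant gamma cancels
    # before anything is squared, so far fewer digits are lost
    I, J = len(x), len(x[0])
    ss = 0.0
    for row in x:
        m = sum(row) / J
        ss += sum((v - m) ** 2 for v in row)
    return ss / (I * (J - 1))

for gamma in (1e1, 1e5, 1e9):   # the exact MSE is 0.01 in every case
    x = benchmark(5, 7, gamma)
    print(gamma, mse_one_pass(x), mse_two_pass(x))
```

At γ = 1E+9 the one-pass value is wildly wrong in double precision while the two-pass value remains close to 0.01, which is the pattern of failure the benchmark is designed to expose.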

3. The accuracy of SAS, SPSSX, and BMDP

Tables 2 through 4 show an application of this benchmark to a variety of mainframe statistical packages. These packages (SAS, SPSSX, and BMDP) were run on an IBM 4381 computer system. We used the one factor ANOVA

Table 2
Results of the benchmark on the SPSS package. Digits of accuracy in MSTR, MSE, and the F-ratio are reported for the SPSS-X ONEWAY, BREAKDOWN, ANOVA, and MANOVA procedures (I = 9), with J = 21, 201, and 2001 and γ = 1E+1, 1E+3, 1E+5, 1E+7, and 1E+9. [Tabled values not reproduced.]
Notes: - denotes that the quantity displayed perfect accuracy for all digits displayed. ** denotes that the package labelled either MSTR or MSE as equal to zero. † denotes that the package reported at least one of the quantities MSTR, MSE, or F as being negative.


Table 3
Results of the benchmark on the SAS package. Digits of accuracy in MSTR, MSE, and the F-ratio are reported for PROC ANOVA and PROC GLM (I = 9), with J = 21, 201, and 2001 and γ = 1E+1, 1E+3, 1E+5, 1E+7, and 1E+9. [Tabled values not reproduced.]
Notes: - denotes that the quantity displayed perfect accuracy for all digits displayed. ** denotes that the package labelled either MSTR or MSE as equal to zero. † denotes that the package reported at least one of the quantities MSTR, MSE, or F as being negative.

benchmark with I = 9 groups and J = 21, 201, or 2001 observations within a group. The parameter γ was set at 1E+1, 1E+3, 1E+5, 1E+7, and 1E+9. All three packages produced unacceptably inaccurate results for extreme values of γ. BMDP (1V) produced an estimated MSTR of 3.6313 instead of 2.01 for the case (I = 9, J = 201, γ = 1E+5). SAS produced an F-ratio of 0.27 instead of 2001 for (I = 9, J = 2001, γ = 1E+7) in PROC ANOVA, and MSE = 6667.3684 instead of 0.01 for (I = 9, J = 2001, γ = 1E+7) in PROC GLM. Finally, the SPSSX BREAKDOWN procedure returned negative values for both MSTR and MSE (I = 9, J = 2001, γ = 1E+9). All these inaccuracies occurred with no warning to the user.

Table 4
Results of the benchmark on the BMDP package. Digits of accuracy in MSTR, MSE, and the F-ratio are reported (I = 9) for J = 21 and 201, with γ = 1E+1, 1E+3, 1E+5, 1E+7, and 1E+9. [Tabled values not reproduced.]
Notes: Storage problems at our installation prevented us from running cases with J = 2001. - denotes that the quantity displayed perfect accuracy for all digits displayed. ** denotes that the package labelled either MSTR or MSE as equal to zero. † denotes that the package reported at least one of the quantities MSTR, MSE, or F as being negative.


On the positive side, SPSSX provided at least three digits of accuracy for MSTR, MSE, and the F-ratio for all values tested in its ANOVA procedure. All other procedures in SAS, SPSSX, and BMDP showed at least one case (usually several) where calculations were accurate to less than two significant digits.

We admit that the data sets we use are extreme. A cautious data analyst would probably spot these sorts of data sets before performing these statistical procedures. Nonetheless, users should be aware that these packages do not warn of such problems; they would routinely process this data and return results of unacceptable accuracy. It should be noted that the algorithms are not what is being criticized here; rather, the diagnostic detection and warning mechanisms are at fault. There are always data sets that are so ill-conditioned as to prohibit accurate computation of results. In the face of this, two actions would seem acceptable. First, the package could set either MSTR or MSE or both equal to zero, implying that the data is for all practical purposes constant. Second, the package could print a warning message. Producing inaccurate results without warning, as shown in the above benchmark, is clearly an unacceptable alternative. Therefore, the proper focus should be on improving the ability to diagnose extremely ill-conditioned data sets.

In the following we propose a simple diagnostic which can be applied to an ANOVA model. The diagnostic check, CD or constant digits, is defined as

    CD = -log10 | R / MIN |    if R < | MIN |,
       = 0                     otherwise,    (4)

where R represents the range of a set of data and MIN denotes the minimum value of the data set. The fraction R/MIN measures the relative change from the smallest to the largest values in the data set. The negative logarithm puts this on an easily understood scale; CD measures the number of digits that are constant for all values from the minimum to the maximum.

The rationale for this formula is as follows. Cancellation error in computation of the deviations from the mean causes problems only when every data value is close to the mean. If even a few data values are much different from the mean, no serious error will occur. These few large (and accurately computed) deviations will swamp all the inaccurately computed small deviations. The range measures how close all the data values are to one another, and thus to the mean.

Note that for a data set with both positive and negative values, CD is zero. This is logical, since there are no constant digits in such a data set. Perhaps surprisingly, it is not possible for cancellation error to cause serious problems in such a data set. Cancellation error is defined as a subtraction involving two numbers roughly equal in magnitude (or the addition of two numbers roughly opposite in magnitude). In a data set with both positive and negative values, it is impossible for the mean to be roughly equal in magnitude to every value.

Table 5 lists a fictional data set to illustrate the use of CD. For each of the groups, we see that the first 3 or 4 significant digits are constant. The CD's are
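Expression (4) is a few lines of code. The sketch below is ours (the name `constant_digits` is introduced here, and the case R = 0, which expression (4) leaves undefined, is handled by returning infinity); it reproduces the CD value for group A of Table 5 and the growth of CD with γ in the benchmark:

```python
import math

def constant_digits(values):
    # CD of expression (4): R is the range, MIN the minimum value;
    # CD = 0 when the range is at least as large as |MIN|
    r = max(values) - min(values)
    mn = min(values)
    if r < abs(mn):
        return -math.log10(r / abs(mn)) if r > 0 else math.inf
    return 0.0

# Group A of Table 5: about 3.5 leading digits are constant
print(round(constant_digits([23.447, 23.445, 23.439, 23.441]), 1))   # 3.5

# Mixed signs: CD = 0, no matter how close together the values are
print(constant_digits([-0.001, 0.001]))   # 0.0

# For a benchmark cell, CD grows like 0.7 + log10(gamma)
for gamma in (1e1, 1e3, 1e5):
    cell = [gamma + 0.2 + p for p in (0.2, 0.1, 0.3, 0.1, 0.3, 0.1, 0.3)]
    print(round(constant_digits(cell), 1))
```

A package could run such a check on each cell (and on the cell means) before computing sums of squares and warn whenever CD approaches the working precision.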

Table 5
An illustration of CD used on fictional data.

            A          B          C
Data        23.447     23.522     23.298
            23.445     23.524     23.303
            23.439     23.516     23.297
            23.441     23.522     23.306
R           0.008      0.008      0.009
MIN         23.439     23.516     23.297
CD          3.5        3.5        3.4

3.5, 3.5, and 3.4 respectively, reflecting the fact that computation of deviations from the means (needed to obtain MSE) results in a cancellation error of more than 3 digits. This could pose a problem for a single precision package, since half of the original 6 to 7 digits of accuracy are lost in the initial computation of deviations from the mean. Since MSTR involves computation of deviations of the cell means, the CD of the cell means is also of interest. In the above example, the cell means are 23.443, 23.521, and 23.301. These means have CD = 2.0, indicating that accurate computation of MSTR in this example is less difficult than that of MSE.

For any value of γ larger than 10, CD is approximately 0.7 + log10 γ for each cell. The same approximation holds for the CD of the cell means. Thus, by increasing γ by a factor of 100, we increase the size of the cancellation error by 2 digits in the computation of both MSTR and MSE.

For data sets where CD would be large, we can replace the minimum value by the mean and the range by 6 times the standard deviation, obtaining

    CD = -log10 [ 6 (S / X̄) ] = -log10 [ 6 (CV) ],

where CV is the coefficient of variation. Thus CD is approximately related to the coefficient of variation. We prefer CD because of its simple interpretation.

Any diagnostic should also take the size of the data set into account. Clearly, an increase in the size of the data set increases the number of calculations, causing an increase in accumulation error. We have noticed a general pattern that increasing the size of the data by a factor of ten usually causes an additional loss of approximately 0.5 to 2.0 digits of accuracy. This is an area which merits further study.
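The two versions of the diagnostic can be compared on group A of Table 5. This sketch is ours; the factor 6 reflects treating the range as roughly six standard deviations:

```python
import math
import statistics

data = [23.447, 23.445, 23.439, 23.441]   # group A of Table 5

# Exact CD from expression (4)
r, mn = max(data) - min(data), min(data)
cd_exact = -math.log10(r / abs(mn))

# CV-based approximation: CD ~ -log10(6 * S / mean) = -log10(6 * CV)
cv = statistics.stdev(data) / statistics.mean(data)
cd_approx = -math.log10(6 * cv)

print(round(cd_exact, 1), round(cd_approx, 1))   # 3.5 3.0
```

The two values agree to within about half a digit here; since CD is only used as a coarse screen for ill-conditioning, either form would flag the same problem cases.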

4. Conclusion

A simple benchmark for assessing the numerical accuracy of ANOVA procedures was proposed. This benchmark is extremely flexible, allowing the number of


factors, the number of levels within a factor, and the number of observations to be controlled. Applying the benchmark to one factor ANOVA procedures in SAS, SPSSX, and BMDP, we found that these packages will produce results with unacceptably low levels of accuracy. These results are produced with no warning to the user. A simple diagnostic check was proposed for the type of problematical data sets examined here. The diagnostic check would allow the packages to detect these ill-conditioned ANOVA problems and incorporate a warning for the user.

References

Belsley, D.A. (1984), Demeaning conditioning diagnostics through centering, The American Statistician 38, 73-77.

Simon, S.D. and J.P. Lesage (1988), The impact of collinearity involving the intercept term on the numerical accuracy of regression, Computational Statistics in Economics and Management Science 1, 137-152.
