Computational Statistics & Data Analysis 7 (1988) 197-209 North-Holland


Benchmarking numerical accuracy of statistical algorithms

Stephen D. SIMON
Department of Applied Statistics and Operations Research, Bowling Green State University, Bowling Green, OH, USA

James P. LESAGE
Department of Economics, University of Toledo, Toledo, OH, USA

Abstract: In this paper, we discuss benchmark data sets proposed by Anscombe and Longley for measuring the numerical accuracy of statistical algorithms. We show that these benchmarks present an unduly optimistic assessment of numerical accuracy. We demonstrate that the cause of this unwarranted optimism is the use of integer values in the benchmarks. Alternative benchmarks are proposed which avoid the problems brought about by the integer values and provide a more realistic assessment of numerical accuracy under varying data conditions.

Keywords: Anscombe benchmark, Longley benchmark, standard deviation, multiple regression.

1. Introduction

The dramatic rise in statistical software for personal computers has increased the number of practitioners who perform sophisticated data analytic techniques. This has placed an unusual burden on the statistical community to validate the software with regard to numerical accuracy. Valid algorithms should provide accurate computations when faced with mildly ill-conditioned data, and they should provide a warning to users when a severely ill-conditioned data set is encountered. These warnings may be the only indication that practitioners have regarding the impossibility of computing accurate answers. Twenty years ago an article by Longley (1967) caught the attention of the statistical community when he pointed out the sad state of numerical accuracy in statistical package regression algorithms. In the twenty years since this article, the statistical community has developed an impressive array of benchmark data sets which have been used to validate new software. We argue that two of the most frequently used data sets, Longley (1967), used to benchmark multiple linear regression procedures, and Anscombe (1967), used to benchmark standard deviation calculations, are biased towards overstating the numerical accuracy of algorithms.


The widespread use of Longley is apparent in the statistical computing reviews of The American Statistician, where almost every review reports this benchmark for the regression procedures in the package under review. We illustrate the tendency of these benchmarks to overstate accuracy and show that the source of this problem lies in the use of integer values by these benchmarks. Alternatives to each of these benchmarks, which contain flexible control parameters, are proposed. We show that these benchmarks provide a more realistic assessment of numerical accuracy under a variety of problematical data situations.

The paper proceeds as follows. Section 2 discusses the Anscombe and Longley benchmarks and their shortcomings. The nature of the problem created by using integer values in benchmarks is set forth here. Section 3 introduces the alternative benchmarks, which not only overcome the integer values problem but allow numerical accuracy to be examined in a continuum of ill-conditioning ranging from mild to severe. Both benchmarks allow us to increase the severity of cancellation error and of accumulation error. In addition, the alternative to the Longley benchmark contains parameters that allow ill-conditioning arising from different types of collinear relations to be explored. Section 3 also presents the results of applying the new benchmark to the regression procedure in the SAS statistical package.

2. The Anscombe and Longley benchmarks

Anscombe (1967) proposed a benchmark to test the numerical accuracy of variance and standard deviation calculations. A general form of the benchmark is

    X_i = γ + i,   i = 1, ..., 9,   (1)

where γ is a large number, typically a power of ten. The variance of the X_i is 7.5, assuming n - 1 is used in the denominator of the variance calculation. Anscombe contends that as γ increases, accurate computation of the variance (or standard deviation) becomes more difficult. The Anscombe benchmark is useful for illustrating the inadequacy of the notorious desk top calculator algorithm, also known as the one-pass algorithm. We show that this benchmark makes no distinction between the varying accuracy associated with the one-pass, two-pass, and provisional means algorithms.

The problem with the Anscombe benchmark derives from its use of integer values, which have an exact binary representation. The exact binary representation of integer values allows these magnitudes to be stored without truncation in a computer. The lack of truncation error in storing the initial benchmark values allows subsequent calculations to proceed much more accurately than would be the case if inexact binary representations existed at the outset.
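For concreteness, the three algorithms can be sketched as follows. This is a Python sketch under IEEE double precision rather than the 6-byte Turbo Pascal reals used for the tables below, so the absolute digit counts will differ; the provisional means algorithm is written here in the familiar updating (Welford-style) form.

    import math

    def one_pass_sd(x):
        # "Desk calculator" algorithm: accumulate the sum and the sum of squares in
        # a single pass, then form (sum_sq - sum**2/n)/(n - 1).  The subtraction
        # cancels badly when the data values are nearly constant.
        n = len(x)
        s = sum(x)
        s2 = sum(v * v for v in x)
        var = (s2 - s * s / n) / (n - 1)
        return math.sqrt(max(var, 0.0))   # cancellation can even drive the estimate negative

    def two_pass_sd(x):
        # First pass computes the mean; the second pass sums squared deviations.
        n = len(x)
        mean = sum(x) / n
        ss = sum((v - mean) ** 2 for v in x)
        return math.sqrt(ss / (n - 1))

    def provisional_means_sd(x):
        # Provisional (updating) means algorithm: the mean and the sum of squared
        # deviations are updated one observation at a time.
        mean, ss = 0.0, 0.0
        for i, v in enumerate(x, start=1):
            delta = v - mean
            mean += delta / i
            ss += delta * (v - mean)
        return math.sqrt(ss / (len(x) - 1))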


Table 1
Decimal digits of accuracy for one-pass, two-pass, and provisional means algorithms

             Algorithm    Original Anscombe    Modified Anscombe
γ = 10       2-Pass       12+                  12+
             P-Means      12+                  10.8
             1-Pass       12+                  10.5
γ = 100      2-Pass       12+                  11.8
             P-Means      12+                  10.1
             1-Pass       12+                  8.3
γ = 1000     2-Pass       12+                  10.4
             P-Means      12+                  9.0
             1-Pass       12+                  6.5
γ = 10000    2-Pass       12+                  12+
             P-Means      12+                  8.0
             1-Pass       12+                  4.2
γ = 100000   2-Pass       12+                  12+
             P-Means      12+                  7.0
             1-Pass       12+                  2.5
γ = 1E+7     2-Pass       12+                  7.4
             P-Means      12+                  6.0
             1-Pass       0.0                  0.5

Notes: Digits of accuracy reported were calculated using the formula acc = -log10{abs[(θ̂ - θ)/θ]}, where θ̂ is the computer estimate and θ is the true value. The algorithms are coded in Turbo Pascal, which uses 6-byte real numbers. The mantissa is 40 bits long, providing slightly more than 12 digits of accuracy.

To illustrate the nature of truncation or rounding error in storing decimal numbers, consider the binary representation of the decimal 0.1, shown in (2).

    0.0001 1001 1001 1001 ...   (2)

This representation has to be rounded or truncated when stored in a computer. The binary representation of tenths and hundredths produces a truncation situation analogous to the familiar decimal representation of thirds and ninths. A slight modification of the Anscombe benchmark which illustrates the problems inherent in the use of integer values for benchmark data sets involves dividing each Anscombe integer value by ten, i.e., X_i = γ/10 + i/10, i = 1, ..., 9. This produces numbers which for the most part have inexact binary representations, allowing us to explore the implications of starting with inexact instead of exact binary representations. Table 1 compares the results of using the original Anscombe and the modified Anscombe benchmarks to test the numerical accuracy of three major algorithms for computing the standard deviation.

We measure accuracy throughout this paper with a formula from Wampler (1980) that can be thought of as representing digits of accuracy. The formula is shown in (3),

    acc_j = -log10{abs[(θ̂_j - θ_j)/θ_j]}   (3)

where θ̂_j and θ_j are the estimated and actual coefficients respectively, and log10 is a base 10 logarithm. Digits of accuracy can be thought of as the base 10 log of the relative accuracy. As an example, assume θ̂_j = 1001 and θ_j = 1000. Then according to (3) the digits of accuracy will be 3.0. When θ̂ equals θ, we can say that the digits of accuracy are at least as great as the number of digits shown; that is, if θ̂ and θ both equal 1234.56, then we have at least 6 digits of accuracy.
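In code, the measure in (3) is just the negative base-10 logarithm of the relative error. A minimal Python sketch (the function name is ours):

    import math

    def digits_of_accuracy(estimate, truth):
        # Formula (3): minus the base-10 log of the relative error; exact agreement
        # is reported here as infinitely many digits.
        if estimate == truth:
            return math.inf
        return -math.log10(abs((estimate - truth) / truth))

    print(digits_of_accuracy(1001, 1000))   # the example from the text: 3.0 digits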


The Table 1 results clearly illustrate the impact of starting values with exact versus inexact binary representations. The original Anscombe benchmark produces digits of accuracy results which show no accuracy differences between the three algorithms. Twelve digits of accuracy are reported throughout, with the exception of the largest value for γ. This incorrectly implies that all three algorithms provide accurate results for all but the most extreme cases. The modified Anscombe benchmark shows a clearer distinction between the three algorithms. Declining accuracy is reported for both the provisional means and one-pass algorithms as γ increases in magnitude. The modified version of the benchmark also shows that the two-pass algorithm maintains an exceptionally high level of accuracy for all but the largest value of γ. This modification allows us to order the three algorithms from most accurate (two-pass) to least accurate (one-pass).

These results show that the original Anscombe benchmark provides a more optimistic view of accuracy than the modified benchmark. The reason for this is the error caused by the inexact binary representation. This error is amplified by further calculations in the modified benchmark. In contrast, the exact binary representations in the original benchmark have no error to be amplified by subsequent calculations. Table 2 shows an example of how small errors are amplified by subsequent calculations. The table shows a centered version of the original and modified Anscombe benchmarks with γ = 1000000. Centering causes a cancellation error of approximately 18 binary digits, which does not affect the original benchmark, since 20 of the 40 binary digits are not important given the exact representation. Centering does, however, have a severe impact on the modified benchmark. After centering, approximately 22 binary digits of accuracy remain, with 18 binary digits lost to cancellation error. This translates to about 5.4 decimal digits of accuracy which are lost.

These results lend some credence to a statement in Belsley (1986) that "an ill-conditioned transformation to obtain better conditioned data may be, computationally, jumping from the frying pan into the fire". It is clear that centering creates computationally better-conditioned data, allowing subsequent calculations to proceed with greater accuracy, but for some data sets this better conditioning comes at a price. We do not doubt the computational usefulness of centering, but neither do we wish this procedure to be viewed as a panacea. We believe that benchmarks like the original Anscombe have led the statistics community to overemphasize the importance of centering. There are those who have labeled ill-conditioning which is reduced by centering as "non-essential ill-conditioning" (Snee and Marquardt, 1984), since they believe that centering mitigates any problems inherent in this type of ill-conditioning.
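The effect is easy to reproduce on any binary floating-point machine. The Python sketch below (IEEE doubles rather than the 6-byte reals used for Table 2, so the error magnitudes differ) first checks whether a value is stored exactly, then shows the stored error resurfacing once the modified values are centered.

    from fractions import Fraction

    for v in (1000001, 100000.1):
        stored = Fraction(float(v))   # the binary value the machine actually holds
        exact = Fraction(str(v))      # the decimal value we intended to store
        if stored == exact:
            print(v, "stored exactly")
        else:
            print(v, "stored with error", float(stored - exact))

    # Centering the modified values exposes the storage error:
    x = [100000.0 + i / 10 for i in range(1, 10)]
    mean = sum(x) / len(x)
    print(x[0] - mean)   # close to, but not exactly, -0.4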


Table 2
Centering of the Anscombe and modified Anscombe benchmarks

Original benchmark with γ = 1000000 before centering
Decimal      Binary
1000001      1111 0100 0010 0100 0001.0000
1000002      1111 0100 0010 0100 0010.0000
1000003      1111 0100 0010 0100 0011.0000
1000004      1111 0100 0010 0100 0100.0000
1000005      1111 0100 0010 0100 0101.0000
1000006      1111 0100 0010 0100 0110.0000
1000007      1111 0100 0010 0100 0111.0000
1000008      1111 0100 0010 0100 1000.0000
1000009      1111 0100 0010 0100 1001.0000

Original benchmark with γ = 1000000 after centering
Decimal      Binary
-4           -100.0000
-3           -11.0000
-2           -10.0000
-1           -1.0000
 0           0.0000
 1           1.0000
 2           10.0000
 3           11.0000
 4           100.0000

Modified benchmark with γ = 1000000 before centering
Decimal      Binary
100000.1     11000 0110 1010 0000.0001 100
100000.2     11000 0110 1010 0000.0011 001
100000.3     11000 0110 1010 0000.0100 110
100000.4     11000 0110 1010 0000.0110 011
100000.5     11000 0110 1010 0000.1000 000
100000.6     11000 0110 1010 0000.1001 100
100000.7     11000 0110 1010 0000.1011 001
100000.8     11000 0110 1010 0000.1100 110
100000.9     11000 0110 1010 0000.1110 011

Modified benchmark with γ = 1000000 after centering
Decimal          Binary
-0.399999976     -0.0110 0110 0110
-0.299999952     -0.0100 1100 1100
-0.199999928     -0.0011 0011 0011
-0.099999905     -0.0001 1001 1001
 0.0              0.0000 0000 0000
 0.099999905      0.0001 1001 1001
 0.199999928      0.0011 0011 0011
 0.299999952      0.0100 1100 1100
 0.399999976      0.0110 0110 0110


Table 3
The Longley benchmark data and variable names

Implicit    Gross                    Size of    Population
price       national     Un-        armed      14 years     Time     Total
deflator    product      employed   forces     and older    trend    employment
 83.0       234289       2356       1590       107608       1947     60323
 88.5       259426       2325       1456       108632       1948     61122
 88.2       258054       3682       1616       109773       1949     60171
 89.5       284599       3351       1650       110929       1950     61187
 96.2       328975       2099       3099       112075       1951     63221
 98.1       346999       1932       3594       113270       1952     63639
 99.0       365385       1870       3547       115094       1953     64989
100.0       363112       3578       3350       116219       1954     63761
101.2       397469       2904       3048       117388       1955     66019
104.6       419180       2822       2857       118734       1956     67857
108.4       442769       2936       2798       120445       1957     68169
110.8       444546       4681       2637       121950       1958     66513
112.6       482704       3813       2552       123366       1959     68655
114.2       502601       3931       2514       125368       1960     69564
115.7       518173       4806       2572       127852       1961     69331
116.9       554894       4007       2827       130081       1962     70551

Our discussion and illustrations indicate that this would only be true for integer valued data sets, which we might in a joking manner label "non-essential data". Joking aside, the point here is that benchmarks should not assume integer valued data, since there is no guarantee that an algorithm will see only integer valued data. It should be stressed that this investigation only addresses a small part of the "non-essential ill-conditioning" debate, that of numerical accuracy. The investigation may, however, clear up some of the confusion regarding computational and statistical issues which seem somewhat blurred in this debate.

We now illustrate that these same problems are encountered with another widely cited benchmark for testing the numerical accuracy of regression algorithms, proposed by Longley (1967). This benchmark data set is shown in Table 3. In the table, the last column, total employment, is the dependent variable, while the remaining columns represent independent variables. A constant term not shown in Table 3 should be included when performing the analysis.

The Longley benchmark has the same problems as the Anscombe benchmark. Each of the independent variables except the first contains only integer values, leading to an exact binary representation. In addition, since there are exactly 16 observations, the mean of all but the first independent variable is an integer total divided by 16, resulting in an exact binary representation of these means. Using our reasoning from the discussion of the Anscombe benchmark, we would expect that the Longley benchmark provides an overly optimistic assessment of numerical accuracy.
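The storage argument can be checked directly. A small Python sketch, using one of the Longley GNP values for illustration (the function name is ours):

    from fractions import Fraction

    def stored_exactly(decimal_string):
        # True when the nearest binary floating-point value equals the decimal value itself.
        return Fraction(float(decimal_string)) == Fraction(decimal_string)

    print(stored_exactly("444546"))    # True:  a Longley GNP value, an integer
    print(stored_exactly("44454.6"))   # False: the same value divided by ten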

Table 4
Longley digits of accuracy for the regression procedure in MINITAB

a. Decimal digits of accuracy for MINITAB regression estimates

                    Constant   Implicit   Gross      Un-        Size of   Population   Time
                               price      national   employed   armed     14 years     trend
                               deflator   product               forces    and older
Original Longley    6.84       4.87       5.90       6.45       6.93      5.13         6.87
Modified Longley    3.94       3.25       3.54       4.19       4.59      3.24         3.95

Notes: The digits of accuracy reported were calculated using the formula acc = -log10{abs[(θ̂ - θ)/θ]}, where θ̂ is the computer estimate and θ is the true value. The column headings represent the variable names associated with the coefficient estimates whose accuracy is reported in the columns.

Table 4 shows the numerical accuracy of the original Longley benchmark along with a modified Longley benchmark which divides all variables by ten. Notice that the original Longley tends to report a greater number of decimal digits of accuracy for the MINITAB linear regression procedure. The reduction in reported digits of accuracy that arises from using the modified Longley benchmark is around 1.5 to 3.0 digits, which is fairly substantial.

To summarize, the Anscombe and Longley benchmarks provide an overly optimistic assessment of numerical accuracy. This optimism results from their use of integer values with exact binary representations. These benchmarks may have created the impression that centering removes "non-essential ill-conditioning". We show, however, that this is true only for integer valued data sets and that non-integer valued benchmarks provide a more realistic assessment of accuracy. The process of validating software must ensure accuracy for all data sets, including, and perhaps focusing on, worst case scenarios. It should be clear that these worst cases would be where ill-conditioning and non-integer values are prevalent.

An additional problem with the Longley and Anscombe data sets is their limited scope. There are many factors which influence accuracy, including the size of the data set. We believe that some of the small benchmarks, such as Anscombe, which contains 9 rows, and Longley, which contains 16 rows, are problematical when typical applications often involve hundreds or thousands of rows of data. While the modified Anscombe and Longley benchmarks provide a more realistic assessment of accuracy, they suffer from the same limited scope as the original benchmarks.

3. Alternative benchmarks

In Section 2 it was shown that both the Longley and Anscombe benchmarks present an overly optimistic view of numerical accuracy. We demonstrated that a simple division by ten produces benchmarks which ameliorate this problem.


However, these simple modifications lack the flexibility to test numerical accuracy over a wide range of problematical data conditions. We propose two benchmarks here that provide flexible control parameters which can be used to generate benchmark data sets with varying types of ill-conditioning. These benchmarks allow us to study the influence of cancellation error - the subtraction of two numbers roughly equal in magnitude - and accumulation error - the gradual build-up of small errors in lengthy calculations. First, we propose a two-parameter alternative to the Anscombe benchmark shown in (4) and (5),

    X_i = γ + φ(i),   i = 1, ..., I,   I an odd integer,   (4)

where

    φ(i) = 0.2   if i = 1,
           0.3   if i = 2, 4, 6, ...,
           0.1   if i = 3, 5, 7, ....   (5)

The parameter γ controls how nearly constant the data values are. As γ increases, the severity of cancellation error increases. The parameter γ can be varied through a range of values, allowing the benchmark to produce data which represent a continuum of ill-conditioning, ranging from mild to severe. The parameter I controls the size of the data set. As I increases, the number of computations increases, increasing the severity of accumulation round-off error. These two control parameters allow the benchmark to generate data sets which can test the ability of diagnostic checks to detect and warn for severe cases of both cancellation and accumulation error; a generator for the benchmark is sketched below.

Table 5 shows the results of using the alternative Anscombe benchmark to compare the three algorithms explored in Section 2: the one-pass, two-pass, and provisional means algorithms. The benchmark results illustrate the superiority of the two-pass algorithm over a wide range of data conditions. The results also indicate that the one-pass algorithm is very sensitive to both cancellation and accumulation error. The results in Table 5 point out the importance of diagnostic checks which test for the simultaneous presence of nearly constant data values and a large number of observations, since this combination of circumstances has the greatest degrading impact on numerical accuracy.
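Note that, in exact arithmetic, the deviations from the mean under (4) and (5) are 0, +0.1 and -0.1, so the sample variance is 0.01 and the target standard deviation is 0.1 for every γ and every odd I. The following Python sketch (function names are ours) generates the benchmark data:

    def phi(i):
        # phi(i) from (5): 0.2 for i = 1, 0.3 for even i, 0.1 for odd i >= 3.
        if i == 1:
            return 0.2
        return 0.3 if i % 2 == 0 else 0.1

    def alternative_anscombe(gamma, I):
        # Equation (4): X_i = gamma + phi(i), i = 1, ..., I, with I an odd integer.
        if I % 2 == 0:
            raise ValueError("I must be an odd integer")
        return [gamma + phi(i) for i in range(1, I + 1)]

    # gamma governs cancellation error (near-constant data); I governs accumulation error.
    # Feed the result to the standard deviation sketches given earlier and compare the
    # answers against the exact value 0.1.
    data = alternative_anscombe(gamma=1e4, I=2001)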


Table 5
Digits of accuracy for one-pass, two-pass and provisional means using the alternative benchmark

                          I = 21    I = 201   I = 2001
γ = 10        2-Pass      11.9      10.7      9.6
              P-Means     11.0      10.7      9.5
              1-Pass      9.0       8.0       7.4
γ = 100       2-Pass      10.8      10.3      9.3
              P-Means     10.2      10.3      9.6
              1-Pass      7.3       7.1       5.4
γ = 1000      2-Pass      9.6       9.6       9.2
              P-Means     9.2       9.2       9.4
              1-Pass      5.7       5.7       3.5
γ = 10000     2-Pass      7.8       7.8       7.7
              P-Means     7.1       6.9       6.9
              1-Pass      1.2       0.5       0.0
γ = 100000    2-Pass      6.6       6.6       6.3
              P-Means     6.0       5.9       5.7
              1-Pass      0.0       0.0       0.0
γ = 1E+7      2-Pass      6.0       6.2       5.0
              P-Means     5.2       5.0       4.3
              1-Pass      0.0       0.0       0.0

Notes: Digits of accuracy reported were calculated using the formula acc = -log10{abs[(θ̂ - θ)/θ]}, where θ̂ is the computer estimate and θ is the true value. The algorithms are coded in Turbo Pascal, which uses 6-byte real numbers. The mantissa is 40 bits long, providing slightly more than 12 digits of accuracy.

The second benchmark we propose represents a five-parameter modification of the Wampler (1980) benchmark to test the accuracy of linear regression algorithms. This benchmark is a superior alternative to the Longley benchmark. While Longley represents a single extreme case of ill-conditioning, the original Wampler benchmark contains two control parameters which allow for a continuum of ill-conditioning ranging from mild to extreme. Lesage and Simon (1985a) used the original Wampler benchmark to test the accuracy of multiple linear regression procedures in several micro-computer statistical packages. A three-parameter modification of the original Wampler benchmark was used in Lesage and Simon (1985b) to examine the impact of centering and scaling on the numerical accuracy of statistical algorithms, and by Simon and Lesage (1986) to explore the impact of collinear relations involving the intercept term. In the latter study, it was demonstrated that the diagnostic checks in the widely used mainframe statistical packages SAS and SPSS-X are unable to detect collinearities involving the intercept term. Here we propose a further modification to the modified Wampler benchmark found in Simon and Lesage (1986). The proposed five-parameter Wampler benchmark is shown in Figure 1. The modifications made here result in a benchmark which takes accumulation round-off error into account. The new benchmark subsumes the original Wampler benchmark as well as the three-parameter modification as special cases.

The parameter n controls the number of independent variables. Setting the parameter n = M will generate M - 1 independent variables including the constant term. A parameter b controls the rectangularity of the data set, such that b = 1 implies a nearly square data set having dimension (n by n - 1). Larger values of b imply data sets with many more observations than variables and allow the issue of accumulation round-off error to be explored. The parameters ε and γ control two types of collinear relations that give rise to ill-conditioning. The first type of collinearity, controlled by γ, represents that arising from a separate near linear relation between each independent variable and the intercept column. As γ becomes very large, ill-conditioning arising from this type of collinearity worsens, since each independent variable comes closer to being constant.

[Figure 1 sets out the benchmark. The n by (n - 1) block of X has a first column of ones; the entries of the remaining columns are of the form γδ^j, perturbed to (1 + γ)δ^j and to (γ + ε)δ^j in a small number of positions in each column. The elements of the corresponding block of Y are of the form (n - 2)γ + ε, with first and last elements (n - 1) + (n - 2)γ ± ε.]

Note: X is an (nb by n - 1) matrix, with the matrix shown repeated b times.
Note: Y is an (nb by 1) vector, with the vector shown repeated b times.

Fig. 1. The 5-parameter Wampler benchmark.

The second type of collinear relation occurs when ε is small. This creates a separate near linear relation between each pair of independent variables. In other words, each pairwise correlation is close to +1. Finally, the parameter δ controls scaling, with values of δ greater than one increasing the column length and values of δ less than one decreasing this length. This control parameter would allow the impact of various schemes for scaling and centering of the data matrix to be explored.

The five-parameter Wampler data encompass the original Wampler data set as a special case. If we let γ = 0, δ = 1, and b = 1, we have the benchmark originally proposed by Wampler (1980). A three-parameter version of the benchmark, where b = 1 and δ = 1, was employed by Simon and Lesage (1986) to examine centering in the context of numerical accuracy. The study demonstrated that statistical packages using diagnostics based on centered data provide highly inaccurate results without producing warnings to the user. In addition, this study found that centering is useful for some algorithms, but it does not remove ill-conditioning from the data. Both of these findings support the position taken by Belsley (1984) in the "non-essential" ill-conditioning debate.

We focus our use of the proposed benchmark on the problem of accumulation round-off error, examining benchmark data sets containing numbers of observations ranging from small to very large. The focus on accumulation error was undertaken for two reasons. First, none of the current benchmarks allows this issue to be explored, and second, we are unaware of any diagnostic checks that test for the presence of this problem. If accumulation error is important, the packages should adjust their diagnostics according to the size of the problem.
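To make the construction concrete, the following Python sketch generates a design with the properties just described. It follows our reading of Figure 1 (an intercept column plus n - 2 regressors, each nearly constant at γδ^j, tied to the intercept by a (1 + γ)δ^j entry and kept barely distinguishable from the other regressors by a single (γ + ε)δ^j entry); the function name and the response y = Xβ with β a vector of ones are our own stand-ins, since Figure 1 specifies its own Y vector with a known exact solution.

    import numpy as np

    def wampler5_design(n=10, b=1, gamma=1.0, eps=1e-3, delta=1.0):
        # An n by (n - 1) block, stacked b times.  Column 0 is the intercept; column j
        # (j = 1, ..., n - 2) is nearly constant at gamma * delta**j, with a delta**j
        # perturbation in the first row and an eps * delta**j perturbation in row j.
        block = np.empty((n, n - 1))
        block[:, 0] = 1.0
        for j in range(1, n - 1):
            col = np.full(n, gamma * delta ** j)
            col[0] += delta ** j        # ties the regressor to the intercept column
            col[j] += eps * delta ** j  # keeps the regressors from being exactly collinear
            block[:, j] = col
        return np.vstack([block] * b)

    # Large gamma -> each regressor nearly constant; small eps -> regressors nearly
    # proportional to one another; large b -> more observations and more accumulation error.
    X = wampler5_design(n=10, b=100, gamma=1e3, eps=1e-3)
    beta_true = np.ones(X.shape[1])
    y = X @ beta_true   # a stand-in response with known coefficients (all ones)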


Table 6
The 5-parameter Wampler accuracy results

SAS PROC REG digits of accuracy of the worst slope term

ε        γ        b = 1    b = 10    b = 100    b = 1000
1E-2     1E+2     6.3      7.3       7.1        7.0
1E-2     1E+3     3.2      4.5       4.2        4.5
1E-2     1E+4     2.4      2.3       2.0        2.8
1E-3     1E+2     4.3      4.8       5.5        4.2
1E-3     1E+3     1.4      2.8       2.2        2.2
1E-3     1E+4              0.3       1.1        2.2
1E-4     1E+2     7.7      2.8       3.2        3.2
1E-4     1E+3     0.6      0.6       1.4        2.5
1E-4     1E+4     **       0.0       0.7        1.7

SAS PROC REG digits of accuracy of the intercept term

ε        γ        b = 1    b = 10    b = 100    b = 1000
1E-2     1E+2     7.7      7.2       5.6        4.8
1E-2     1E+3     5.3      3.5       2.6        2.4
1E-2     1E+4     1.6      1.9       0.0        0.0
1E-3     1E+2     7.5      7.0       6.0        4.8
1E-3     1E+3     5.0      3.5       2.5        2.4
1E-3     1E+4     1.3      0.9       0.0        0.0
1E-4     1E+2     7.7      7.5       6.1        5.7
1E-4     1E+3     5.4      3.8       3.0        2.3
1E-4     1E+4     **       1.1       0.0        0.0

Notes: The parameter n is set at 10, and the scaling parameter δ is set at 1. The SAS GLM regression procedure yielded very similar results. Digits of accuracy reported were calculated using the formula acc = -log10{abs[(θ̂ - θ)/θ]}, where θ̂ is the computer estimate and θ is the true value. The symbol - denotes that SAS printed a warning to the user that the data set was ill-conditioned. The ** symbol indicates less than zero digits of accuracy.

Table 6 shows the results of an investigation of accuracy in SAS using the five-parameter Wampler benchmark. For simplicity we set the scaling parameter δ equal to one and the value of n equal to 10 throughout. As the parameter ε decreases, the correlation for each pair of independent variables increases toward +1. As the parameter γ increases, each independent variable grows more nearly constant. Thus we can control two very different sources of ill-conditioning with our control parameters in the benchmark data set. As the number of blocks, b, increases, the accumulation error increases.

Some of the results in Table 6 parallel results in Simon and Lesage (1986). We see that accuracy declines for either type of ill-conditioning. It is also clear that SAS fails to screen for nearly constant variables and produces inaccurate results with no warning to the user. As the number of blocks increases, the accuracy of the intercept estimate declines consistently. Accuracy of the slope estimates shows no obvious trend as the number of observations increases. Thus, the intercept term is more sensitive to accumulation error than the other estimates. It may be that this sensitivity arises simply because of the position occupied by the intercept column as the first variable vector entering the algorithm.


This sensitivity is currently the subject of another investigation we are undertaking, which should uncover the relationship between the positioning of the intercept column and accumulation error. Our interpretation of the Table 6 results would be that there is much room for improvement in the diagnostic checks used by SAS to screen out data sets with nearly constant variables. Reporting estimates which have few or no digits of accuracy without warning is clearly problematical. A more extensive investigation of the accumulation error would be necessary before any conclusions could be drawn here. In this area it would seem that a comparison with alternative algorithms would be helpful in judging the severity of these types of problems.
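The scoring procedure itself is mechanical and is easy to reproduce for other regression routines. The sketch below applies formula (3) to the output of an ordinary least-squares solver - numpy's lstsq is used purely as a stand-in for SAS PROC REG - for one cell of the Table 6 grid, using the wampler5_design sketch given earlier; because y here is formed in floating point from β = 1 rather than from the exact construction in Figure 1, the reported digits are only indicative.

    import numpy as np

    def digits_of_accuracy_vec(estimates, truth):
        # Formula (3) applied elementwise; exact agreement is capped at about 15.7 digits.
        rel = np.abs((np.asarray(estimates, dtype=float) - truth) / truth)
        rel = np.maximum(rel, np.finfo(float).eps)
        return -np.log10(rel)

    X = wampler5_design(n=10, b=100, gamma=1e3, eps=1e-3)   # from the sketch above
    beta_true = np.ones(X.shape[1])
    y = X @ beta_true
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    acc = digits_of_accuracy_vec(beta_hat, beta_true)
    print("intercept:", acc[0], "worst slope:", acc[1:].min())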

4. Conclusions

Two currently popular benchmarks, the Longley and Anscombe data sets, were shown to provide unduly optimistic results about numerical accuracy. By dividing each data value in these benchmarks by ten, we get a more realistic assessment of accuracy. The modified benchmarks consist of non-integer values which have inexact binary representations, and they provide a more rigorous test of statistical packages and algorithms. In our opinion, the popularity of the Longley and Anscombe benchmarks is unwarranted, since statistical packages are not restricted to use with integer valued data sets. The results produced by these benchmarks cannot be considered representative of the accuracy provided in routine use. Some alternative benchmarks with greater flexibility, which avoid the integer values problem, were suggested as replacements for the Longley and Anscombe benchmarks in validating statistical packages.

Acknowledgements

The authors would like to thank an Associate Editor and two anonymous referees for helpful comments on this paper.

References

F.J. Anscombe, Topics in the investigation of linear relations fitted by the method of least squares, Journal of the Royal Statistical Society B 29 (1967) 1-129.
D.A. Belsley, Demeaning conditioning diagnostics through centering, The American Statistician 38 (1984) 73-77.
D.A. Belsley, Centering, first differences, and the condition number, in: D.A. Belsley and Edwin Kuh, eds., Model Reliability (MIT Press, Cambridge, MA, 1986).
J.P. Lesage and S.D. Simon, Numerical accuracy of statistical algorithms for microcomputers, Computational Statistics and Data Analysis 3 (1985a) 47-57.
J.P. Lesage and S.D. Simon, The impact of centering and scaling on numerical accuracy of regression algorithms, Papers and Proceedings of the IASTED Applied Simulation and Modeling (1985b) 100-103.
J.W. Longley, An appraisal of least squares programs for the electronic computer from the point of view of the user, Journal of the American Statistical Association 62 (1967) 819-841.
S.D. Simon and J.P. Lesage, The impact of ill-conditioning involving the intercept on the numerical accuracy of regression, Bowling Green State University and University of Toledo Economics Departments Working Papers Series (1986).
R.D. Snee and D.W. Marquardt, Collinearity diagnostics depend on the domain of prediction, the model, and the data: a comment, The American Statistician 38 (3) (1984) 83-87.
R.H. Wampler, Test procedures and problems for least-squares algorithms, Journal of Econometrics 12 (1980) 3-22.
