Comparing the Ability of MS Excel and R while Simulating from Poisson Distribution

Dibyojyoti Bhattacharjee
Reader, Department of Business Administration, Assam University, Silchar, Assam
Email: [email protected]

Kishore K. Das
Reader, Department of Statistics, Gauhati University, Guwahati, Assam, India
Email: [email protected]

and

Tanushree Deb Roy

Published in Assam University Journal of Science and Technology, Vol. 4, No. II, pp. 1−6, 2009.

Submitted to Social Science Research Network (SSRN) on 26 June, 2010.
This paper can be downloaded without charge from the Social Science Research Network electronic library at: http://ssrn.com/

Other works of the author in SSRN can be viewed at: http://ssrn.com/author=1335385

©2010 by Dibyojyoti Bhattacharjee. All rights reserved. Short sections of text, not to exceed two paragraphs, may be quoted without explicit permission provided that full credit, including © notice, is given to the source.


Comparing the Ability of MS Excel and R while Simulating from Poisson Distribution

Tanushree Deb Roy1, Dibyojyoti Bhattacharjee2 and Kishore K. Das3

Abstract

Simulation from distributions is an important aspect of the study of statistical software. In this study a comparison is made between two software packages, viz. Excel and R, on their ability to simulate from the Poisson distribution. The accuracy of the estimates of the parameter of the said distribution was measured through the mean square error (MSE). The study suggests that Excel is in some ways better than R, at least for simulation from the Poisson distribution.

Keywords: Excel, R, simulation from Poisson distribution, Mean Square Error.

1. Introduction

This study compares two software packages, Excel and R, on their ability to simulate from the Poisson distribution. The popularity of these two packages is the motivation for selecting them for comparison. Simulation uses a mathematical model to recreate a situation, often repeatedly, so that the likelihood of various outcomes can be estimated more accurately. Simulation plays a central role because of the relative ease with which samples can often be generated from a probability distribution, even when the density function cannot be explicitly integrated (Sharma, 2006). To compare the two packages, random samples of different sizes were generated from the Poisson distribution, for different values of the parameter, in both Excel and R. The sample size (n) and the value of the parameter (λ) were fixed in advance. Based on the simulated data, the maximum likelihood estimate of the parameter λ (denoted $\hat{\lambda}$) was obtained for a fixed sample size and a fixed value of λ. This was replicated ten times, and the mean square error (MSE) was then obtained for the given sample size and the given value of λ.
The procedure was repeated keeping the value of λ fixed and varying the sample size. The MSE values were then plotted against the sample size for a fixed value of λ, giving the MSE curves. The calculations were done separately in Excel and in R, so each graph contains two MSE curves, one for Excel and one for R. The software with the smaller MSE values provides better estimates of λ and accordingly provides a better simulation. The motivation behind the work is a paper by McCullough (1998), which suggested that the generation of random numbers should be one of the criteria for software comparison.
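The procedure just described can be sketched in R as follows. This is a minimal illustration rather than the authors' actual script: the helper name mse_for, the fixed seed and the choice of λ = 2 are assumptions made for the example, while the ten replications and the sample sizes from 5 to 30 follow the study design.

```r
# Illustrative sketch only; mse_for, the seed and lambda = 2 are assumptions.
set.seed(1)  # fixed seed so that the example is reproducible

# MSE of the MLE (the sample mean) over `reps` simulated samples of size n
mse_for <- function(n, lambda, reps = 10) {
  estimates <- replicate(reps, mean(rpois(n, lambda)))  # one MLE per sample
  mean((estimates - lambda)^2)                          # average squared error
}

sizes  <- c(5, 10, 15, 20, 25, 30)  # sample sizes used in the study
lambda <- 2                         # one of the parameter values considered
mse    <- sapply(sizes, mse_for, lambda = lambda)

# One MSE value per sample size, i.e. the points of one MSE curve
print(data.frame(n = sizes, mse = mse))
```

Changing the seed changes the simulated samples and hence the MSE values obtained.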

1 Department of Statistics, G. C. College, Silchar-788004.
2 Department of Business Administration, Assam University, Silchar-788011.
3 Department of Statistics, Gauhati University, Guwahati-781014.


2. Literature Review

Several writers have reviewed statistical software for microcomputers and offered very useful comments to both users and vendors. Some of these reviews are comprehensive and general (Searle, 1989). The American Statistical Association recommended a comprehensive study of the performance of statistical packages (Francis, Heiberger and Velleman, 1975). This project, subsequently modified and published in monograph form (Francis, 1979, 1981), was the first systematic attempt to evaluate the performance of the software used in academia and industry for critical statistical applications. Other literature in this regard includes Longley (1967), Wampler (1970), Wilkinson and Dallal (1977), Anscombe (1967), Hayes (1982), Wilkinson (1985), Simon and Lesage (1988, 1989), Wilkinson (1994), Buja, Cook and Swayne (1996), Knüsel (1998), Rogers, Filliben and others (1998), L’Ecuyer (1999), etc.

Okunade and others (1993) compared the summary statistics output of regression analysis in commonly used statistical and econometric packages such as SAS, SPSS, SHAZAM, TSP, and BMDP. Oster (1998) reviewed five statistical software packages (EPI INFO, EPICURE, EPILOG PLUS, STATA, and TRUE EPISTAT) according to criteria of most interest to epidemiologists, biostatisticians and others involved in clinical research. McCullough (1998) proposed testing the accuracy of statistical software packages using Wilkinson’s Statistics Quiz in three areas: linear and nonlinear estimation, random number generation, and statistical distributions. Later, McCullough (1999) applied his methodology to the statistical packages SAS, SPSS, and S-Plus. Zhou and others (1999) reviewed five software packages that can fit a generalized linear mixed model for data with more than a two-level structure and multiple independent variables. Bergmann and others (2000) compared 11 statistical packages on a real dataset. These packages are SigmaStat 2.03, SYSTAT 9, JMP 3.2.5, S-Plus 2000, STATISTICA 5.5, UNISTAT 4.53b, SPSS 8, Arcus Quickstat 1.2, Stata 6, SAS 6.12, and StatXact 4. They found that different packages could give very different outcomes for the Wilcoxon-Mann-Whitney test.

3. About the Software Packages

MS Excel is a Windows-based spreadsheet developed by Microsoft Corporation. It includes all the features of a spreadsheet package, such as recalculation, graphs and functions. It provides many statistical, financial and scientific functions, which accounts for its wide acceptance in scientific and engineering environments for analyzing data. In addition, Excel has an add-in called the Analysis ToolPak that can be used for different types of statistical analysis, including simulation from distributions. To activate the data analysis tools, choose Tools → Add-Ins and tick the box next to Analysis ToolPak. If this option is not available, it must first be added to the Excel installation.

R is an integrated software facility for data manipulation, calculation and graphical display. It has a suite of operators for calculations on arrays; a large, coherent and integrated collection of intermediate tools for data analysis; graphical facilities for data analysis; and a well developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities. Many modern statistical techniques have been implemented in R, and many statistical software packages, including SPSS, benefit from them. Thus, R has been widely accepted in the scientific world in general and in the statistical community in particular.

3. Methodology of Comparison

Siméon Denis Poisson (1781-1840) discovered the Poisson distribution, which was published in 1837. A random variable X is said to follow the Poisson distribution if it assumes only non-negative integer values and its probability mass function is given by

$$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}; \quad x = 0, 1, 2, \ldots \text{ and } \lambda > 0$$

where λ is the parameter of the Poisson distribution. If X = (X1, X2, …, Xn) is a random sample from a Poisson distribution, then the maximum likelihood estimator (MLE) of the parameter of the Poisson distribution is given by

$$\hat{\lambda} = \hat{\lambda}(X) = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x} \qquad \ldots (1)$$

3.1 Mean Square Error

The quantity $E_\lambda[\hat{\lambda}(X) - \lambda]^2$ is called the mean square error of the estimator $\hat{\lambda}(X)$ about λ:

$$\mathrm{MSE}(\hat{\lambda}(X)) = E_\lambda[\hat{\lambda}(X) - \lambda]^2 = V_\lambda(\hat{\lambda}(X)) + [\mathrm{Bias}(\hat{\lambda}(X), \lambda)]^2$$

where $V_\lambda(\hat{\lambda}(X))$ is the variance of the estimator. If the estimator is unbiased for λ, $\mathrm{MSE}(\hat{\lambda}(X))$ reduces to $V_\lambda(\hat{\lambda}(X))$. For the Poisson distribution, the MSE is calculated from the simulated samples by the formula

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{\lambda}_i - \lambda)^2 \qquad \ldots (2)$$

where N = number of samples considered, $\hat{\lambda}_i$ gives the estimate of λ in the ith repetition, and λ = mean of the Poisson distribution. It is clear from the formula that a good estimate of λ will be closer to the actual value and accordingly will have a smaller MSE value.
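As a point of reference, the theoretical value that formula (2) estimates can be worked out from the decomposition above. The derivation below is a standard result and is not stated in the original paper; it uses only the facts that the sample mean is unbiased for λ and that each Poisson observation has variance λ.

```latex
\begin{align*}
E_\lambda[\bar{X}] &= \lambda
  \quad\Rightarrow\quad \mathrm{Bias}(\hat{\lambda}(X), \lambda) = 0, \\
V_\lambda(\bar{X}) &= \frac{1}{n^{2}} \sum_{i=1}^{n} V_\lambda(X_i) = \frac{\lambda}{n}, \\
\mathrm{MSE}(\hat{\lambda}(X)) &= V_\lambda(\bar{X}) + 0^{2} = \frac{\lambda}{n}.
\end{align*}
```

For example, with λ = 2 the theoretical MSE is 0.4 for n = 5 and about 0.067 for n = 30, so the quantity decreases in proportion to 1/n.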

3.2 Simulation and Random Number Generators

To simulate random numbers from the Poisson distribution in Excel, we click ‘Tools → Data Analysis → Random Number Generation’. From the window that appears we select the Poisson distribution and provide the inputs, viz. the mean of the Poisson distribution, the sample size, etc.; the required sample is returned in a new worksheet. The mean of the sample observations gives the MLE of the Poisson parameter.

In R, the general command for drawing a random sample from a Poisson distribution is rpois(n, lambda), where n is the number of random numbers desired, i.e. the sample size, and lambda is a real number giving the value of λ considered. The following command sequence can be used repeatedly to obtain a random sample of a given size from a Poisson distribution for a given value of λ, together with the corresponding MLE of λ.

> x <- rpois(n, lambda)
> mean(x)

Samples of sizes n = 5, 10, 15, 20, 25 and 30 are generated for λ = 0.2, 0.5, 0.8, 1.2, 1.5, 1.75, 2 and 2.5, and the MLEs of λ, i.e. $\hat{\lambda}_i$, are obtained. For each sample size and each value of λ this is repeated ten times and the corresponding MSE values are calculated.

3.3 MSE Curves

To obtain the MSE curves, the MSE values of the estimates (computed by the procedure discussed in 3.2) were plotted against the sample sizes for a fixed value of λ. Since the MSE values come from two sources, viz. Excel and R, each graph contains two MSE curves, which makes the comparison feasible. Since eight values of λ were considered, viz. 0.2, 0.5, 0.8, 1.2, 1.5, 1.75, 2 and 2.5, there are eight graphs in all. The MSE curve that remains closer to the x-axis corresponds to the better estimate.

4. Calculation and Results

Based on the methodology discussed above, simulations were run in each software package, the relevant calculations were performed and the graphs were drawn. The MSE values and the graphs are provided in Appendix I and II respectively. From the graphs it is apparent that R cannot be considered a better simulator of the Poisson distribution than Excel, especially for smaller sample sizes (viz. 5 and 10). Even for sample sizes above ten, Excel often performs better than R, though not always. Another important point is that the MSE is expected to decline as the sample size increases, because the variance of the sample mean decreases as n grows. R does not follow that rule consistently either: in Figures 2, 3, 5, etc., the MSE values show irregular jumps at the tails.

5. Future Directions

In spite of the wide acceptance of R, it is surprising that the software did not provide reliable simulation from the Poisson distribution.
Excel, which is not recommended for statistical analysis by many experts in the field (see Knüsel (1998)), seems to perform better. With an increase in the number of repetitions the results may change, but only marginally. However, it is essential to take up such comparative studies for other commonly used statistical distributions, so that a conclusion can be reached on the simulation capability of R.

References

• Anscombe, F. (1967). Computing in Statistical Science through APL, Springer-Verlag, New York.
• Bergmann, R., Ludbrook, J., and Spooren, W. (2000). Different Outcomes of the Wilcoxon-Mann-Whitney Test From Different Statistical Packages, The American Statistician, 54, 72-77.
• Buja, A., Cook, D., and Swayne, D. F. (1996). Interactive high-dimensional data visualization, Journal of Computational and Graphical Statistics, 5(1), 78-99.
• Dallal, G. E. (1992). The Computer Analysis of Factorial Experiments With Nested Factors, The American Statistician, 46, 240.
• Francis, I. (1979). A Comparative Review of Statistical Software, International Association for Statistical Computing, Voorburg, Netherlands.
• Francis, I. (1981). Statistical Software: A Comparative Review, North Holland, New York.
• Francis, I., Heiberger, R. M., and Velleman, P. F. (1975). Criteria and considerations in the evaluation of statistical program packages, The American Statistician, 29, 52-56.
• Harris, Mathew. Microsoft Excel 2000 Programming in 21 Days, SAMS Techmedia.
• Hayes, A. (1982). Statistical Software: A Survey and Critique of its Development, Office of Naval Research, Arlington, VA.
• Knüsel, L. (1998). On the accuracy of statistical distributions in Microsoft Excel 97, Computational Statistics and Data Analysis, 26, 375-377.
• L’Ecuyer, P. (1999). Random Number Generation, in Handbook on Simulation, ed. J. Banks, New York: Wiley, 93-138.
• McCullough, B. D. (1998). Assessing the Reliability of Statistical Software: Part I, The American Statistician, Vol. 52, No. 4, pp. 358-366.
• McCullough, B. D. (1999). Assessing the Reliability of Statistical Software: Part II, The American Statistician, Vol. 53, No. 2, pp. 149-159.
• Mukhopadhyay, Parimal (2002). Mathematical Statistics, Books & Allied (P) Ltd., Calcutta 700009.
• Okunade, A., Chang, C., and Evans, R. (1993). Comparative Analysis of Regression Output Summary Statistics in Common Statistical Packages, The American Statistician, 47, 298-303.
• Oster, R. A. (1998). An examination of Five Statistical Software Packages for Epidemiology, The American Statistician, 52, 267-280.
• Rogers, J., Filliben, J., Gill, L., Guthrie, W., Lagergren, E., and Vangel, M. (1998). StRD: Statistical Reference Datasets for Assessing the Numerical Accuracy of Statistical Software, NIST TN# 1396, National Institute of Standards and Technology.
• Sawitzki, G. (1994). Testing Numerical Reliability of Data Analysis Systems, Computational Statistics & Data Analysis, Vol. 18, No. 2, pp. 269-286.
• Searle, S. R. (1989). Statistical Computing Packages: Some Words of Caution, The American Statistician, 43, 189-190.
• Searle, S. R. (1994). Analysis of Variance Computing Package Output for Unbalanced Data From Fixed Effects Models with Nested Factors, The American Statistician, 48, 148-153.
• Sharma, J. K. (2003). Operations Research: Theory And Applications, Macmillan India Ltd., New Delhi 110002.
• Simon, S. D., and Lesage, J. P. (1988). Benchmarking numerical accuracy of statistical algorithms, Computational Statistics and Data Analysis, 7, 197-209.
• Simon, S. D., and Lesage, J. P. (1989). Assessing the accuracy of ANOVA calculations in statistical software, Computational Statistics and Data Analysis, 8, 325-332.
• Wampler, R. H. (1970). A report on the accuracy of some widely used least squares computer programs, Journal of the American Statistical Association, 65, 549-565.
• Wilkinson, L. (1985). Statistics Quiz, Evanston, IL: SYSTAT, Inc. (available at http://www.tspintl.com/benchmarks).
• Wilkinson, L. (1994). Practical Guidelines for Testing Statistical Software, in Computational Statistics, eds. P. Dirschedl and Rüdiger Ostermann, Berlin: Physica-Verlag, 111-124.
• Wilkinson, L., and Dallal, G. E. (1977). Accuracy of sample moments calculations among widely used statistical programs, The American Statistician, 31, 128-131.
• Zhou, X., Perkins, A., and Hui, S. (1999). Comparisons of Software Packages for Generalized Linear Multilevel Models, The American Statistician, 53, 282-290.

APPENDIX I

MSE Values for different values of λ

λ = 0.2
n     Excel       R
5     0.036       0.052
10    0.03        0.026
15    0.012889    0.027556
20    0.00925     0.01
25    0.00688     0.00624
30    0.003222    0.005333

λ = 0.5
n     Excel       R
5     0.034       0.202
10    0.052       0.041
15    0.028667    0.021556
20    0.01175     0.0135
25    0.00488     0.0068
30    0.005556    0.016556

λ = 0.8
n     Excel       R
5     0.112       0.196
10    0.059       0.045
15    0.02933     0.040889
20    0.02725     0.03075
25    0.02496     0.0392
30    0.032222    0.034222

λ = 1.2
n     Excel       R
5     0.304       0.376
10    0.138       0.14
15    0.099556    0.085778
20    0.112       0.03575
25    0.07312     0.07648
30    0.015889    0.020111

λ = 1.5
n     Excel       R
5     0.25        0.634
10    0.103       0.177
15    0.068667    0.058
20    0.042       0.0755
25    0.0244      0.11944
30    0.026111    0.031

λ = 1.75
n     Excel       R
5     0.2625      0.2845
10    0.1845      0.1785
15    0.170722    0.090944
20    0.10125     0.20825
25    0.05082     0.06634
30    0.0345      0.056278

λ = 2
n     Excel       R
5     0.312       0.396
10    0.0269      0.181
15    0.1253333   0.227556
20    0.1075      0.094
25    0.0792      0.02688
30    0.046556    0.100556

λ = 2.5
n     Excel       R
5     0.298       0.474
10    0.255       0.378
15    0.259778    0.125556
20    0.1565      0.1985
25    0.13416     0.04072
30    0.037444    0.072667
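The tabulated values can be turned back into an MSE curve with a few lines of R. The sketch below is illustrative only: it transcribes the λ = 0.2 values from this appendix, the variable names are assumptions, and the extra column lambda / n is the theoretical MSE of the sample mean, a standard result that is not part of the original tables.

```r
# Values transcribed from the lambda = 0.2 table of Appendix I
n         <- c(5, 10, 15, 20, 25, 30)
lambda    <- 0.2
mse_excel <- c(0.036, 0.03, 0.012889, 0.00925, 0.00688, 0.003222)
mse_r     <- c(0.052, 0.026, 0.027556, 0.01, 0.00624, 0.005333)

# Assumption: lambda / n is added as a reference column (not in the paper)
tab <- data.frame(n = n, excel = mse_excel, r = mse_r, theory = lambda / n)
print(tab)

# Plot the curves against sample size, using the axis labels of Appendix II
matplot(n, as.matrix(tab[, c("excel", "r", "theory")]),
        type = "b", pch = 1:3, lty = 1:3, col = 1,
        xlab = "Sample Size", ylab = "MSE Values")
legend("topright", legend = c("Excel", "R", "lambda / n"),
       pch = 1:3, lty = 1:3)
```

The same pattern applies to the other seven tables by replacing lambda and the two MSE vectors.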


APPENDIX II

MSE curves from Excel and R for different values of λ

[Eight line charts, one for each of λ = 0.2, 0.5, 0.8, 1.2, 1.5, 1.75, 2 and 2.5, each titled "MSE curves for λ = ..." and plotting MSE Values (y-axis) against Sample Size (x-axis, 0 to 40) with two series, Excel and R.]