
On Using Empirical Distribution Function Plot for Checking the Normality Assumption of a Data Set

Dibyojyoti Bhattacharjee
Reader, Department of Business Administration, Assam University, Silchar, Assam. Email: [email protected]

and

Kishore K. Das
Reader, Department of Statistics, Gauhati University, Guwahati, Assam, India. Email: [email protected]

Published in Assam Statistical Review, Vol. 22, No. 1, pp. 50−58, 2009.

Submitted to Social Science Research Network (SSRN) on 28 June, 2010. This paper can be downloaded without charge from the Social Science Research Network electronic library at: http://ssrn.com/

Other works of the author in SSRN can be viewed at: http://ssrn.com/author=1335385

©2010 by Dibyojyoti Bhattacharjee. All rights reserved. Short sections of text, not to exceed two paragraphs, may be quoted without explicit permission provided that full credit, including © notice, is given to the source.

Abstract

Since the normal distribution is the gatekeeper to several statistical procedures, the conformity of a given data set to the normality assumption is often of concern to statisticians. Numerous parametric and non-parametric tests are available for testing goodness of fit; commonly used ones include the chi-square goodness of fit test, the Anderson-Darling test, the Kolmogorov-Smirnov (K-S) test and the Shapiro-Wilk normality test. Some graphical techniques, such as the normal probability plot and the quantile plot, are also available to check the goodness of fit to the hypothetical distribution under consideration. The aim of this paper is to develop a graphical tool based on the empirical distribution function and the Kolmogorov-Smirnov (K-S) statistic so that the normality of a given data set can be checked as well as visualized.

Key words: Empirical Distribution Function, Kolmogorov-Smirnov statistic, Statistical Graphics, distribution fitting.

1. Introduction

Checking the normality of a data set is a routine exercise for statisticians, especially those involved in data analysis. A number of efficient statistical tests, both parametric and non-parametric, exist for the purpose. Most statistical software offers its users several options, such as the Anderson-Darling test, the Kolmogorov-Smirnov (K-S) test and the Shapiro-Wilk normality test. Many graphical techniques are also available to check the goodness of fit to the hypothetical distribution under consideration; mention can be made of the probability plot developed by Wilk and Gnanadesikan (1968), the quantile plot and the hanging rootogram of Tukey (1977). Kotz and Johnson (1988), Rice (1995) and others refer to such graphical techniques as informal techniques for data analysis and model assessment. Fisher (1983), Bickel and Doksum (1977), Shapiro (1990), Satten (1995) and Kendall and Stuart (1991), however, describe various graphical techniques that can be used for formal statistical inference.

2. The Plot and its Logic

Let (yi, i = 1, 2, …, n) be a random sample of size n on Y, and let F(y) be the cumulative distribution function (CDF) of the random variable. Also, let (y(i), i = 1, 2, …, n) denote the corresponding order statistics. The empirical CDF is given by

Fn(y) = (no. of observations ≤ y) / n

Thus this function is a step function lying between 0 and 1, with a jump of size 1/n at each observation. This empirical distribution function, popularly called the EDF, is used as the non-parametric estimate of the actual CDF. Under the assumption that the head length data from Anderson (1958) follow a normal distribution, we draw the corresponding CDF along with the EDF against the corresponding values of y, as in Figure 1. In the case of a good fit the EDF plot, i.e. the step function, will grow around the plotted CDF. The calculations for the data, along with the expected probabilities for the raw data from Anderson (1958), are shown in Table 1.

Table 1: EDF values for the Anderson Data

| Head length | EDF | Expected F(y) | Head length | EDF | Expected F(y) |
|---|---|---|---|---|---|
| 163 | 0.041667 | 0.010201 | 188 | 0.541667 | 0.599821 |
| 174 | 0.083333 | 0.117556 | 188 | 0.583333 | 0.599821 |
| 174 | 0.125 | 0.117556 | 189 | 0.625 | 0.638988 |
| 175 | 0.166667 | 0.139089 | 190 | 0.666667 | 0.676748 |
| 176 | 0.208333 | 0.16316 | 191 | 0.708333 | 0.71277 |
| 176 | 0.25 | 0.16316 | 192 | 0.75 | 0.746771 |
| 179 | 0.291667 | 0.250492 | 192 | 0.791667 | 0.746771 |
| 181 | 0.333333 | 0.320177 | 195 | 0.833333 | 0.834717 |
| 181 | 0.375 | 0.320177 | 195 | 0.875 | 0.834717 |
| 183 | 0.416667 | 0.396867 | 197 | 0.916667 | 0.880744 |
| 183 | 0.458333 | 0.396867 | 197 | 0.958333 | 0.880744 |
| 186 | 0.5 | 0.518801 | 208 | 1 | 0.989564 |
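The EDF and expected F(y) columns of Table 1 can be recomputed with a few lines of code. The sketch below is only an illustration, not the authors' own procedure: the function name `edf_table` is hypothetical, the EDF is taken as i/n for the i-th ordered observation (as the table appears to do, including for tied values), and the normal parameters are assumed to be the sample mean and the standard deviation with divisor n, a choice that appears consistent with the tabulated values.

```python
from statistics import NormalDist, fmean, pstdev

# Head lengths (mm) as listed in Table 1, n = 24.
head_lengths = [163, 174, 174, 175, 176, 176, 179, 181, 181, 183, 183, 186,
                188, 188, 189, 190, 191, 192, 192, 195, 195, 197, 197, 208]


def edf_table(sample):
    """Return rows (y, i/n, F(y)) for the ordered sample under a fitted normal."""
    ys = sorted(sample)
    n = len(ys)
    # Assumption: mean and standard deviation (divisor n) estimated from the data.
    fitted = NormalDist(mu=fmean(ys), sigma=pstdev(ys))
    return [(y, i / n, fitted.cdf(y)) for i, y in enumerate(ys, start=1)]


rows = edf_table(head_lengths)
for y, edf, expected in rows:
    print(f"{y:5d}  {edf:.6f}  {expected:.6f}")
```

Each printed row corresponds to one (head length, EDF, expected F(y)) triple of the table.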

3. The Problem and its Solution

However, the goodness or lack of fit may not always be as clear as in this case (see Figure 1). For some data sets it may be difficult to decide about normality by comparing the CDF curve and the graph of the EDF alone. A solution to this problem is to derive a confidence band around the graph of the theoretical CDF. The band, whose width depends on the level of significance, can be derived using the Kolmogorov-Smirnov statistic. The K-S test is a well-known non-parametric test, based on the EDF, of whether a given random sample comes from a specific distribution. The test statistic, with the supremum taken over all y, is given by

D = Sup |Fn(y) – F(y)|
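As an illustration of how D can be evaluated for a sample, the following sketch uses the fact that the EDF only changes at the observations, so the supremum is attained either just after or just before a jump. The function name `ks_statistic` is hypothetical, and fitting the normal distribution with the sample mean and the divisor-n standard deviation is an assumption made for this example.

```python
from statistics import NormalDist, fmean, pstdev

# Head lengths (mm) from the Anderson data used in the text, n = 24.
head_lengths = [163, 174, 174, 175, 176, 176, 179, 181, 181, 183, 183, 186,
                188, 188, 189, 190, 191, 192, 192, 195, 195, 197, 197, 208]


def ks_statistic(sample, cdf):
    """Sup-distance between the EDF of `sample` and the callable `cdf`."""
    ys = sorted(sample)
    n = len(ys)
    # Just after the i-th jump the EDF equals i/n; just before it, (i - 1)/n.
    d_plus = max(i / n - cdf(y) for i, y in enumerate(ys, start=1))
    d_minus = max(cdf(y) - (i - 1) / n for i, y in enumerate(ys, start=1))
    return max(d_plus, d_minus)


# Assumption: normal parameters estimated from the sample itself.
fitted = NormalDist(mu=fmean(head_lengths), sigma=pstdev(head_lengths))
D = ks_statistic(head_lengths, fitted.cdf)
print(f"D = {D:.4f}")
```

Checking both one-sided distances matters because the EDF is a step function: comparing F(y) with i/n alone can miss the largest gap, which may occur immediately to the left of a jump.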

Figure 1: EDF plot for the Anderson Data (plot title: EDF Plot with Expected CDF; x-axis: Length of Heads in mm; y-axis: Probability; series: EDF, Normal CDF)

From the plot it appears that the data are a considerably good fit to the normal distribution. The null hypothesis that the given random sample comes from a population with distribution function F(.) is accepted if the calculated value of D is less than the critical value of D for the given sample size and level of significance. If the empirical CDF is plotted, a confidence band can easily be drawn around it based on the K-S statistic. Kotz and Johnson (1988) give a hint about the development of such bands. Here, the band is developed mathematically and then, based on the data from Anderson (1958), such a plot is produced.

Let Dα(n) be the critical value of the Kolmogorov-Smirnov statistic at the 100α% level of significance for sample size n, and let D be the calculated value of the K-S statistic. Under the null hypothesis we have

$$
\begin{aligned}
1 - \alpha &= P\big(D \le D_\alpha(n)\big) \\
&= P\big(\sup_y |F_n(y) - F(y)| \le D_\alpha(n)\big) \\
&= P\big(|F_n(y) - F(y)| \le D_\alpha(n) \ \forall\, y\big) \\
&= P\big(-D_\alpha(n) \le F_n(y) - F(y) \le D_\alpha(n) \ \forall\, y\big) \\
&= P\big(F(y) - D_\alpha(n) \le F_n(y) \le F(y) + D_\alpha(n) \ \forall\, y\big)
\end{aligned}
$$

Thus, [F(y) – Dα(n), F(y) + Dα(n)] is the 100(1 – α)% confidence band for Fn(y). The table value of Dα(n) for α = 0.05 and n = 24 is 0.27. The 95% confidence band around the expected CDF values is shown in Table 2.

Table 2: EDF values along with the K−S bounds

| Head length | EDF | Expected F(y) | Lower confidence band | Upper confidence band |
|---|---|---|---|---|
| 163 | 0.041667 | 0.010201 | -0.2598 | 0.280201 |
| 174 | 0.083333 | 0.117556 | -0.15244 | 0.387556 |
| 174 | 0.125 | 0.117556 | -0.15244 | 0.387556 |
| 175 | 0.166667 | 0.139089 | -0.13091 | 0.409089 |
| 176 | 0.208333 | 0.16316 | -0.10684 | 0.43316 |
| 176 | 0.25 | 0.16316 | -0.10684 | 0.43316 |
| 179 | 0.291667 | 0.250492 | -0.01951 | 0.520492 |
| 181 | 0.333333 | 0.320177 | 0.050177 | 0.590177 |
| 181 | 0.375 | 0.320177 | 0.050177 | 0.590177 |
| 183 | 0.416667 | 0.396867 | 0.126867 | 0.666867 |
| 183 | 0.458333 | 0.396867 | 0.126867 | 0.666867 |
| 186 | 0.5 | 0.518801 | 0.248801 | 0.788801 |
| 188 | 0.541667 | 0.599821 | 0.329821 | 0.869821 |
| 188 | 0.583333 | 0.599821 | 0.329821 | 0.869821 |
| 189 | 0.625 | 0.638988 | 0.368988 | 0.908988 |
| 190 | 0.666667 | 0.676748 | 0.406748 | 0.946748 |
| 191 | 0.708333 | 0.71277 | 0.44277 | 0.98277 |
| 192 | 0.75 | 0.746771 | 0.476771 | 1.016771 |
| 192 | 0.791667 | 0.746771 | 0.476771 | 1.016771 |
| 195 | 0.833333 | 0.834717 | 0.564717 | 1.104717 |
| 195 | 0.875 | 0.834717 | 0.564717 | 1.104717 |
| 197 | 0.916667 | 0.880744 | 0.610744 | 1.150744 |
| 197 | 0.958333 | 0.880744 | 0.610744 | 1.150744 |
| 208 | 1 | 0.989564 | 0.719564 | 1.259564 |
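The band columns of Table 2 follow from adding and subtracting the critical value 0.27 quoted in the text to the expected F(y). A minimal sketch is given below; the function name `ks_band` is hypothetical, and the fitted normal (sample mean, divisor-n standard deviation) is an assumption of this example rather than something stated in the paper.

```python
from statistics import NormalDist, fmean, pstdev

# Head lengths (mm) as listed in Table 2, n = 24.
head_lengths = [163, 174, 174, 175, 176, 176, 179, 181, 181, 183, 183, 186,
                188, 188, 189, 190, 191, 192, 192, 195, 195, 197, 197, 208]

D_CRIT = 0.27  # table value quoted in the text for alpha = 0.05 and n = 24


def ks_band(sample, d_crit):
    """Rows (y, EDF, F(y), lower, upper) for a constant-width K-S band."""
    ys = sorted(sample)
    n = len(ys)
    # Assumption: normal parameters estimated from the sample itself.
    fitted = NormalDist(mu=fmean(ys), sigma=pstdev(ys))
    rows = []
    for i, y in enumerate(ys, start=1):
        f = fitted.cdf(y)
        rows.append((y, i / n, f, f - d_crit, f + d_crit))
    return rows


band = ks_band(head_lengths, D_CRIT)
# Check the EDF against the band at the tabulated points, as in Table 2.
inside = all(lower <= edf <= upper for _, edf, _, lower, upper in band)
print("EDF inside the K-S band at every observation:", inside)
```

The bounds are deliberately left unclipped so that they can be compared with the table, where some values fall below 0 or above 1.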

Looking at the corresponding plot in Figure 2, one can infer that the random sample comes from a normal distribution. Here, the null hypothesis is accepted if the graph of the EDF lies within the confidence band for all values of y.

Figure 2: An EDF plot along with K−S Bounds (plot title: EDF Plot with K-S Bounds; x-axis: Length of Heads in mm; y-axis: Probability; series: EDF, Normal CDF, Lower Bound, Upper Bound)

4. Problem in the Bands and its Correction

The band itself, however, seems to be defective. The following points may be noted about it:

(a) The band has equal width throughout the graph. It is known that the tail regions are more sensitive than the middle, yet the plot makes the band unnecessarily broad in the tails.

(b) The ordinates of the band are in some cases greater than one and in some cases negative.

Keeping the first issue in mind, Bickel and Doksum (1977) improved the K-S statistic by dividing it by the factor {F(y)(1 – F(y))}^(1/2). This factor acts as a variance equalizer, and the resulting band is wider in the middle than at the tails and much narrower at the tails than the first band. Accordingly, the following expression may be derived.

$$
\begin{aligned}
1 - \alpha &= P\left[\sup_y \frac{|F_n(y) - F(y)|}{[F(y)(1 - F(y))]^{1/2}} \le D_\alpha(n)\right] \\
&= P\left[\frac{|F_n(y) - F(y)|}{[F(y)(1 - F(y))]^{1/2}} \le D_\alpha(n) \ \forall\, y\right] \\
&= P\left[-D_\alpha(n) \le \frac{F_n(y) - F(y)}{[F(y)(1 - F(y))]^{1/2}} \le D_\alpha(n) \ \forall\, y\right] \\
&= P\left[F(y) - \sqrt{F(y)(1 - F(y))}\, D_\alpha(n) \le F_n(y) \le F(y) + \sqrt{F(y)(1 - F(y))}\, D_\alpha(n) \ \forall\, y\right]
\end{aligned}
$$

Thus, $[F(y) - \sqrt{F(y)(1 - F(y))}\, D_\alpha(n),\ F(y) + \sqrt{F(y)(1 - F(y))}\, D_\alpha(n)]$ is the 100(1 – α)% Doksum confidence band for Fn(y). Table 3 shows the calculated values of the expected CDF and the corresponding Doksum bounds for α = 0.05.

Table 3: EDF and Doksum bounds of the data

| Head length | EDF | Expected F(y) | Lower Doksum bound | Upper Doksum bound |
|---|---|---|---|---|
| 163 | 0.041667 | 0.010201 | -0.01693 | 0.037332 |
| 174 | 0.083333 | 0.117556 | 0.030594 | 0.204518 |
| 174 | 0.125 | 0.117556 | 0.030594 | 0.204518 |
| 175 | 0.166667 | 0.139089 | 0.045658 | 0.23252 |
| 176 | 0.208333 | 0.16316 | 0.063392 | 0.262928 |
| 176 | 0.25 | 0.16316 | 0.063392 | 0.262928 |
| 179 | 0.291667 | 0.250492 | 0.133502 | 0.367482 |
| 181 | 0.333333 | 0.320177 | 0.19421 | 0.446144 |
| 181 | 0.375 | 0.320177 | 0.19421 | 0.446144 |
| 183 | 0.416667 | 0.396867 | 0.26477 | 0.528964 |
| 183 | 0.458333 | 0.396867 | 0.26477 | 0.528964 |
| 186 | 0.5 | 0.518801 | 0.383896 | 0.653706 |
| 188 | 0.541667 | 0.599821 | 0.467539 | 0.732103 |
| 188 | 0.583333 | 0.599821 | 0.467539 | 0.732103 |
| 189 | 0.625 | 0.638988 | 0.509309 | 0.768667 |
| 190 | 0.666667 | 0.676748 | 0.550464 | 0.803032 |
| 191 | 0.708333 | 0.71277 | 0.590603 | 0.834937 |
| 192 | 0.75 | 0.746771 | 0.629359 | 0.864183 |
| 192 | 0.791667 | 0.746771 | 0.629359 | 0.864183 |
| 195 | 0.833333 | 0.834717 | 0.734429 | 0.935005 |
| 195 | 0.875 | 0.834717 | 0.734429 | 0.935005 |
| 197 | 0.916667 | 0.880744 | 0.79324 | 0.968248 |
| 197 | 0.958333 | 0.880744 | 0.79324 | 0.968248 |
| 208 | 1 | 0.989564 | 0.962126 | 1.017002 |
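The Doksum bounds in Table 3 can be reproduced by scaling the critical value with the factor [F(y)(1 – F(y))]^(1/2) at each observation. The sketch below is an illustration under stated assumptions: the function name `doksum_band` is hypothetical, the normal parameters are estimated from the sample (mean, divisor-n standard deviation), and the value 0.27 is reused exactly as in the text; whether that critical value remains appropriate for the weighted statistic is not examined here.

```python
from math import sqrt
from statistics import NormalDist, fmean, pstdev

# Head lengths (mm) as listed in Table 3, n = 24.
head_lengths = [163, 174, 174, 175, 176, 176, 179, 181, 181, 183, 183, 186,
                188, 188, 189, 190, 191, 192, 192, 195, 195, 197, 197, 208]

D_CRIT = 0.27  # same table value as used for the constant-width K-S band


def doksum_band(sample, d_crit):
    """Rows (y, EDF, F(y), lower, upper) with half-width sqrt(F(1-F)) * d_crit."""
    ys = sorted(sample)
    n = len(ys)
    # Assumption: normal parameters estimated from the sample itself.
    fitted = NormalDist(mu=fmean(ys), sigma=pstdev(ys))
    rows = []
    for i, y in enumerate(ys, start=1):
        f = fitted.cdf(y)
        half_width = sqrt(f * (1.0 - f)) * d_crit
        rows.append((y, i / n, f, f - half_width, f + half_width))
    return rows


band = doksum_band(head_lengths, D_CRIT)
# List the observations whose tabulated EDF value falls outside the band.
outside = [(y, edf) for y, edf, _, lower, upper in band
           if not lower <= edf <= upper]
print("Observations with EDF outside the Doksum band:", outside)
```

Going by the values printed in Table 3, the first observation (EDF 0.041667 at 163) lies slightly above its upper bound of 0.037332, so a strict inside-the-band check of this kind would flag that point.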

With the corrected bands the plot overcomes the disadvantages discussed earlier (see Figure 3). The plot can be used as a quick tool to check the normality assumption of any data set.

Figure 3: An EDF plot along with the Doksum Bounds (plot title: EDF Plot with Doksum Bounds; x-axis: Length of Heads in mm; y-axis: Probability; series: EDF, Normal CDF, Lower Bound, Upper Bound)

It can also be used as a formal tool for data analysis, since the goodness of fit can be tested at different levels of significance. Software for drawing the plot, developed in any high-level computer language with graphical capabilities, can put an end to the manual computations involved in drawing it.

5. Concluding Remarks

The reliability of the plotting technique is linked with the Kolmogorov-Smirnov statistic and the Doksum bounds. The technique should not be considered a new method of testing the goodness of fit of a distribution but rather a visual representation of the K-S test. Though the discussion centers on the normality of the data set, the plot can also be used to check the fit of the data to any continuous distribution. Being entirely dependent on the K-S statistic, the plot rests on the same assumptions as the K-S test. The plot may also be used along with the test to provide a visualization of it.

6. References

Anderson, T. W. (1958), An Introduction to Multivariate Statistical Analysis. New York: Wiley, 58.

Bickel, P. J. and Doksum, K. (1977), Mathematical Statistics. San Francisco: Holden-Day.

Fisher, N. I. (1983), "Graphical methods in nonparametric statistics: A review and annotated bibliography." International Statistical Review, 51, 25−58.

Kendall, M. G. and Stuart, A. (1991), The Advanced Theory of Statistics, Vol. 2: Inference and Relationship. Hafner Press, New York.

Kotz, S. and Johnson, N. L. (1982), Encyclopedia of Statistical Sciences, Volume 1. John Wiley and Sons, Inc., 85−86.

Rice, J. A. (1995), Mathematical Statistics and Data Analysis. Duxbury Press, Belmont, California, University of California, Berkeley.

Satten, G. A. (1995), "Upper and lower bound distributions that give simultaneous confidence intervals for quantiles." Journal of American Statistical Association, 90, No. 430, 747−752.

Shapiro, S. S. (1990), "How to test normality and other distributional assumptions." American Society for Quality Control, Statistics Division.

Tukey, J. W. (1977), Exploratory Data Analysis (First Edition), Vol. 1, Ch. 5. Reading, Mass: Addison-Wesley Publishing Co.

Wilk, M. B. and Gnanadesikan, R. (1968), "Probability plotting methods for the analysis of data." Biometrika, 55, 1−17.
