On Using Empirical Distribution Function Plot for Checking the Normality Assumption of a Data Set

Dibyojyoti Bhattacharjee
Reader, Department of Business Administration, Assam University, Silchar, Assam, India
Email: [email protected]

and

Kishore K. Das
Reader, Department of Statistics, Gauhati University, Guwahati, Assam, India
Email: [email protected]

Published in Assam Statistical Review, Vol. 22, No. 1, pp. 50-58, 2009.

Submitted to Social Science Research Network (SSRN) on 28 June, 2010. This paper can be downloaded without charge from the Social Science Research Network electronic library at: http://ssrn.com/
Other works of the author in SSRN can be viewed at: http://ssrn.com/author=1335385

©2010 by Dibyojyoti Bhattacharjee. All rights reserved. Short sections of text, not to exceed two paragraphs, may be quoted without explicit permission provided that full credit, including © notice, is given to the source.
On Using Empirical Distribution Function Plot for Checking the Normality Assumption of a Data Set
Dibyojyoti Bhattacharjee and Kishore K. Das

Abstract
Since the normal distribution is the gatekeeper to several statistical procedures, the conformity of a given data set to the normality assumption is often of concern to statisticians. Numerous parametric and non-parametric tests are available for testing goodness of fit; commonly used ones include the chi-square goodness-of-fit test, the Anderson-Darling test, the Kolmogorov-Smirnov (K-S) test and the Shapiro-Wilk normality test. Some graphical techniques are also available to check the fit of a data set to a hypothesized distribution, such as the normal probability plot and the quantile plot. The aim of this paper is to develop a graphical tool based on the empirical distribution function and the Kolmogorov-Smirnov (K-S) statistic, so that the normality of a given data set can be checked as well as visualized.

Key words: Empirical Distribution Function, Kolmogorov-Smirnov statistic, Statistical Graphics, distribution fitting.

1. Introduction
Checking the normality of a data set is a routine exercise for statisticians, especially those involved in data analysis. A number of efficient statistical tests, both parametric and non-parametric, are available for the purpose, and most statistical software offers its users several options, such as the Anderson-Darling test, the Kolmogorov-Smirnov (K-S) test and the Shapiro-Wilk normality test. Many graphical techniques are also available to check the goodness of fit to a hypothesized distribution; mention may be made of the probability plot developed by Wilk
and Gnanadesikan (1968), the quantile plot, the hanging rootogram of Tukey (1977), etc. Kotz and Johnson (1988), Rice (1995) and others refer to such graphical techniques as informal techniques for data analysis and model assessment, while Fisher (1983), Bickel and Doksum (1977), Shapiro (1990), Satten (1995) and Kendall and Stuart (1991) describe various graphical techniques that can be used for formal statistical inference.

2. The Plot and its Logic
Let (yi, i = 1, 2, ..., n) be a random sample of size n on Y and let F(y) be the corresponding cumulative distribution function (CDF) of the random variable. Also, let (y(i), i = 1, 2, ..., n) denote the corresponding order statistics. The empirical CDF is given by

Fn(y) = (number of observations ≤ y) / n.

This function is a step function lying between 0 and 1, with jumps of size 1/n at each observation. The empirical distribution function, popularly called the EDF, is used as a non-parametric estimate of the actual CDF. Under the assumption that the head-length data from Anderson (1958) follow a normal distribution, the corresponding CDF is drawn along with the EDF against the values of y, as in Figure 1. It is obvious that in the case of a good fit the EDF plot, i.e. the step function, will grow around the plotted CDF. The calculations for the raw data from Anderson (1958), along with the expected probabilities, are shown in Table 1; a short computational sketch follows the table.

Table 1: EDF values for the Anderson data
Head length    EDF         Expected F(y)
163            0.041667    0.010201
174            0.083333    0.117556
174            0.125       0.117556
175            0.166667    0.139089
176            0.208333    0.16316
176            0.25        0.16316
179            0.291667    0.250492
181            0.333333    0.320177
181            0.375       0.320177
183            0.416667    0.396867
183            0.458333    0.396867
186            0.5         0.518801
188            0.541667    0.599821
188            0.583333    0.599821
189            0.625       0.638988
190            0.666667    0.676748
191            0.708333    0.71277
192            0.75        0.746771
192            0.791667    0.746771
195            0.833333    0.834717
195            0.875       0.834717
197            0.916667    0.880744
197            0.958333    0.880744
208            1           0.989564
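The entries of Table 1 can be reproduced with a few lines of code. The following Python sketch is only an illustration, assuming NumPy and SciPy are available and assuming the normal parameters are estimated from the sample itself (the divisor n for the standard deviation is an assumption that appears to match the tabulated values):

```python
import numpy as np
from scipy.stats import norm

# Head-length data (n = 24) as listed in Table 1
y = np.sort(np.array([163, 174, 174, 175, 176, 176, 179, 181, 181, 183, 183, 186,
                      188, 188, 189, 190, 191, 192, 192, 195, 195, 197, 197, 208]))
n = len(y)

# EDF at the ordered observations: Fn(y(i)) = i / n
edf = np.arange(1, n + 1) / n

# Expected CDF under normality, with parameters estimated from the sample
# (ddof = 0, i.e. divisor n, is an assumption made here)
mu, sigma = y.mean(), y.std(ddof=0)
expected = norm.cdf(y, loc=mu, scale=sigma)

for yi, e, f in zip(y, edf, expected):
    print(f"{yi:4d}  EDF = {e:.6f}  F(y) = {f:.6f}")
```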
3. The Problem and its Solution
However, the goodness or lack of fit may not always be as clear as in this example (refer to Figure 1). For some data sets it may be debatable whether the data are normal when one compares the CDF curve and the corresponding graph of the EDF alone. A solution to this problem is to derive a confidence band around the graph of the theoretical CDF. The band, whose width depends on the level of significance, can be derived using the Kolmogorov-Smirnov statistic, the well-known non-parametric test statistic that is used to test whether a given random sample comes from a specified distribution and which is itself based on the EDF. The test statistic is given by

D = Sup_y |Fn(y) - F(y)|.
Figure 1: EDF plot for the Anderson Data
[Figure 1 shows the EDF step function and the expected normal CDF plotted against head length; x-axis: length of heads in mm, y-axis: probability.]
From the plot it appears that the data fit the normal distribution reasonably well. The null hypothesis that the given random sample comes from a population with distribution function F(.) is accepted if the calculated value of D is less than the critical value of D for the given sample size and level of significance. If the empirical CDF is plotted, a confidence band can easily be drawn around the expected CDF based on the K-S statistic. Kotz and Johnson (1988) give a hint about the development of such bands. Here the band is developed mathematically and then, based on the data from Anderson (1958), such a plot is produced. Let Dα(n) be the critical value of the Kolmogorov-Smirnov statistic at significance level α, and let D be the calculated value of the K-S statistic. Then we have

1 - α = P(D ≤ Dα(n))
      = P(Sup_y |Fn(y) - F(y)| ≤ Dα(n))
      = P(|Fn(y) - F(y)| ≤ Dα(n) ∀ y)
      = P(-Dα(n) ≤ Fn(y) - F(y) ≤ Dα(n) ∀ y)
      = P(F(y) - Dα(n) ≤ Fn(y) ≤ F(y) + Dα(n) ∀ y).

Thus, [F(y) - Dα(n), F(y) + Dα(n)] is a 100(1 - α)% confidence band for Fn(y). The tabulated value of Dα(n) for α = 0.05 and n = 24 is 0.27. The resulting 95% confidence band for the expected CDF values is given in Table 2; a short computational sketch of the band follows the discussion of the table.

Table 2: EDF values along with the K-S bounds
Head length    EDF         Expected F(y)    Lower conf. band    Upper conf. band
163            0.041667    0.010201         -0.2598             0.280201
174            0.083333    0.117556         -0.15244            0.387556
174            0.125       0.117556         -0.15244            0.387556
175            0.166667    0.139089         -0.13091            0.409089
176            0.208333    0.16316          -0.10684            0.43316
176            0.25        0.16316          -0.10684            0.43316
179            0.291667    0.250492         -0.01951            0.520492
181            0.333333    0.320177          0.050177           0.590177
181            0.375       0.320177          0.050177           0.590177
183            0.416667    0.396867          0.126867           0.666867
183            0.458333    0.396867          0.126867           0.666867
186            0.5         0.518801          0.248801           0.788801
188            0.541667    0.599821          0.329821           0.869821
188            0.583333    0.599821          0.329821           0.869821
189            0.625       0.638988          0.368988           0.908988
190            0.666667    0.676748          0.406748           0.946748
191            0.708333    0.71277           0.44277            0.98277
192            0.75        0.746771          0.476771           1.016771
192            0.791667    0.746771          0.476771           1.016771
195            0.833333    0.834717          0.564717           1.104717
195            0.875       0.834717          0.564717           1.104717
197            0.916667    0.880744          0.610744           1.150744
197            0.958333    0.880744          0.610744           1.150744
208            1           0.989564          0.719564           1.259564
Looking at the corresponding plot in Figure 2, one can draw the inference that the random sample comes from a normal distribution. Here the null hypothesis is accepted if the graph of the EDF lies within the confidence band for all values of y.
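The bounds of Table 2 and the acceptance rule just described can be checked mechanically. The following sketch is a minimal illustration under the same assumptions as before (normal parameters estimated from the sample, Dα(n) = 0.27 taken from the K-S table):

```python
import numpy as np
from scipy.stats import norm

y = np.sort(np.array([163, 174, 174, 175, 176, 176, 179, 181, 181, 183, 183, 186,
                      188, 188, 189, 190, 191, 192, 192, 195, 195, 197, 197, 208]))
n = len(y)
edf = np.arange(1, n + 1) / n
F = norm.cdf(y, loc=y.mean(), scale=y.std(ddof=0))

D_alpha = 0.27                             # K-S critical value for n = 24, alpha = 0.05
lower, upper = F - D_alpha, F + D_alpha    # 95% confidence band around the expected CDF

# The null hypothesis of normality is retained if the EDF stays inside the band for every y
print("EDF within the K-S band everywhere:", bool(np.all((edf >= lower) & (edf <= upper))))
```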
4. Problem in the Bands and its Correction
The band itself, however, has some defects. The following points may be noted: (a) the band has equal width throughout the graph, although the tail regions are known to be more sensitive than the middle, so the plot makes the band unnecessarily broad in the tails; (b) the ordinates of the band are greater than one in some cases and negative in others. Keeping the first issue in mind, Bickel and Doksum (1977) improved the K-S statistic by dividing it by the factor {F(y)(1 - F(y))}^1/2. This factor acts as a variance equalizer, and the band thus generated is slightly wider in the middle and much narrower at the tails compared with the first band. Accordingly, the following expression may be derived.

Figure 2: An EDF plot along with the K-S bounds
[Figure 2 shows the EDF step function, the expected normal CDF and the K-S bounds plotted against head length; x-axis: length of heads in mm, y-axis: probability.]

1 - α = P[Sup_y |Fn(y) - F(y)| / {F(y)(1 - F(y))}^1/2 ≤ Dα(n)]
      = P[|Fn(y) - F(y)| / {F(y)(1 - F(y))}^1/2 ≤ Dα(n) ∀ y]
      = P[-Dα(n) ≤ (Fn(y) - F(y)) / {F(y)(1 - F(y))}^1/2 ≤ Dα(n) ∀ y]
      = P[F(y) - {F(y)(1 - F(y))}^1/2 Dα(n) ≤ Fn(y) ≤ F(y) + {F(y)(1 - F(y))}^1/2 Dα(n) ∀ y].

Thus, [F(y) - {F(y)(1 - F(y))}^1/2 Dα(n), F(y) + {F(y)(1 - F(y))}^1/2 Dα(n)] is the 100(1 - α)% Doksum confidence band for Fn(y). Table 3 shows the calculated values of the expected CDF and the corresponding Doksum bounds for α = 0.05; a short computational sketch of these bounds follows the table.

Table 3: EDF and Doksum bounds of the data
Head length    EDF         Expected F(y)    Lower Doksum bound    Upper Doksum bound
163            0.041667    0.010201         -0.01693              0.037332
174            0.083333    0.117556          0.030594             0.204518
174            0.125       0.117556          0.030594             0.204518
175            0.166667    0.139089          0.045658             0.23252
176            0.208333    0.16316           0.063392             0.262928
176            0.25        0.16316           0.063392             0.262928
179            0.291667    0.250492          0.133502             0.367482
181            0.333333    0.320177          0.19421              0.446144
181            0.375       0.320177          0.19421              0.446144
183            0.416667    0.396867          0.26477              0.528964
183            0.458333    0.396867          0.26477              0.528964
186            0.5         0.518801          0.383896             0.653706
188            0.541667    0.599821          0.467539             0.732103
188            0.583333    0.599821          0.467539             0.732103
189            0.625       0.638988          0.509309             0.768667
190            0.666667    0.676748          0.550464             0.803032
191            0.708333    0.71277           0.590603             0.834937
192            0.75        0.746771          0.629359             0.864183
192            0.791667    0.746771          0.629359             0.864183
195            0.833333    0.834717          0.734429             0.935005
195            0.875       0.834717          0.734429             0.935005
197            0.916667    0.880744          0.79324              0.968248
197            0.958333    0.880744          0.79324              0.968248
208            1           0.989564          0.962126             1.017002
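The Doksum bounds of Table 3 follow directly from the last expression above. The sketch below is again only an illustration, under the same assumptions as the earlier snippets; it simply tabulates F(y) ± {F(y)(1 - F(y))}^1/2 Dα(n) at each observation:

```python
import numpy as np
from scipy.stats import norm

y = np.sort(np.array([163, 174, 174, 175, 176, 176, 179, 181, 181, 183, 183, 186,
                      188, 188, 189, 190, 191, 192, 192, 195, 195, 197, 197, 208]))
n = len(y)
edf = np.arange(1, n + 1) / n
F = norm.cdf(y, loc=y.mean(), scale=y.std(ddof=0))

D_alpha = 0.27
half = D_alpha * np.sqrt(F * (1 - F))   # variance-equalizing factor narrows the band in the tails
lower, upper = F - half, F + half

for yi, e, f, lo, hi in zip(y, edf, F, lower, upper):
    print(f"{yi:4d}  {e:.6f}  {f:.6f}  {lo:.6f}  {hi:.6f}")
```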
With this correction to the bands, the plot overcomes the disadvantages discussed earlier (refer to Figure 3). The plot can be used as a quick tool to check the normality assumption of any data set.
Figure 3: An EDF plot along with the Doksum bounds
[Figure 3 shows the EDF step function, the expected normal CDF and the Doksum bounds plotted against head length; x-axis: length of heads in mm, y-axis: probability.]
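A plot in the spirit of Figure 3 can be drawn with standard tools. The sketch below uses matplotlib and is not the authors' original software; it simply overlays the EDF step function, the fitted normal CDF and the Doksum bounds:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

y = np.sort(np.array([163, 174, 174, 175, 176, 176, 179, 181, 181, 183, 183, 186,
                      188, 188, 189, 190, 191, 192, 192, 195, 195, 197, 197, 208]))
n = len(y)
edf = np.arange(1, n + 1) / n

mu, sigma = y.mean(), y.std(ddof=0)
grid = np.linspace(y.min() - 5, y.max() + 5, 200)
F = norm.cdf(grid, loc=mu, scale=sigma)
half = 0.27 * np.sqrt(F * (1 - F))      # Doksum half-width for n = 24, alpha = 0.05

plt.step(y, edf, where='post', label='EDF')
plt.plot(grid, F, label='Normal CDF')
plt.plot(grid, F - half, '--', label='Lower bound')
plt.plot(grid, F + half, '--', label='Upper bound')
plt.xlabel('Length of heads in mm')
plt.ylabel('Probability')
plt.title('EDF plot with Doksum bounds')
plt.legend()
plt.show()
```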
The plot can also be used as a formal tool for data analysis, since the goodness of fit can be examined at different levels of significance. Software for drawing the plot, written in any high-level language with graphical capability, removes the burden of the computations involved.

5. Concluding Remarks
The reliability of the plotting technique is tied to the Kolmogorov-Smirnov statistic and the Doksum bounds. The technique should not be considered a new test of goodness of fit; rather, it may be regarded as a visual representation of the K-S test. Although the discussion centres on the normality of the data set, the plot can also be used to check the fit of the data to any continuous distribution. Being entirely dependent on the K-S statistic, the plot rests on the assumptions under which the K-S test is
obtained. The plot may also be used alongside the test to provide a visualization of its results.

6. References
Anderson, T. W. (1958), An Introduction to Multivariate Statistical Analysis, New York: Wiley, 58.
Bickel, P. J. and Doksum, K. (1977), Mathematical Statistics, San Francisco: Holden-Day.
Fisher, N. I. (1983), "Graphical methods in nonparametric statistics: a review and annotated bibliography", International Statistical Review, 51, 25-58.
Kendall, M. G. and Stuart, A. (1991), The Advanced Theory of Statistics, Vol. 2: Inference and Relationship, New York: Hafner Press.
Kotz, S. and Johnson, N. L. (1982), Encyclopedia of Statistical Sciences, Volume 1, John Wiley and Sons, Inc., 85-86.
Rice, J. A. (1995), Mathematical Statistics and Data Analysis, Belmont, California: Duxbury Press.
Satten, G. A. (1995), "Upper and lower bound distributions that give simultaneous confidence intervals for quantiles", Journal of the American Statistical Association, 90, No. 430, 747-752.
Shapiro, S. S. (1990), "How to test normality and other distributional assumptions", American Society for Quality Control, Statistics Division.
Tukey, J. W. (1977), Exploratory Data Analysis (First Edition), Vol. 1, Ch. 5, Reading, Mass.: Addison-Wesley Publishing Co.
Wilk, M. B. and Gnanadesikan, R. (1968), "Probability plotting methods for the analysis of data", Biometrika, 55, 1-17.