Chapter 7

Basics of nonparametric statistics

Everybody believes in the normal approximation, the experimenters because they believe it is a mathematical theorem, the mathematicians because they believe it is an experimental fact!
— attributed to (a jesting) Henri Poincaré (1854-1912)

7.1 Introduction

Many statistical procedures lean, in one way or another, on the assumption that the populations from which the samples are drawn have Gaußian distributions. What to do if such Gaußian conditions are not met? Generally speaking there are two options:

• transform the dataset into one that satisfies the necessary conditions;
• use alternative procedures that do not lean on Gaußian assumptions.

The first option is explained in one of the exercises. The procedures mentioned in the second option are called nonparametric methods. Statistics can thus be split up into two branches: parametric statistics and nonparametric statistics. Procedures leaning on the concept of normality are called parametric methods; the t-tests (all flavours) and ANOVA are examples of such methods. As was already mentioned, when the necessary Gaußian conditions are not met, one may sometimes apply nonparametric alternatives. Some of them are listed below:

PARAMETRIC                NONPARAMETRIC
1-sample t-test           Wilcoxon test
2-sample t-test           Mann-Whitney test
paired samples t-test     Wilcoxon test
1-way ANOVA               Kruskal-Wallis test
2-way ANOVA               ???

The nonparametric methods listed above are all based on the idea of replacing the original data by rank numbers. Besides these there are also methods that do not lean on rank-number techniques, and some of them have no parametric equivalent. The Kolmogorov-Smirnov tests, both the 1-sample and the 2-sample version, are examples of this. Last but not least there are the so-called bootstrap techniques, which are almost universally applicable. Bootstrap techniques are discussed in the last section of this chapter.
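As an aside for readers who want to try this in R: the Kolmogorov-Smirnov tests mentioned above are available through the built-in function ks.test(). A minimal sketch, with made-up data:

x <- rnorm(30, mean = 5, sd = 2)          # made-up sample
ks.test(x, "pnorm", mean = 5, sd = 2)     # 1-sample version: compare to a hypothesized distribution
y <- runif(30, min = 0, max = 10)         # second made-up sample
ks.test(x, y)                             # 2-sample version: compare two samples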

7.2 Wilcoxon’s test

In this section the so-called Wilcoxon test is explained. This test may serve as a fall-back in situations where the conditions of a 1-sample t-test are not met. In Wilcoxon’s test the original data are replaced by rank numbers. To illustrate how this is carried out, imagine the following sample, which could be thought of as a set of weights of 7 randomly chosen dogs:

5.2   4.5   6.2   5.3   9.1   12.3   3.2

There is a wish to test the following null hypothesis:

H0: median in population = 5

The first step is to determine the differences relative to the population median, which was hypothesized to be 5. This leads to

0.2   −0.5   1.2   0.3   4.1   7.3   −1.8

The next step is to order the absolute values of these differences (the differences in positive sense) from low to high:

|0.2|   |0.3|   |−0.5|   |1.2|   |−1.8|   |4.1|   |7.3|

Then assign the differences the rank numbers in the sequence above:

DIFFERENCE   0.2   −0.5   1.2   0.3   4.1   7.3   −1.8
RANK           1      3     4     2     6     7      5


Note (see also page 223) that the sum of all rank numbers is

W = 1 + 2 + 3 + 4 + 5 + 6 + 7 = (1 + 7)/2 × 7 = 28

The last step is to add up the rank numbers belonging to positive differences; their sum is often denoted as W+. In this particular example one has

W+ = 1 + 4 + 2 + 6 + 7 = 20

Similarly the sum of the rank numbers belonging to negative differences is computed to be

W− = 3 + 5 = 8

Of course one has W+ + W− = W = 28. If the median in the population were indeed 5, then, by symmetry, one could expect that W+ ≈ 14 and W− ≈ 14. The decision procedure is based on the discrepancy between W+ and W/2 or, which amounts to the same, the discrepancy between W− and W/2. Under the null hypothesis this discrepancy can be expected to be small. If it turns out to be large, then the null hypothesis is rejected. But what is small and what is large? Probability theory offers a helping hand here by converting the discrepancy into a p-value. This p-value can then be compared to the sacred bound of 0.05 to make a decision. The p-value may be obtained by using suitable software. Alternatively, for small sample sizes, one could look up the critical bounds for W+ or W− in the table on page 430, where some quantiles of the null distribution of W+ and W− are given. For a large sample size n the quantities W+ and W− behave, under the null hypothesis, as if they are Gaußian distributed with

µ = n(n + 1)/4   and   σ² = n(n + 1)(2n + 1)/24

Hence, for large sample sizes the null distribution of W+ and W− is approximately Gaußian with the parameters given above. See for example [26] for the underlying mathematical details. Below is an example with a large sample size.

Example. The weights of 25 randomly chosen cats are determined. There is a wish to test the hypothesis that the median weight in the cat population is 3.5 kilogram. The values of W+ and W− are computed according to the recipe described above. The results are

W+ = 210   and   W− = 115

For the given sample size the values of W+ behave approximately as if they are Gaußian distributed with

µ = 25 × (25 + 1)/4 = 162.5

and

σ² = 25 × (25 + 1) × (2 × 25 + 1)/24 = 1381.25

To convert the outcome for W+ into a p-value, just determine the size of a 2-sided tail in a Gaußian distribution with mean 162.5 and standard deviation √1381.25 = 37.17. This leads to a p-value of 20.1%. When testing at a 2-sided significance level of 5%, the sample does not provide evidence against the hypothesized median of 3.5. The hypothesis is maintained. □
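In R this whole recipe is carried out by the built-in function wilcox.test(). A minimal sketch, using the dog weights from the beginning of this section:

x <- c(5.2, 4.5, 6.2, 5.3, 9.1, 12.3, 3.2)
wilcox.test(x, mu = 5)    # the statistic V reported by R is W+, here 20

The Gaußian approximation of the cat example can likewise be evaluated by hand:

mu    <- 25 * 26 / 4                 # 162.5
sigma <- sqrt(25 * 26 * 51 / 24)     # sqrt(1381.25) = 37.17
2 * pnorm(210, mean = mu, sd = sigma, lower.tail = FALSE)   # about 0.201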

Although the Wilcoxon test does not need a Gaußian population to return correct p-values, there are some conditions to be met. The following two conditions must be fulfilled for the p-values to be reliable:

• the population involved has a continuous distribution;
• the population is distributed symmetrically around its median.

The first condition, that the population distribution be continuous, is there to prevent the occurrence of ties in the data provided by samples. Ties, namely, disturb the underlying probabilistic mechanism of the test. In practice, however, ties are often inevitable. The so-called midrank technique may be used to reduce the damage caused by them. The following example illustrates this midrank technique.

Example. Consider the following series of 10 dog weights:

5.2   4.5   1.9   6.2   5.3   8.1   12.3   3.2   8.1   5.5

There is a wish to test the following null hypothesis:

H0: median in population = 5

The differences relative to this hypothesized population median are

0.2   −0.5   −3.1   1.2   0.3   3.1   7.3   −1.8   3.1   0.5


Sorting their absolute values leads to the sequence

0.2   0.3   0.5   0.5   1.2   1.8   3.1   3.1   3.1   7.3

In this sequence the numbers 0.5 and 3.1 are ties, with frequency 2 and 3 respectively. The tie corresponding to the absolute difference 0.5 occupies the 3rd and 4th place. The midrank technique consists here of assigning the rational rank

(3 + 4)/2 = 3½

to both numbers in the tie. Similarly the three differences in the tie 3.1, occupying the 7th, 8th and 9th place, all get the rank number

(7 + 8 + 9)/3 = 8

All together this leads to the following rank table:

DIFFERENCE   0.2   −0.5   −3.1   1.2   0.3   3.1   7.3   −1.8   3.1   0.5
RANK           1     3½      8     5     2     8    10      6     8    3½

The value of W+ is here

W+ = 1 + 5 + 2 + 8 + 10 + 8 + 3½ = 37½

and the value of W− is

W− = 3½ + 8 + 6 = 17½

As ought to be, they add up to

W+ + W− = 1 + 2 + 3 + 4 + · · · + 9 + 10 = 55

As can be read off from the table on page 430, the 2-sided critical bounds for W− and W+, at a 2-sided significance level of 0.05, are 8 and 55 − 8 = 47. The outcomes for W− and W+ generated by the given 10 dog weights are within these bounds. So, at the given significance level, they do not provide evidence to reject the hypothesized value of 5 for the population median. □

If there are lots of ties in the data then, despite the midrank technique, the p-values returned by Wilcoxon’s test become unreliable.
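R’s rank() function applies the midrank technique by default (ties.method = "average"), so the computation above can be verified directly:

d <- c(5.2, 4.5, 1.9, 6.2, 5.3, 8.1, 12.3, 3.2, 8.1, 5.5) - 5
r <- rank(abs(d))    # midranks: tied values get the average of their ranks
sum(r[d > 0])        # W+ = 37.5
sum(r[d < 0])        # W- = 17.5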

Wilcoxon’s test is named after the American statistician Frank Wilcoxon (1892-1965), who proposed it in a paper published in 1945. Remember that the Wilcoxon test is a test about medians, not about means. The test provides an alternative to the t-test in cases where the population fails to be Gaußian and is, as such, often used. For non-Gaußian populations the Wilcoxon test often shows a higher power than the t-test. In the case of Gaußian populations one may use both the Wilcoxon test and the t-test; the t-test then shows, as a rule, a higher power. In [4], however, it is shown that this advantage is very modest.

The Wilcoxon test can also be applied as a nonparametric alternative to the paired-samples t-test. In such cases the procedure above is applied to the differences within the pairs, and these differences are tested against the null hypothesis that their median be zero. This is very similar to the way a paired-samples t-test may be reduced to a 1-sample t-test on the differences within the pairs, as the sketch below illustrates.
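In R the two routes give identical results; here x and y are hypothetical vectors of paired measurements:

# x and y: hypothetical paired measurements
wilcox.test(x, y, paired = TRUE)    # paired Wilcoxon test
wilcox.test(x - y, mu = 0)          # the same test on the within-pair differences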

7.3 The Mann-Whitney test

The Mann-Whitney test was designed to detect whether the medians of two populations differ significantly. Like Wilcoxon’s test, the Mann-Whitney test is based on the idea of replacing the original data by their rank numbers. To illustrate how the original data are transformed into rank numbers, imagine the following sequence of height measurements in Holland of 3 women and 4 men:

WOMEN        MEN
175 (4)      180 (7)
169 (2)      172 (3)
168 (1)      177 (5)
             179 (6)

In the table above the numbers between brackets are the rank numbers. The sum of all rank numbers (see also page 223) is

T = 1 + 2 + 3 + 4 + 5 + 6 + 7 = (1 + 7)/2 × 7 = 28

The sum of the female rank numbers is

Tfemale = 4 + 2 + 1 = 7

Similarly, the sum of the male rank numbers is

Tmale = 7 + 3 + 5 + 6 = 21

The Mann-Whitney test is in this example a decision procedure whether or not to reject the following null hypothesis on population medians (as to height):

H0: median height in female population = median height in male population


If this hypothesis were true, then one may expect that

Tfemale ≈ (3/7) × 28 = 12   and   Tmale ≈ (4/7) × 28 = 16

The decision procedure is based on the difference between these expected values and the outcomes for Tfemale and Tmale that were generated by the sample. If this difference is large then the null hypothesis is rejected. But what is considered large? The discrepancy between the actual value of Tfemale and what it should (approximately) be under the null hypothesis can be captured in terms of probability theory, and thus it can be converted into a p-value. One may use suitable software to bring this about. Alternatively one could use, for small sample sizes, the tables on page 431 to find the critical bounds for Tfemale and Tmale in their null distribution. For the illustrating scenario sketched above the critical bounds for Tfemale, when testing at a confidence level of 90%, are 6 and 18. The sample generated for Tfemale an outcome of 7. This outcome is not beyond the critical bounds, hence the sample does not provide enough evidence to reject the null hypothesis of equal medians in the populations. Generally, for large sample sizes in a group A and a group B, the quantities TA and TB behave as if they are approximately Gaußian distributed. If the size of sample A is m and that of sample B is n, then the parameters of the Gaußian distributions in question are

µA = m(m + n + 1)/2   and   µB = n(m + n + 1)/2

σ²A = σ²B = mn(m + n + 1)/12

So, for large m and n, the null distribution of TA is approximately Gaußian with the parameters given above. See for example [26] for more mathematical details. Below is an example that illustrates how the above can be exploited.

Example. Suppose that in the introductory scenario there were 15 women and 20 men. How to decide about the null hypothesis if the female rank numbers add up to 207? Under the null hypothesis of equal medians in the male and female populations the quantity Tfemale is approximately Gaußian distributed with parameters

µfemale = 15 × (15 + 20 + 1)/2 = 270

and

σ²female = 15 × 20 × (15 + 20 + 1)/12 = 900

In this Gaußian distribution the outcome 207 has a 2-sided tail of size 0.036. Hence the p-value corresponding to the outcome 207 is 3.6%. When testing at a confidence level of 95% the null hypothesis should be rejected. More colloquially, the medians of the female and the male heights differ significantly. □

The Mann-Whitney test was designed to reveal possible significant differences as to the medians of two populations, and for this it does not need Gaußian populations. However, some conditions are to be met to make the test return reliable p-values:

• the two samples are drawn independently;
• the two populations involved have continuous distributions;
• the two population distributions have the same shape.

The second condition, that the two population distributions be continuous, is there to prevent the occurrence of ties in the data provided by the two samples. As in Wilcoxon’s signed rank test, discussed in the previous section, ties disturb the underlying probabilistic mechanism of the test. The midrank technique, explained on page 210, may be used to reduce the damage caused by ties in the Mann-Whitney test. The third condition will be referred to as the shift condition. As an illustration, the figure below shows a situation where the two populations fail to be Gaußian but where the shift condition is perfectly met:

[Figure: the density of population A and the density of population B, both on the range 0-25; the two densities have exactly the same shape, with B translated to the right of A]


The two densities above are obviously not Gaußian, but the shape of the density of population B is exactly the same as that of population A: it can be obtained by translating the density of population A to the right. Hence the shift condition is met here. This is not the case in the following scenario:

[Figure: the density of population A and the density of population B, both on the range 0-25; the two densities have clearly different shapes]

If in the Mann-Whitney test the shift condition is not met then it may return p-values that strongly conflict with common sense. It may occur, for example, that the two group medians are virtually the same whereas the Mann-Whitney test reports a significant difference between them. See page 224 for some in-depth explanation and some striking examples of this. On the other hand, if the populations fail to be Gaußian, but do meet the shift condition, then the power of the Mann-Whitney test can be up to three times higher than that of a 2-sample t-test. See for example [4], [8], [35], [36] for further reading. The underlying idea of the Mann-Whitney test was first proposed in 1914 by the German pedagogue Gustav Deuchler (1883-1955); in his paper there was, however, a missing term for the variance. Later on (in 1945) the test was independently developed, albeit only for equal sample sizes, by the Irish-American chemist and statistician Frank Wilcoxon (1892-1965). In 1947 the Austrian-American statistician Henry Mann (1905-2000), together with his student Donald Whitney (1915-2001), developed the test for cases where the sample sizes are not necessarily equal. The Mann-Whitney test is in the statistical literature also called Wilcoxon’s rank-sum test or the Wilcoxon-Mann-Whitney test.

7.4 The Kruskal-Wallis test

The so-called Kruskal-Wallis test generalizes the Mann-Whitney test to cases where the medians of more than two groups of data are to be compared. This is comparable to the way a 1-way ANOVA generalizes a 2-sample t-test. In this section the Kruskal-Wallis test will be illustrated by means of an example. Suppose that a researcher wants to compare the effect of four diets on the liver weights of rats. An experiment leads to the following data:

DIET 1:   3.42   3.96   3.87   4.19   3.58   3.76   3.84
DIET 2:   3.17   3.63   3.38   3.47   3.39   3.41   3.56   3.44
DIET 3:   3.34   3.72   3.81   3.66   3.55   3.51
DIET 4:   3.64   3.93   3.77   4.18   4.21   3.88   3.97   3.91

The null hypothesis that is to be tested is:

H0: the median population weight is the same for all diets.

The first step in the test is the replacement of the original data by their rank numbers. This leads to the following table:

DIET 1:    6   25   21   28   12   17   20
DIET 2:    1   13    3    8    4    5   11    7
DIET 3:    2   16   19   15   10    9
DIET 4:   14   24   18   27   29   22   26   23

The sums of the rank numbers for the four groups are given by

S1 = 129   S2 = 52   S3 = 71   S4 = 183

The Kruskal-Wallis test is based on the values of these sums of rank numbers. Generally, in the case where there are g groups and where group i contains ni measurements, one denotes

N = n1 + n2 + n3 + · · · + ng

In these notations the test statistic T in the Kruskal-Wallis test is given by

T = 12/(N(N + 1)) × Σ nᵢ (Sᵢ/nᵢ − (N + 1)/2)²

where the sum runs over the groups i = 1, . . . , g.

If T assumes a large value, that is to say, a value above some critical bound, then the null hypothesis is rejected; if it is small then the null hypothesis is maintained. For small sample sizes the critical bounds are tabulated. For large samples the null distribution of the test statistic T approximately follows a χ²-distribution with g − 1 degrees of freedom. In the particular case of the example with rat livers one has g = 4 and N = 29, so that (N + 1)/2 = 15. The value of T can thus be computed to be

T = 12/(29 × 30) × { 7 × (129/7 − 15)² + 8 × (52/8 − 15)² + 6 × (71/6 − 15)² + 8 × (183/8 − 15)² } = 16.7804

Basing oneself on a χ²-distribution with 3 degrees of freedom, the p-value belonging to this value of T is 0.0008. Hence the null hypothesis must be rejected. Otherwise formulated, the four diets differ significantly as to their effect on rat liver weights.

The Kruskal-Wallis test was designed to reveal possible differences as to the medians of a set of g populations. It may serve as an alternative to a 1-way ANOVA if the populations are not Gaußian. It only does its work properly, however, if the following conditions are met:

• the g samples are drawn independently;
• the g populations involved have continuous distributions;
• the g population distributions have the same shape.

The second condition is there to avoid ties in the data provided by the g samples. When confronted with ties, the midrank technique, discussed on page 210, may be used to reduce the probabilistic damage caused by them. As in the Mann-Whitney test, the third condition is called the shift condition; the test is sensitive to violations of it. In the case of two populations the Kruskal-Wallis test is equivalent to the Mann-Whitney test. The Kruskal-Wallis test was developed by the American statisticians William Kruskal (1919-2005) and Wilson Wallis (1912-1998), who proposed it in a paper published in 1952.
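In R the whole computation is performed by the built-in function kruskal.test(); a sketch for the rat liver data:

diet1 <- c(3.42, 3.96, 3.87, 4.19, 3.58, 3.76, 3.84)
diet2 <- c(3.17, 3.63, 3.38, 3.47, 3.39, 3.41, 3.56, 3.44)
diet3 <- c(3.34, 3.72, 3.81, 3.66, 3.55, 3.51)
diet4 <- c(3.64, 3.93, 3.77, 4.18, 4.21, 3.88, 3.97, 3.91)
kruskal.test(list(diet1, diet2, diet3, diet4))
# Kruskal-Wallis chi-squared = 16.78 on 3 degrees of freedom, p ≈ 0.0008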

7.5 The runs test

The runs test is there to detect clustering in sequences consisting of two symbols. For example, is there a tendency towards clustering in the following 2-symbol sequence?

aaabbbbaaabbbbbbaaaaaaabbabbbbb

The test is based on the number of so-called runs in the sequence. The runs in the sequence above are indicated below:

aaa bbbb aaa bbbbbb aaaaaaa bb a bbbbb

So in this particular case there are 8 runs. A low number of runs can be an indication of clustering. Conversely, a high number of runs can be an indication that there is more ‘mixing’ than could be expected on the basis of randomness. The null hypothesis is that there is neither clustering nor mixing. Thus the null hypothesis in a runs test is

H0: the symbols are ordered randomly.

This hypothesis is tested by counting the actual number of runs in a 2-symbol sequence. It can be proved mathematically that, when putting m symbols a and n symbols b randomly in a sequence, the number of runs may be expected to be close to

2mn/(m + n) + 1

If the number of runs is far below or far above this reference point then the null hypothesis is rejected. For small samples the critical bounds in this decision procedure may be found in the statistical table on page 433.

Example. Suppose the following 2-symbol sequence is given:

aaabbaaabbbaaaaabbabbbb

Is the hypothesis of randomness maintainable when observing such a pattern? There are here 12 symbols a and 11 symbols b and hence, on average, one may expect a number of runs close to

2 × 12 × 11/(12 + 11) + 1 = 12.48

An actual count in the sequence shows that there are 8 runs. When testing at a confidence level of 95% the critical bounds for the number of runs can be read off the table on page 433 to be 7 and 18. The count of 8 runs is not beyond these bounds. The null hypothesis of randomness is maintained. □
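Counting runs is easily done in R with the built-in function rle(), which computes run lengths. A sketch for the sequence of the example:

s <- strsplit("aaabbaaabbbaaaaabbabbbb", "")[[1]]
length(rle(s)$lengths)          # number of runs: 8
2 * 12 * 11 / (12 + 11) + 1     # expected number under randomness: 12.48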


Tables for the runs test only provide critical bounds for scenarios where there are not too many symbols. For large numbers of symbols one may exploit the phenomenon that, under randomness, the number of runs is approximately Gaußian distributed. Otherwise formulated, the null distribution of the number of runs is asymptotically Gaußian. For a sequence of m symbols a and n symbols b the parameters µ and σ² of the Gaußian distribution in question are given by

µ = 2mn/(m + n) + 1

and

σ² = 2mn(2mn − m − n) / ((m + n)²(m + n − 1))

Below is an example showing how these formulas may be used.

Example. Suppose the following 2-symbol sequence is given:

aaabbaabbbaaabbbbbbaabaaaaabbbbaaabbbbbbb

There are here 18 symbols a and 23 symbols b. Under randomness the number of runs in such sequences is approximately Gaußian distributed with parameters

µ = 2 × 18 × 23/(18 + 23) + 1 = 21.20

and

σ² = 2 × 18 × 23 × (2 × 18 × 23 − 18 − 23) / ((18 + 23)² × (18 + 23 − 1)) = 9.69

An actual count in the sequence shows that there are 12 runs. A 2-sided tail of this outcome in a Gaußian distribution with mean µ = 21.20 and standard deviation σ = √9.69 = 3.11 is of size 0.003. Hence the p-value corresponding to the outcome of 12 runs is 0.3%. When testing at a confidence level of 95% the null hypothesis (stating that there is randomness) must be rejected. □
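The Gaußian approximation can be evaluated by hand in base R (packages such as tseries also offer a ready-made runs test):

s <- strsplit("aaabbaabbbaaabbbbbbaabaaaaabbbbaaabbbbbbb", "")[[1]]
runs <- length(rle(s)$lengths)               # 12 runs
m <- sum(s == "a"); n <- sum(s == "b")       # 18 and 23
mu     <- 2 * m * n / (m + n) + 1
sigma2 <- 2 * m * n * (2 * m * n - m - n) / ((m + n)^2 * (m + n - 1))
2 * pnorm(runs, mean = mu, sd = sqrt(sigma2))   # 2-sided p-value, about 0.003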

Note that the runs test differs from the tests in the previous sections in that it does not lean on rank numbers. There is no parametric analogue of the runs test. See for example [15] or [21] for discussions about the power of this test.

7.6 Spearman’s and Kendall’s correlation coefficients

Spearman’s correlation coefficient is a nonparametric version of Pearson’s correlation coefficient; Kendall’s correlation coefficient serves a similar purpose.
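Both coefficients are available in R through cor.test(); a sketch with made-up paired data:

x <- c(1.2, 2.4, 3.1, 4.8, 5.0)   # made-up data
y <- c(2.0, 2.9, 3.5, 4.4, 6.1)
cor.test(x, y, method = "spearman")   # Spearman's rho
cor.test(x, y, method = "kendall")    # Kendall's tau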

7.7 Bootstrap techniques

To illustrate the technique of bootstrapping, imagine a very simple experiment, for example one whose aim is to estimate the mean bodily height of the Dutch adult population. To this end 4 Dutch adults are randomly selected and their heights are measured. The resulting numbers are written down on a sheet of paper:

174   165   182   171

The mean of this particular sequence of numbers is 173, that is to say, the sample mean is 173. This sample mean is taken as an estimate for the population mean. So it is thought that

population mean ≈ 173

Nothing is said about the accuracy of the estimate emerging from this experiment. To get an impression of this accuracy, one could repeat the whole experiment a couple of times and then have a look at how the resulting estimates fluctuate. To fix ideas, one could repeat the experiment for example 5 times. This would result in 5 sheets of paper, each with 4 height measurements on it:

173 174 178 189     174 165 182 171     174 189 174 189     164 181 183 172     164 169 166 163

Each sheet of paper leads to an estimate; in this particular case these are

178.5     173.0     181.5     175.0     165.5

The fluctuations in this sequence of estimates are of such a magnitude that one cannot expect, say, an accuracy of one decimal place. One could get a better impression of the accuracy by repeating the experiment more often, for example 1000 times. This would lead to a sequence of 1000 estimates. A kind of 95% confidence interval could then be extracted from these 1000 estimates by determining their 2.5th and 97.5th percentiles; these percentiles can be regarded as the endpoints of a 95% confidence interval. In a picture this idea could be summarized as follows:

[Figure: the estimates drawn as balls along a number line; the number a marks the 2.5th percentile, b the 97.5th percentile]

In the figure above the balls represent the estimates (for graphical reasons fewer than 1000 are depicted). The number a is the 2.5th percentile and b the 97.5th percentile. In between a and b lie 95% of the estimates for the mean. The interval (a, b) can now be regarded as a kind of 95% confidence interval for the population mean. This way of extracting a confidence interval from a sequence of estimates is called the percentile method.

Of course in practice it is usually not possible to repeat an experiment 1000 times. However, as a kind of surrogate repetition one could resort to resampling techniques. To illustrate this, consider the original dataset again. Draw at random, with replacement, 4 numbers from the original dataset and write them down on a piece of paper. A new surrogate dataset has been created by this. Such a dataset is called a bootstrap dataset or a bootstrap sample. It is important that the numbers are drawn with replacement; otherwise, namely, all bootstrap datasets would be the same, apart from the order of the numbers. As an illustration, consider the following two datasets:

174 174 174 182          174 165 182 171

The left dataset is a bootstrap sample of the right one; the right one is not a bootstrap sample of the left. The figure below illustrates the creation of 5 bootstrap samples:

DATASET: 174 165 182 171

SOME POSSIBLE BOOTSTRAP SAMPLES:
171 174 174 182
174 165 182 171
165 182 182 174
174 174 182 182
165 165 165 165


Now consider the bootstrap samples as if they were obtained by repeating the experiment. If you happen to be interested in the mean, then determine the mean of each bootstrap dataset and obtain the sequence of so-called bootstrap means:

175.25     173.00     175.75     178.00     165.00

Of course this sequence is not long enough to extract an interval estimate of coverage 95% from it. To bring this about, one could create more bootstrap datasets, for example 1000. When writing down each bootstrap dataset on a piece of paper, one could collect the papers and form a kind of book with them. The collection of 1000 bootstrap datasets will thus form a book that consists of 1000 pages. To keep this picture in mind, a collection of bootstrap datasets will be called a bootstrap book. Now determine the mean of each page in the bootstrap book. This results in 1000 bootstrap means. This sequence of 1000 numbers will be called an excerpt of the bootstrap book with respect to the mean. The excerpt (the 1000 bootstrap means) can be used to form an interval estimate of coverage 95% for the population mean: just determine the 2.5th and the 97.5th percentile. These percentiles can serve as the endpoints of a 95% confidence interval. Repeating the figure given earlier, this idea can be summarized as:

[Figure: the bootstrap means drawn as balls along a number line; the number a marks the 2.5th percentile, b the 97.5th percentile]

In the figure above the balls now represent the bootstrap means (again, for graphical reasons fewer than 1000 are depicted). The number a is the 2.5th percentile and b the 97.5th percentile. In between a and b lie 95% of the bootstrap means. The interval (a, b) can be looked upon as a kind of interval estimate of coverage 95% for the population mean. The amazing thing is that this works! And it does not only work with the mean; it also works with a lot of other statistical quantities, for example with the median. The general principle of bootstrapping can be captured in the form of a simple graphical scheme:

DATASET → BOOTSTRAP BOOK → EXCERPT → CONFIDENCE INTERVAL


Nowadays the creation of a bootstrap book can easily be performed by mathematical or statistical software packages. The excerpt contains, for each page of the bootstrap book, the mean, the median, regression coefficients, or whatever quantity you happen to be interested in. The confidence intervals are extracted from the excerpt, for example by means of the percentile method; a sketch in R follows below.
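A minimal sketch of the percentile method in R, using the four heights from the beginning of this section and a bootstrap book of 1000 pages:

x <- c(174, 165, 182, 171)
B <- 1000                                                   # pages in the bootstrap book
excerpt <- replicate(B, mean(sample(x, replace = TRUE)))    # the bootstrap means
quantile(excerpt, c(0.025, 0.975))                          # percentile-method 95% interval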

In bootstrapping it is important that the original dataset is not too small and that the bootstrap book contains enough pages. It is difficult, if not impossible, to give general rules for this. However, two rough rules of thumb could be kept in mind when applying bootstrap techniques:

• the original dataset must contain at least 12 rows;
• the bootstrap book must contain at least 500 pages.

There are examples where bootstrap techniques don’t work, even in cases where the rules of thumb given above are satisfied. The problems then usually manifest themselves in the final stage, where percentiles of the excerpt must be determined in order to extract a confidence interval. If, for example, the excerpt does not contain enough mutually different elements, then there is a problem: if the excerpt shows a lot of ties, it may be that the 2.5th and the 97.5th percentile do not really make sense. This is the very reason that bootstrap techniques need continuous data to work well. Besides the kind of problems discussed above, there is one more point that must be mentioned here. So far the interval estimates have been extracted from the excerpt by means of the percentile method. The big advantage of this method is its simplicity. Unfortunately it is not the most accurate way to extract a confidence interval from the excerpt: the resulting interval estimates tend to be a little bit too narrow, so that their coverage tends to be a bit lower than the desired 95%. Other methods have been developed, which have led to the construction of so-called bias corrected (BC) and bias corrected and accelerated (BCa) intervals. Careful investigations revealed that the BCa intervals have the best properties. See for example [5], [6] for some more documentation.
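The boot package that ships with standard R distributions implements these improved intervals; a sketch with a made-up sample:

library(boot)
x <- rnorm(50, mean = 175, sd = 10)    # made-up sample
b <- boot(x, statistic = function(d, i) mean(d[i]), R = 1000)
boot.ci(b, type = c("perc", "bca"))    # percentile and BCa intervals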

7.8 Encores

Encore: Adding up arithmetic sequences

The German mathematician Carl Friedrich Gauß (1777-1855), whose name has been mentioned an uncountable number of times in the previous sections, is today considered by many the greatest mathematician that ever lived. There is a nice story about him as a 7-year-old boy at school. It is about additions of the type

1 + 2 + 3 + 4 + 5 + · · · + 99 + 100     (∗)

His teacher, namely, asked the class to sum all numbers from 1 to 100. The teacher may have wished to get some work done, or get some sleep, or whatever. The young Gauß, however, needed no more than a few seconds to write ‘5050’ on his slate. He noticed that 1 + 100 = 101, 2 + 99 = 101, 3 + 98 = 101, . . . form a sequence of 50 pairs, which reduces the calculation to 50 × 101 = 5050. The trick is generally applicable to sums of type (∗); it shows that one generally has

1 + 2 + 3 + 4 + · · · + n = n(n + 1)/2

Sums of this type frequently occur in statistical tests based on rank numbers.

Encore: The Mann-Whitney test does not always compare medians

In a Mann-Whitney test it may occur that, while the two medians are almost equal, the test returns a significant p-value. Consider, for example, the 2 × 9 measurements sketched in the figure below, where the black balls represent group A and the white ones group B.

[Figure: 18 measurements on a number line; group A (black balls) occupies positions 1-4 and 10-14, group B (white balls) positions 5-9 and 15-18; the group medians mA and mB lie right next to each other]

As can be read off from the figure, the two group medians mA and mB are close to each other. The rank sums can also be distilled from the figure. For group A one has:

TA = (1 + 2 + 3 + 4) + (10 + 11 + 12 + 13 + 14) = 70

Similarly one has for group B the rank sum

TB = (5 + 6 + 7 + 8 + 9) + (15 + 16 + 17 + 18) = 101


So, in spite of the fact that the two medians are very close to each other, the two rank sums TA and TB differ considerably. By actually running the Mann-Whitney test this difference may be translated into a p-value of 0.1903. By increasing the group sizes, thereby maintaining the pattern shown in the figure above, it is easy to construct scenarios where the two medians are almost the same, but where Mann-Whitney returns a p-value below 0.05. The following is an example of this.

Example. Consider the following two series of 19 blood sugar values:

A: 4.30 4.40 4.68 4.70 4.77 4.79 4.81 4.87 4.93 6.01 6.10 6.25 6.33 6.35 6.47 6.61 6.67 6.79 6.90

and

B: 5.10 5.21 5.33 5.39 5.53 5.65 5.67 5.75 5.90 5.99 7.07 7.13 7.19 7.21 7.23 7.30 7.32 7.60 7.70

The two medians are here 6.01 and 5.99 (for A and B respectively) and, compared to the fluctuations in the series of measurements, their difference is minor. The rank sums for the series A and B, however, differ considerably. They are here

TA = 290   and   TB = 451

When running the Mann-Whitney test the difference in rank sums is converted into a p-value of 0.01825. Hence the two medians differ significantly in this test. Note that this example may be modified so as to have the numbers 5.99 and 6.01 replaced by, for example, 5.9999 and 6.0001 without changing the p-value returned by the Mann-Whitney test. Thus the example may be transformed into one in which the two medians are virtually the same, while maintaining the p-value of 0.01825. □
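Running this example in R confirms the phenomenon:

A <- c(4.30, 4.40, 4.68, 4.70, 4.77, 4.79, 4.81, 4.87, 4.93, 6.01,
       6.10, 6.25, 6.33, 6.35, 6.47, 6.61, 6.67, 6.79, 6.90)
B <- c(5.10, 5.21, 5.33, 5.39, 5.53, 5.65, 5.67, 5.75, 5.90, 5.99,
       7.07, 7.13, 7.19, 7.21, 7.23, 7.30, 7.32, 7.60, 7.70)
median(A); median(B)    # 6.01 and 5.99: virtually equal
wilcox.test(A, B)       # nevertheless p = 0.01825, as reported above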

The patterns in the data described above may very well occur when, for example, sampling from two populations that have densities as sketched in the figure below:

[Figure: the density of population A on the range 0-8 and the density of population B on the range 5-13; each density consists of two bell shapes, that of B being a mirrored and shifted copy of that of A]


The density of population A consists, roughly speaking, of two bell shapes. The right bell shape is a high peak of width 1, centered around the number 6.5. Although this is hard to see with the naked eye, the peak has somewhat more area under the curve than the left bell shape. In between the numbers 5 and 6 there is very little probability mass. For this reason the median of population A is close to the number 6. The density of population B can be obtained from that of population A by flipping it horizontally and translating it to a position that makes the high peak centered around the number 5.5. The median of population B will also be close to the number 6. Thus the two medians will be close to each other and positioned symmetrically around the number 6. The two bell shapes of the one population are precisely where the other has no data. Thus, when sampling from these populations, one may expect patterns such as described in the examples above. Note that, although the two population densities are congruent in the sense of Euclidean geometry, the shift condition is not met here. The evil described in this section is not to be expected when sampling from populations that do meet the conditions of the Mann-Whitney test. However, even then it cannot be excluded.

7.9 Exercises

Exercises without SPSS or R

1. The table below presents BMI measurements of 7 children. For every child the BMI was measured at two different points in time.

CHILD   BMI-1   BMI-2
  1      29.1    27.3
  2      28.6    24.2
  3      27.8    28.0
  4      30.3    26.7
  5      27.5    26.5
  6      28.3    26.8
  7      29.0    27.0

Run a Wilcoxon test manually. Suitable tables can be found on page 430.

2. Consider the measurements in the table of the previous exercise as being unpaired and apply a Mann-Whitney test manually.

3. The intervals of time in between epilepsy attacks were measured for one patient. The periods were coded L or S, standing for a ‘long’ or a ‘short’ interval of time. Thus the following sequence emerged:

LLLSSLSSSLLLLLSS


Does this sequence provide enough evidence to reject the hypothesis of random ordering?

4. Given is the following dataset, presenting 6 measurements as to blood pressure:

132   157   143   125   161   137

Create a bootstrap book consisting of 3 pages and determine the excerpt with respect to the median. Also determine the excerpt with respect to the variance.

Exercises in SPSS

1. For each of the datasets 7.1a.sav and 7.1b.sav do the following:
a) Check whether the conditions for a 2-sample t-test are met.
b) Allowed or not, run a 2-sample t-test.
c) Run a Mann-Whitney test by following the menu path:
Analyze → Nonparametric Statistics → 2 Independent Samples → Mann-Whitney...
d) Compare the results in b) and c).

2. Open dataset 7.2.sav in SPSS. Apply a runs test to see whether the numbers 1 and 2 in the sequence x are clustering or mixing up significantly.

3. For the datasets 7.3a.sav, 7.3b.sav and 7.3c.sav do the following:
a) Check whether the two ANOVA conditions are met.
b) Allowed or not, run an ANOVA.
c) Run the Kruskal-Wallis test by following the menu path:
Analyze → Nonparametric Tests → K Independent Samples...
d) Compare the p-values found in b) and c).


4. In the dataset 7.4.sav the column income presents a (would-be) sample as to the income of 100 randomly chosen students in Utrecht. The aim of this exercise is to make you aware of the risks involved when using transformation techniques in parametric statistics.
a) Compute the mean income in this sample.
b) Ignoring all conditions, compute an interval estimate of coverage 95% for the mean income of all students in Utrecht.
c) Is the interval found in b) reliable?
d) Transform the column income into a column logincome by taking the logarithms of the values in income.
e) Compute an interval estimate of coverage 95% for the mean of the log-transformed student incomes in Utrecht.
f) On the interval found in e), carry out a backward transformation to obtain an interval estimate for the mean income of all students in Utrecht.
g) Is the interval found in f) reliable?

Exercises in R

1. Consider the data in the following table as being paired.

  x      y
0.80   1.15
0.83   0.88
1.89   0.90
1.04   0.74
1.45   1.21

Run Wilcoxon’s signed rank test.

2. This exercise leans on the built-in dataset sleep.
a) Load the built-in dataset sleep into your workspace and get some info about it.
b) Make boxplots of the two groups in one figure.
c) Run a Mann-Whitney test.
d) Was a t-test allowed here?


3. Get the built-in dataset faithful and attach the headers of this dataset to your workspace. Determine the median of the variable eruptions. Then recode the variable in symbols 0 and 1, where 1 means that the duration of the eruption was above the median and 0 that it was below. Apply a runs test to see whether there is significant clustering in the 0-1 sequence. Run a similar analysis on the variable waiting.

4. Load the built-in dataset water.RData by passing the command data(water) and attach the headers of this dataset to the search paths of R. The dataset describes mortality and drinking water hardness for 61 cities in England and Wales. The column mortality is the averaged annual mortality per 100000 male inhabitants and the column hardness presents the calcium concentration (in parts per million). The meaning of the remaining columns is self-explanatory.
a) Make a boxplot in which the hardness of the water in northern and southern regions is compared.
b) Compare the hardness of water in northern and southern regions in a t-test. Are the conditions for a t-test met?
c) Run a Mann-Whitney test as an alternative to the t-test in part b).

5. Load the built-in dataset morley and get some info about it.
a) Are the conditions for a 1-way ANOVA (with factor Expt) met here?
b) If your answer in part a) was negative, then run a Kruskal-Wallis test as an alternative to a 1-way ANOVA.

6. In the dataset income.RData the column income presents a (would-be) sample as to the income of 100 randomly chosen students in Utrecht. The aim of this exercise is to make you aware of the risks involved when using transformation techniques in parametric statistics.
a) Compute the mean income in this sample.
b) Ignoring all conditions, compute a 95% confidence interval for the mean income of all students in Utrecht.
c) Is the interval found in b) reliable?
d) Transform the column income into a column logincome by taking the logarithms of the values in income.
e) Compute a 95% confidence interval for the mean of the log-transformed student incomes in Utrecht.
f) On the interval found in e), carry out a backward transformation to obtain an interval estimate for the mean income of all students in Utrecht.


g) Is the interval found in f) reliable?
h) Compute a 95% confidence interval for the mean income through bootstrapping. Also make a histogram of the bootstrap means.
i) The same question as above, but now for the median.

7. Generate via the function rnorm() a sample x of size 50 from a Gaußian distributed population with mean 175 and standard deviation 10.
a) Compute a parametric 95% confidence interval for the population mean.
b) Compute a bootstrap 95% confidence interval for the population mean.
c) Compute a bootstrap 95% confidence interval for the population variance.

7.10 Answers to the exercises

Answers to the exercises without SPSS or R

Answers to Exercise 1
The list of differences with corresponding rank numbers is

DIFFERENCE   1.8   4.4   −0.2   3.6   1.0   1.5   2.0
RANK           4     7      1     6     2     3     5

From the above it can be distilled that

W+ = 4 + 7 + 6 + 2 + 3 + 5 = 27   and   W− = 1

From the table for Wilcoxon’s signed rank test it can be read off that (when testing at α = 0.05) the left critical bound is 2. It follows that the right critical bound is 28 − 2 = 26. The computed value of W+ is 27, which is beyond the right bound. The null hypothesis (the median of the differences is zero) must therefore be rejected.


Answers to Exercise 2
A table of rank numbers would be

BMI-1   RANK      BMI-2   RANK
 29.1    (13)      27.3    (6)
 28.6    (11)      24.2    (1)
 27.8     (8)      28.0    (9)
 30.3    (14)      26.7    (3)
 27.5     (7)      26.5    (2)
 28.3    (10)      26.8    (4)
 29.0    (12)      27.0    (5)

The sum T1 of the rank numbers belonging to BMI-1 is

T1 = 13 + 11 + 8 + 14 + 7 + 10 + 12 = 75

In a 2-sided rank-sum test, at a confidence level of 95%, the critical bounds for T1 are 36 and 69. The computed value of 75 is beyond the right bound. The null hypothesis in the test (equal medians) must therefore be rejected. More colloquially, the medians of BMI-1 and BMI-2 differ significantly.

Answers to Exercise 3
The sequence consists of 9 symbols L and 7 symbols S, and it contains 6 runs. Under randomness one could expect (on average) a number of runs close to

2 × 9 × 7/(9 + 7) + 1 = 126/16 + 1 = 8.875

The counted number of 6 runs is indeed below this expected number. Is there enough evidence to reject the null hypothesis (randomness)? The critical bounds, when testing at a confidence level of 95%, can be read off the table on page 433 to be 4 and 14. The actual count of 6 runs is not beyond these bounds. The null hypothesis in this test must therefore be maintained. That is to say, there is not enough evidence to reject randomness.

Answers to the exercises in SPSS

Answers to SPSS-exercise 1
a) As to dataset 7.1a.sav, the conditions to run a 2-sample t-test are met. As to dataset 7.1b.sav, the conditions to run a 2-sample t-test are not met: there is no homogeneity of variance. In a 2-sample t-test a p-value can be computed that is corrected for inhomogeneity of variance.
b) When running a t-test on the data in 7.1a.sav the test returns a p-value of 0.040. When running a t-test on the data in 7.1b.sav the test returns a (corrected) p-value of 0.017.
c) When running a Mann-Whitney test on the data in 7.1a.sav a p-value of 0.044 is returned. When running a Mann-Whitney test on the data in 7.1b.sav a p-value of 0.048 is returned.
d) In cases where the conditions for a 2-sample t-test are met, the t-test and the Mann-Whitney test usually return p-values that do not differ too much.

Answers to SPSS-exercise 2
A runs test with a user-specified cut point of 1.5 returns an approximate p-value of 0.261. An exact p-value is 0.247. The data provide no evidence against the null hypothesis of randomness.

Answers to SPSS-exercise 3
The p-values in the various tests are summarized below:

DATASET     LEVENE’S TEST   KS-TEST   ANOVA   KW-TEST
7.3a.sav        0.338         0.990    0.075    0.107
7.3b.sav        0.804         0.135    0.041    0.017
7.3c.sav        0.018         0.933    0.035    0.036

The p-values in the KS-test are approximate. The exact p-values are 0.983, 0.120 and 0.909 respectively.

Answers to SPSS-exercise 4
a) The sample mean is 685.
b) The confidence interval in question is (529; 841).
c) The interval in b) is unreliable because the normality conditions are not met.
e) When using logarithms with base 10 the interval on this side of the transformation is (2.57; 2.72); when using natural logarithms the interval is (5.91; 6.27).
f) A backward transformation leads to the interval (10^2.57; 10^2.72) = (368; 530).


g) The interval found in f) does not even contain the point estimate from a), so it cannot be considered a margin around that point estimate. The interval in f) could serve as an estimate for the geometric mean of the population, not for the arithmetic mean.

Answers to the exercises in R

Answers to R-exercise 1
To run a Wilcoxon signed rank test, give the command

wilcox.test(x, y, paired=TRUE)

The test returns a p-value of 0.625.

Answers to R-exercise 2
To run a Mann-Whitney test, give the command

wilcox.test(extra~group)

The test returns a p-value of 0.06933. A 2-sample t-test was also allowed here.

Answers to R-exercise 3
First determine the median of eruptions:

m <- median(eruptions)