2.12 Appendix: Descriptive Statistics with R

by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004

Virtually every statistical computing package can present graphical displays of data and report summary statistics. We demonstrate how to use R to produce some of the graphs and summary statistics discussed in Chapter 2. Consider first the red pine seedling data. We can enter the data in R as

> redpine = c(42, 23, 43, 34, 49, 56, 31, 47, 61, 54, 46, 34, 26)
> redpine
 [1] 42 23 43 34 49 56 31 47 61 54 46 34 26
> length(redpine)
[1] 13

The first command line creates the redpine object, while the second line prints it so that we can check it. It never hurts to check the length of an object as well, which is just the sample size n. We can produce a stem and leaf display using the stem command in R.

> stem(redpine)

  The decimal point is 1 digit(s) to the right of the |

  2 | 36
  3 | 144
  4 | 23679
  5 | 46
  6 | 1

Notice that there are 5 categories for these 13 numbers, with stems for the 10s digit and leaves for the 1s digit. The stem and leaf plot can be scaled to have more stems by changing the scale option:

> stem(redpine, scale = 2)

  The decimal point is 1 digit(s) to the right of the |

  2 | 3
  2 | 6
  3 | 144
  3 | 
  4 | 23
  4 | 679
  5 | 4
  5 | 6
  6 | 1

The stem command always produces ordered stem and leaf displays. The command for constructing histograms, hist(redpine), can take arguments that control the specific form of the display. As we noted, this is particularly important for histograms. Here is a density scaled histogram (prob=TRUE) of the red pine data: > hist(redpine, prob = TRUE, main = "Heights (cm) of red pine seedlings")

[Figure: density-scaled histogram of the red pine heights, titled "Heights (cm) of red pine seedlings", with heights from 20 to 70 cm on the horizontal axis and density on the vertical axis.]

Currently there is no dotplot command in R. However, you can get essentially the same thing by using a large value for the breaks argument to the hist command. We leave further details to the reader.

> hist(redpine, breaks = 100, main = "dot plot of red pine seedling heights (cm)")

[Figure: histogram of redpine with breaks = 100, titled "dot plot of red pine seedling heights (cm)"; the frequency axis runs only from 0 to 2, so the display works like a dot plot.]

The easiest method for obtaining useful numerical summary statistics is to use the summary command, which includes the commonly used measures of location (quartiles, median and mean) as well as the minimum and maximum values.

> summary(redpine)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     23      34      43      42      49      61 

The R commands mean(redpine) and median(redpine) give you the corresponding values on their own. The summary command includes some raw ingredients for measures of spread. There are separate commands for the SD, IQR and range:

> sd(redpine)
[1] 11.75443
> IQR(redpine)
[1] 15
> diff(range(redpine))
[1] 38

The range command returns the smallest and largest values, and their difference is calculated by the diff command. The interquartile range (IQR) is the difference between the third and first quartiles. Recall that the variance is the square of the standard deviation, which you can verify:

> var(redpine)
[1] 138.1667
> sd(redpine)^2
[1] 138.1667

You can reorganize these measures of spread by hand using your favorite word processor, or you can use R to help. For instance,

> c(SD = sd(redpine), IQR = IQR(redpine), Range = diff(range(redpine)))


      SD      IQR    Range 
11.75443 15.00000 38.00000 

Let us briefly reconsider the data set giving the watershed areas of 18 lakes in Northern Wisconsin. Here is the stem and leaf display.

> watershed = c(0.3, 0.3, 0.4, 0.5, 0.5, 1.3, 1.6, 1.8, 2.6, 2.6,
+     2.6, 4.7, 9.1, 9.1, 12.9, 19.4, 33.7, 176.1)
> stem(watershed)

  The decimal point is 2 digit(s) to the right of the |

  0 | 00000000000011123
  0 | 
  1 | 
  1 | 8

R does some rounding of the data values so, for example, all the observations with area below 10 km² are represented in the display with stem "0" and leaf "0". As noted earlier in this chapter, these data are highly skewed. It is quite straightforward to transform these data, say with the logarithm (to the base 10) using the log10 command. (Logarithms to the base e are produced using the log command.) We proceed as follows.

> logshed = log10(watershed)
> stem(logshed)

  The decimal point is at the |

  -0 | 55433
   0 | 1234447
   1 | 00135
   2 | 2

Alternatively, we could have done this in one step:

> stem(log10(watershed))

  The decimal point is at the |

  -0 | 55433
   0 | 1234447
   1 | 00135
   2 | 2


4.4 Appendix: Computing Probabilities in R by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004 R can be used to compute probabilities of interest associated with numerous probability distributions. In this section we describe its use for calculating probabilities associated with the binomial, Poisson, and normal distributions. First, we discuss computing the probability of a particular outcome for discrete distributions. (This can also be used with continuous distributions for computing probability densities but we will not be concerned with that application here.) R begins these commands with the letter d for distribution or density, namely dbinom and dpois. Consider again the calf example with abnormal clotting discussed near the beginning of the section on the binomial distribution. The random variable is X ∼ B(4, p). In the text we indicated that if p = 0.3, we can calculate P (X = 4) = 0.0081. We can use the dbinom command to perform this calculation. > dbinom(4, 4, 0.3) [1] 0.0081 The number 4 after the dbinom command indicates that X = 4 is the value for which the probability is required. The second and third arguments specify the n and p parameters of the binomial distribution, 4 and 0.3 respectively in this case. The resultant probability is 0.0081 as expected. In the text we also considered the case X ∼ B(8, p) and computed P (X = 6) with p = 0.3. We do the same with R. > dbinom(6, 8, 0.3) [1] 0.01000188 Here is an example of the use of the dpois command with a Poisson random variable. Recall the cake example. Here the random variable is X ∼ P o(2). Calculating by hand, we found P (X = 3) to be 0.1804. Here is the result using R: > dpois(3, 2) [1] 0.1804470 As noted, since the Poisson is defined by only a single parameter, we only need that parameter (λ) in the second argument. R starts distribution commands with the letter p, for probability, to compute P (X ≤ x) for any value x. We illustrate this command with the binomial and normal distribution. For the binomial situation, consider again the calf example. In Section 4.1 we computed P (X ≤ 1) for X ∼ B(8, 0.3). Using the pbinom command we proceed as follows. 1

> pbinom(1, 8, 0.3)
[1] 0.2552983

The input is very similar except that pbinom replaces dbinom. Note that since the binomial distribution is discrete, there is a difference between P(X < 1) and P(X ≤ 1). The pbinom command always includes the upper limit.

When we consider normal random variables, we need to point out a distinction between the pnorm command in R and the tables in this book. The pnorm command finds the probability to the left of a particular value, while the normal tables in this book give probabilities to the right of a particular value. (Probabilities to the right of a given value are of direct utility in testing as we will see in Chapter 6.) You can use the lower.tail=FALSE argument to get the upper tails with pnorm, as shown below. Recall the topsoil example in which X represents the pH of a random sample of topsoil in the vicinity of Oxford. In this example X ∼ N(7.1, 0.6²). As always, the parameters in this notation are the mean, µ, and the variance, σ². We wish to find P(X ≤ 6.2). In R we enter:

> pnorm(6.2, 7.1, 0.6)
[1] 0.0668072
> pnorm(6.2, 7.1, 0.6, lower.tail = FALSE)
[1] 0.9331928

The second and third arguments are µ and σ, respectively. Thus, we must be careful to type in the standard deviation and not the variance. Note that with the normal distribution (as with any continuous distribution) P(X ≤ 6.2) = P(X < 6.2). When the argument lower.tail is set to FALSE, we get P(X > 6.2) in the second command line.

Finally, we can make the inverse calculation in which we calculate the point x so that P(X ≤ x) is equal to some given probability. The R commands for this begin with the letter q for quantile. This command is awkward to use with a discrete distribution because of the discrete jumps in probability. However, it is easy and useful to use the qnorm command with the continuous normal distribution. For example, consider again the GRE example with X ∼ N(485, 123²). We wish to find x so that P(X ≤ x) = 0.25. We proceed as follows.

> qnorm(0.25, 485, 123)
[1] 402.0378

The answer from R is not rounded to the nearest integer as was done in our example. This recalls our caution that a normal model may not be totally correct for the GRE situation even though it gives results that are useful in practice.
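The p and q commands are inverses of one another, which gives a quick check on such calculations; for example, feeding the GRE quantile back into pnorm recovers the probability:

> pnorm(qnorm(0.25, 485, 123), 485, 123)   # returns 0.25, the probability we started with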

5.7 Appendix: Using R for Sampling Distributions by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004 In Section 5.3 on the Central Limit Theorem, we presented a computer simulation to illustrate the CLT. Computer simulation is a very useful tool in statistics; its importance continues to grow as the computer opens up more approaches for describing and analyzing data. Roughly speaking, computer simulation is important in statistics in two ways. The first is to demonstrate results like the CLT. The second is to describe, through simulation, distributions that cannot be described explicitly by formulas. (This includes methods like the bootstrap and Markov chain Monte Carlo simulation.) In this volume we will restrict attention to the demonstration aspect of computer simulation. The third subsection of this Appendix shows how to compute probabilities for the distribution of the sample variance for normal data. This is similar in spirit to the Appendix of Chapter 4.

5.7.1 Simulations using a Discrete Distribution

Let us first consider a simulation example that illustrates Var(X̄) = σ²/n. Consider a discrete random variable with probability function given by the following.

  x    p(x)
  1    0.6
  3    0.3
  5    0.1

By using the methods from Sections 3.6 and 3.7, we find that E(X) = 2.0 and Var(X) = 1.8. Now let us use simulation to generate 500 values from this distribution. The commands to perform this simulation are given below. The sample command instructs R to generate 500 random values and place them in the object draws. The first argument is the possible x values, while the prob argument specifies their probabilities. The replace argument is set to TRUE as we want to sample with replacement.

> x = c(1, 3, 5)
> px = c(0.6, 0.3, 0.1)
> draws = sample(x, size = 500, replace = TRUE, prob = px)

Here is a histogram of the 500 values in object draws. (The argument breaks is used to allow easy comparison of the 3 histograms we present in this section based on this discrete distribution.)

> hist(draws, breaks = seq(1, 5, by = 0.25), main = "500 discrete draws")


[Figure: histogram of the 500 discrete draws, with bars only at the values 1, 3 and 5; values 1 to 5 on the horizontal axis and frequency on the vertical axis.]
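Rather than reading counts off the histogram, the draws can also be tabulated directly; a small sketch (the counts vary from run to run):

> table(draws)       # counts of the values 1, 3 and 5 among the 500 draws
> table(draws)/500   # the corresponding proportions, near 0.6, 0.3 and 0.1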

From the probability distribution we would expect 60% of the observations to have value "1", 30% to have value "3" and 10% to have value "5." Thus, we would expect 300, 150, and 50 observations for each of these three numbers. Our simulated histogram is close to this, although the lowest category (which represents the "1's") and the highest category have somewhat fewer observations than expected, whereas the middle category has somewhat more than expected. If we were to simulate another 500 observations, the number of observations in each category would be somewhat different, although the general shape of the histogram would be the same. Although we know by construction that the variance of this distribution is 1.8, we can find the variance of the 500 simulated observations by having R compute the variance.

> var(draws)
[1] 1.819238

Thus, the variance of the 500 simulated observations is 1.819, close to the theoretical value. Let us now use simulation to generate 500 simulated values of x̄, where x̄ is the mean of 4 observations from the same discrete distribution. It is straightforward to perform the simulation using the following commands.

> draws = sample(x, size = 4 * 500, replace = TRUE, prob = px)
> draws = matrix(draws, 4)
> drawmeans = apply(draws, 2, mean)

The commands in the first and second lines generate an object named draws with 4 rows of numbers, each with 500 values from the discrete distribution used above. (In total we have generated 4*500 = 2000 observations. Notice that we wrote over the original draws to save space.) The 500 values are in columns numbered from 1 to 500. Think of each column as having 4 observations from the distribution. The command in the third line applies the mean command to every column, using the apply command. The 500 values of the mean are now in object drawmeans. Here is the histogram of these 500 simulated values of x̄.

> hist(drawmeans, breaks = seq(1, 5, by = 0.25), main = "500 means of 4 draws")

[Figure: histogram of the 500 simulated means of 4 draws, on the same horizontal scale of 1 to 5.]

We can see that the variance is visibly smaller. By using the var command we find that the variance of these 500 means is 0.53. This is fairly close to the value of 0.45 that we would expect: Var(X̄) = σ²/4 = 1.8/4 = 0.45. Let us now simulate 500 values of x̄, where x̄ is now the mean of 16 observations. We present a condensed set of R commands to generate the means in object drawmeans and then display the histogram. For reasons of efficiency, the sampling, the matrix, and the apply steps are nested in a single command.

> drawmeans = apply(matrix(sample(x, size = 16 * 500, replace = TRUE,
+     prob = px), 16), 2, mean)
> hist(drawmeans, breaks = seq(1, 5, by = 0.25), main = "500 means of 16 draws")


[Figure: histogram of the 500 simulated means of 16 draws, on the same horizontal scale of 1 to 5.]

The distribution of these 500 values of x̄ indicates a much reduced variance. The (theoretical) variance of X̄ is Var(X̄) = σ²/16 = 1.8/16 = 0.1125. For the particular simulation performed (the values in drawmeans), the observed variance is 0.117, again fairly close to the theoretical value. Thus, we have used the simulation capabilities of R to demonstrate visually (from the histograms) and numerically (from the realized variances) the impact of the sample size, n, on Var(X̄).

We can also see an illustration of the Central Limit Theorem in the last histogram. With x̄ values computed from the mean of 16 observations from a particular discrete distribution, the distribution of these sample means shown in the histogram looks approximately like a normal distribution. In Section 5.3 we displayed a histogram of 2000 values of x̄ from another discrete distribution. This was done using the same procedure we used here. Of course the particular discrete distribution must be entered into R as the necessary first step.

It is sometimes difficult to know how many values to generate in a simulation study. In our example in this section we used 500; in Section 5.3 we used 2000. A simulation itself is random, and the number of simulations done helps to control that randomness. Having said that, the number of simulations depends on the purposes of the simulation. We will not explore this issue further, except to note that, for a simulation used for demonstration purposes, 500 to 2000 is a reasonable range for the number of values. When simulation is used directly for inference, other considerations may dictate the need for a different number (usually larger).


5.7.2 Simulations using a Continuous Distribution

Here is another example of simulation for demonstration purposes. We will simulate values of V² corresponding to S², the sample variance from normal data. Assume that the underlying distribution X is distributed as X ∼ N(0, 9) and suppose that the sample size, n, is 6. The sequence of R commands for generating 1000 values from V² with 5 degrees of freedom is as follows.

> draws = matrix(rnorm(1000 * 6, 0, 3), 6)
> drawvar = apply(draws, 2, var)

The command in the first line generates an object named draws with 6 rows and 1000 columns of normal observations, where each normal observation has mean 0 and standard deviation 3. (Recall that rnorm requires the standard deviation, not the variance.) The second line applies the var command to each column, using the apply command, to create the 1000 values of S². We now present the histogram for these 1000 values of V² = (n − 1)S²/σ².

> draws = 5 * drawvar/9
> hist(draws, breaks = 20, prob = TRUE, main = "standard distribution for sample variance")
> v = seq(0, max(draws), length = 200)
> lines(v, dchisq(v, 5), lty = 2, lwd = 2)

[Figure: density-scaled histogram of the 1000 simulated values of V², titled "standard distribution for sample variance", with the chi-square density on 5 degrees of freedom overlaid as a dashed curve.]

Not surprisingly, the shape of this simulated distribution is very close to the shape of the theoretical distribution for χ² with 5 degrees of freedom shown in the figure in Section 5.5 and overlaid as a dashed line here by the last two command lines.


5.7.3 Computing Probabilities for the Variance in R

In the Appendix to Chapter 4, we showed how to compute probabilities for the mean of a normal distribution. Here we show similar calculations for the distribution of the sample variance for normal data. Consider again the pine seedlings, where we had a sample of 18 having a population mean of 30 cm and a population variance of 90 cm². What is the probability that S² will be less than 160?

> n = 18
> pop.var = 90
> value = 160
> pchisq((n - 1) * value/pop.var, n - 1)
[1] 0.9752137

Notice where the sample size (n = 18), population variance (pop.var = 90) and value of interest (value = 160) appear in the pchisq command. The probability of 0.975 agrees with the value shown in Section 5.5. As with other probability commands, the upper tail could have been calculated using the option lower.tail=FALSE.

Now consider the fruit company problem with weight of apple sauce in grams having distribution X ∼ N(275, 0.0016). Here we want to take a random sample of 9 jars and find the value s² so that P(S² ≤ s²) = 0.99. The following R commands do this:

> pop.var = 0.0016
> n = 9
> prob = 0.99
> pop.var * qchisq(prob, n - 1)/(n - 1)
[1] 0.004018047

Again notice where the sample size (n = 9), probability level (prob = 0.99) and population variance (pop.var = 0.0016) appear in the calculation. [Why do the variance and sample size appear outside of the command qchisq?] The value 0.004 agrees with earlier calculations in Section 5.5. Remember that probability calculations for the sample variance rely heavily on the assumption of normality. If the data distribution is not normal, then these probabilities may be way off.
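A simple way to check this kind of quantile calculation is to feed the resulting s² back into the probability calculation of the previous example; the stated probability should come back. A brief sketch:

> s2 = pop.var * qchisq(prob, n - 1)/(n - 1)
> pchisq((n - 1) * s2/pop.var, n - 1)   # returns 0.99, the probability we started with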


6.9 Appendix: Conducting Tests in R by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004 R has a wide range of commands suited to testing. Here we focus on testing the population mean µ and the population variance σ 2 for normal data, and the population proportion p for binomial data. Section 6.9.1 shows examples of the T-test and Z-test under a variety of settings, depending on what you have available. Section 6.9.2 shows how to examine the distribution of p-values for the t.test with simulated data. Section 6.9.3 shows how to test the binomial proportion. Finally, Section 6.9.4 shows how to test the population variance for normal data.

6.9.1 Tests for Population Mean with Normal Data The T-test introduced in this chapter tests a population mean for normal data when the variance is unknown. The t.test can be used for tests of the mean when you have all the data. pt and qt are useful to get probabilities and quantiles, respectively, when you only have the sample mean and sample variance. When the population variance is known, we can use the pnorm command in a manner similar to Appendix 4.4 to conduct tests on the population mean. We demonstrate the use of this command for the white pine seedling data originally introduced in Chapter 2. In this chapter, we discussed hypothesis testing for these data in Subsection 6.3.2. Of interest was the null hypothesis H0 : µ = 40. In that subsection there were two white pine seedling data sets, collected from two different nurseries. For purposes of illustration we focus on data from the first nursery (Set A). > whitepine = c(61, 17, 38, 32, 30, 38, 25, 46, 38, 27, 43, 31, + 34, 41, 27, 22, 40, 22) We enter the t.test command, specifying the null hypothesis of interest. > t.test(whitepine, mu = 40) One Sample t-test data: whitepine t = -2.4258, df = 17, p-value = 0.02669 alternative hypothesis: true mean is not equal to 40 95 percent confidence interval: 28.78161 39.21839 sample estimates: mean of x 34


In Subsection 6.3.2 we concluded that the p-value was between 0.02 and 0.05. Using R, we have been able to specify the p-value more precisely as 0.027. Note that this additional precision does not change our conclusions about the strength of evidence against H0. However, using the t.test command has saved us some work.

The t.test command has an option for conducting a one-sided test. Although we have argued that two-sided tests arise more frequently, there are occasions when one-sided tests are of interest. The R command for getting the left-sided test is

> t.test(whitepine, mu = 40, alternative = "less")

        One Sample t-test

data:  whitepine
t = -2.4258, df = 17, p-value = 0.01335
alternative hypothesis: true mean is less than 40
95 percent confidence interval:
     -Inf 38.30272
sample estimates:
mean of x 
       34 

The right-sided test uses option alternative="greater", giving a p-value of 0.987.

T-test using only sample mean and variance

Suppose we only knew that ȳ = 34 and s = 10.49 for the white pine seedling data. How could we conduct a T-test that µ = 40? We can use the pt command. The command structure is similar to previous commands discussed in Appendices 4.4 (pnorm, qnorm) and 5.7.3 (pchisq, qchisq). For the white pine seedling T-test, first construct the T-value, t = (ȳ − 40)/(s/√n) = −2.43.

> ybar = mean(whitepine)
> ybar
[1] 34
> s = sd(whitepine)
> s
[1] 10.49370
> n = 18
> mu = 40
> t.value = (ybar - mu)/(s/sqrt(n))
> t.value

[1] -2.425823 Then look up the upper tail for the absolute value of t and double it to get the two-sided p-value. Here we use the abs command just to get the absolute value first, removing the minus sign to make sure we are looking in the extremes of the upper tail. > 2 * pt(abs(t.value), n - 1, lower.tail = F) [1] 0.02669288 Suppose you wanted instead to find the t.value, t, corresponding to the upper α = 0.10 level, P (T > t) = 0.10, for the white pine problem. To find t you would use the qt command, much as in Appendix 4.4, and enter > alpha = 0.1 > qt(alpha, n - 1, lower.tail = F) [1] 1.333379 Z-test with known variance R does not have a command for the significance testing of the mean of normal data with known variance. Instead we can use the pnorm command. Suppose we believed the variance of the white pine seedling heights was 100, hence the variance of the mean is 100/18. The left-sided p-value is found by > pop.var = 100 > pnorm(ybar, mu, sqrt(pop.var/n)) [1] 0.005454749 The two-sided p-value is just double this, or 0.011.
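The doubling described above can also be written as a single command; a one-line sketch:

> 2 * pnorm(ybar, mu, sqrt(pop.var/n))   # about 0.011, the two-sided p-value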

6.9.2 Sampling Distribution of the p-value from T-tests

What is the distribution of the p-value for a test of mean when the null hypothesis is true? Suppose we draw samples of size 18 of simulated white pine seedlings from N(40, 100) and suppose we use a T-test to find evidence whether µ = 40 or not. After two setup lines, the line defining draws in the simulation below is like Appendix 5.7.2, drawing 200 samples of size 18, one sample per column. The next line creates a command get.p.value to get the p-value (element p.value) from a t.test that the mean is mu=40. The line after that uses apply to get the p-value by column from the draws. Finally, the last line finds out how many p-values are at or below 0.05 (the value stored in alpha).


> n.draw = 200
> alpha = 0.05
> draws = matrix(rnorm(n.draw * n, mu, sqrt(pop.var)), n)
> get.p.value = function(x) t.test(x, mu = mu)$p.value
> pvalues = apply(draws, 2, get.p.value)
> sum(pvalues <= alpha)

6.9.3 Test of a Binomial Proportion

The prop.test command tests a binomial proportion. We illustrate it with y = 83 successes out of n = 100 trials and a null proportion of p = 0.75.

> y = 83
> n = 100
> p = 0.75
> prop.test(y, n, p)

        1-sample proportions test with continuity correction

data:  y out of n, null probability p
X-squared = 3, df = 1, p-value = 0.08326
alternative hypothesis: true p is not equal to 0.75
95 percent confidence interval:
 0.7389130 0.8950666
sample estimates:
   p 
0.83 
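The same test can also be computed from first principles with the normal approximation, without the continuity correction; a sketch (this matches prop.test(y, n, p, correct = FALSE) rather than the corrected output above):

> phat = y/n
> z = (phat - p)/sqrt(p * (1 - p)/n)
> 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value, about 0.065 without the correction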

6.9.4 Test of Population Variance for Normal Data

Tests of hypotheses about σ² for normal data can be carried out using the methods of this chapter and the probability commands described in the Appendix to Chapter 5 to find the p-value. For instance, a test of H0: σ² = 100 vs. HA: σ² > 100 for the white pine data has the following p-value:

> n = 18
> sample.var = var(whitepine)
> sample.var
[1] 110.1176
> pop.var = 100
> pchisq((n - 1) * sample.var/pop.var, n - 1, lower.tail = FALSE)
[1] 0.3448386


7.6.2 Appendix: Using R to Find Confidence Intervals
by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004

The t.test command in R is a useful one for finding confidence intervals for the mean when the data are normally distributed with unknown variance. We illustrate the use of this command for the lizard tail length data.

> lizard = c(6.2, 6.6, 7.1, 7.4, 7.6, 7.9, 8, 8.3, 8.4, 8.5, 8.6,
+     8.8, 8.8, 9.1, 9.2, 9.4, 9.4, 9.7, 9.9, 10.2, 10.4, 10.8,
+     11.3, 11.9)

If we use the t.test command listing only the data name, we get a 95% confidence interval for the mean after the significance test.

> t.test(lizard)

        One Sample t-test

data:  lizard
t = 30.4769, df = 23, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 8.292017 9.499649
sample estimates:
mean of x 
 8.895833 

Note here that R reports the interval using more decimal places than were used in Subsection 7.1.2. Because the data were recorded to a single decimal, this extra precision is unnecessary. The t.test command can also be used to find confidence intervals with levels of confidence different from 95%. We do so by specifying the desired level of confidence using the conf.level option.

> t.test(lizard, conf.level = 0.9)

        One Sample t-test

data:  lizard
t = 30.4769, df = 23, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 8.395575 9.396092
sample estimates:
mean of x 
 8.895833 

The R command prop.test can be used similarly to construct confidence intervals for the normal approximation to the binomial. > prop.test(83, 100, 0.75) 1-sample proportions test with continuity correction data: 83 out of 100, null probability 0.75 X-squared = 3, df = 1, p-value = 0.08326 alternative hypothesis: true p is not equal to 0.75 95 percent confidence interval: 0.7389130 0.8950666 sample estimates: p 0.83 R does not have a command to find confidence intervals for the mean of normal data when the variance is known. Because this arises rarely in practice, we could skip this. For those interested, the following command lines create a new command norm.interval based on material from this chapter. We apply it to the lizard data, assuming we know ahead that the variance is 2. > norm.interval = function(data, variance = var(data), conf.level = 0.95) { + z = qnorm((1 - conf.level)/2, lower.tail = FALSE) + xbar = mean(data) + sdx = sqrt(variance/length(data)) + c(xbar - z * sdx, xbar + z * sdx) + } > norm.interval(lizard, 2) [1] 8.330040 9.461626 Similar calculations, or a similar function, could be developed for confidence intervals for the variance of a normal distribution. We illustrate this for the variance of the lizard data. > var.interval = function(data, conf.level = 0.95) { + df = length(data) - 1 + chilower = qchisq((1 - conf.level)/2, df) + chiupper = qchisq((1 - conf.level)/2, df, lower.tail = FALSE) + v = var(data) + c(df * v/chiupper, df * v/chilower) + } > var.interval(lizard) [1] 1.235162 4.023559 2

Sampling Confidence Intervals

Here is a way to see how confidence intervals are random. Based on the lizard data, we draw 100 random samples with mean 9 and SD the same as the lizards.

> n.draw = 100
> mu = 9
> n = 24
> SD = sd(lizard)
> SD

[1] 1.429953

> draws = matrix(rnorm(n.draw * n, mu, SD), n)

Now we construct 95% confidence intervals for each sample. The first line creates a local command get.conf.int to extract the confidence interval (conf.int) from the t.test command. The second line uses the apply command to apply get.conf.int to every column of draws. Finally, we count the number of confidence intervals that cover µ = 9.

> get.conf.int = function(x) t.test(x)$conf.int
> conf.int = apply(draws, 2, get.conf.int)
> sum(conf.int[1, ] <= mu & conf.int[2, ] >= mu)
[1] 94


Here is a figure showing the 100 confidence intervals as horizontal lines, with a vertical line at the population mean of 9.

> plot(range(conf.int), c(0, 1 + n.draw), type = "n", xlab = "mean tail length",
+     ylab = "sample run")
> for (i in 1:n.draw) lines(conf.int[, i], rep(i, 2), lwd = 2)
> abline(v = 9, lwd = 2, lty = 2)

[Figure: the 100 confidence intervals drawn as horizontal lines against sample run, with a dashed vertical line at the population mean tail length of 9.]

8.5 Appendix: Normal Score Plots in R by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004 In this chapter we have used both Minitab and R to construct normal scores plots. In R this is done using the function qqnorm, which automatically generates the plot. Before we show its use, we have three important points to note regarding normal scores plots made with computer packages. First, different packages handle ties in different ways. The different approaches result in similar-looking plots, however, so we will not dwell on this point. More noteworthy is the fact that some packages reverse the axes: they plot the population quantiles on the vertical axis and the sample quantiles on the horizontal axis. In these cases, the slope and intercept for such a plot have different meanings than for the plots we have described; however, the key point remains the same: data from a normal population should result in a plot where the points follow a straight line. Our third point is that, instead of using the population quantiles z[(i−0.5)/n] we discussed earlier, some computer packages use slight variants on this. For example, R uses z[(i−3/8)/(n+1/4)] for small sample sizes. The reasons for these variants are somewhat technical and do not concern us here. The differences in the resulting plots are very small. With these points in mind, we now describe the construction of normal scores plots in R. Consider some artificial data called mydata. To produce a normal scores plot, you type: > mydata = c(2.4, 3.7, 2.1, 3, 1.6, 2.5, 2.9) > myquant = qqnorm(mydata)

[Figure: normal Q-Q plot of mydata ("Normal Q-Q Plot"), sample quantiles plotted against theoretical quantiles.]

If the observations in mydata come from a normal distribution, then the above plot of mydata versus their population quantiles should give a straight line. It seems not unreasonable to conclude from this plot that the data come from a normal distribution.

The object myquant contains the quantiles (myquant$x) and the original data (myquant$y). Note that 1.6 is the smallest value in the data set. The corresponding population quantile is z[(i−3/8)/(n+1/4)] = −1.36 from a standard normal distribution, where i = 1 and n = 7. (You should check that P(Z ≤ −1.36) = 0.0869 ≈ 0.0862 = (1 − 3/8)/(7 + 1/4).) The quantiles can be viewed by printing the object myquant as a data frame:

> data.frame(myquant)
           x   y
1 -0.3529340 2.4
2  1.3644887 3.7
3 -0.7582926 2.1
4  0.7582926 3.0
5 -1.3644887 1.6
6  0.0000000 2.5
7  0.3529340 2.9
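Because n = 7 is small here, the theoretical quantiles that qqnorm produced can be reproduced directly from the formula above; a quick check (sort is used because myquant$x is stored in the original data order):

> n = length(mydata)
> qnorm(((1:n) - 3/8)/(n + 1/4))   # population quantiles for the ordered data
> sort(myquant$x)                  # the same seven values, as stored by qqnorm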


9.4b Appendix: Investigating Power with R
by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004

We can use R to investigate power. We already have several tools for this. These include commands for questions about the population mean when the variance is known (pnorm and qnorm from Appendix 4.4) or unknown (pt and qt from Appendix 6.9). In addition, we have briefly studied p-values in Chapter 6. Below we redo some examples from this chapter, and then show a property of p-values under a null and an alternative.

9.4b.1 Power curve

A power curve helps us evaluate our chances of rejecting the null hypothesis when it is false. Reconsider the foresters' problem, where the null hypothesis is H0: µ = 75, or X̄ ∼ N(75, 21²/15), based on random samples of 15 seedlings. The foresters posed the question: what is the probability of rejecting H0 if the true terminal shoot mean length is something other than 75? Under the null hypothesis, the 5% rejection region is defined by

> n = 15
> pop.sd = 21
> SD = pop.sd/sqrt(n)
> SD
[1] 5.422177
> mu = 75
> region = qnorm(c(0.025, 0.975), mu, SD)
> region
[1] 64.37273 85.62727

That is, we reject H0 if X̄ is below 64.37 or above 85.63. The power for the alternative HA: µ = 80 can be found by determining the tail probabilities for each rejection region

> mu = 80
> pnorm(region[1], mu, SD)
[1] 0.001975154
> pnorm(region[2], mu, SD, lower.tail = FALSE)
[1] 0.1496757


and adding them together to get 0.1517. We can reproduce the power curve table in Section 9.2.1, up to roundoff error, with the following commands. Here mus is a sequence (using the seq command) of 55, 60, ..., 95. The power is calculated as above using the pnorm command. The cbind command just binds these together as two columns in a table.

> mus = seq(55, 95, by = 5)
> cbind(mu = mus, power = pnorm(region[1], mus, SD) + pnorm(region[2],
+     mus, SD, lower.tail = FALSE))
      mu     power
 [1,] 55 0.9580589
 [2,] 60 0.7900102
 [3,] 65 0.4540217
 [4,] 70 0.1516509
 [5,] 75 0.0500000
 [6,] 80 0.1516509
 [7,] 85 0.4540217
 [8,] 90 0.7900102
 [9,] 95 0.9580589

In addition, we can reproduce the power curve plot by filling in a bit more (now mus steps from 50 to 100 by 1). The plot command plots the mus against the power (again calculated using two calls to pnorm). The other options to plot make the curve into a line (type="l") and provide axis labels (xlab and ylab).

> mus = seq(50, 100, by = 1)
> plot(mus, pnorm(region[1], mus, SD) + pnorm(region[2], mus, SD,
+     lower.tail = FALSE), type = "l", xlab = "Population mean (mu)",
+     ylab = "Power")

[Figure: the power curve, power against population mean (mu) from 50 to 100, dipping to 0.05 at mu = 75.]

9.4b.2 Histograms of p-values (optional)

What shape is a histogram of p-values? First draw samples of size 18 of simulated white pine seedlings from N(40, 100) and use a T-test to find evidence whether µ = 40 or not, as in Appendix 6.9.2. Here is a density-scaled histogram of the p-values for the test of µ = 40, with a horizontal dashed line corresponding to an exact uniform distribution.

> n.draws = 200
> mu = 40
> pop.sd = 10
> n = 18
> draws = matrix(rnorm(n.draws * n, mu, pop.sd), 18)
> get.p.value = function(x) t.test(x, mu = 40)$p.value
> pvalues = apply(draws, 2, get.p.value)
> hist(pvalues, breaks = 20, prob = TRUE)
> abline(h = 1, lwd = 2, lty = 2)

[Figure: density-scaled histogram of the 200 p-values simulated under the null, with the dashed reference line at height 1.]

In this sample from the null H0: µ = 40, there are 30 of 200 p-values less than or equal to 0.20. Suppose instead that the random draws have mean µ = 37 rather than 40. Below is a histogram of p-values for the test of H0: µ = 40 when the mean is actually from the alternative HA: µ = 37. The horizontal dashed line is added to indicate roughly what a histogram would look like if the null hypothesis were true. But in this case, the null is false. Notice how the distribution skews toward small p-values. That is, we are more likely to get very small p-values than we would expect by chance under H0. This would lead us to reject the null hypothesis more often than if the null were true.

> mu = 37
> draws = matrix(rnorm(n.draws * n, mu, pop.sd), 18)
> pvalues = apply(draws, 2, get.p.value)
> hist(pvalues, breaks = 20, prob = TRUE)
> abline(h = 1, lwd = 2, lty = 2)

[Figure: density-scaled histogram of the 200 p-values simulated under the alternative HA: µ = 37, piled up near 0 and well above the dashed reference line at height 1.]

In this sample from the alternative HA : µ = 37 there are 109 of 200 p-values less than or equal to 0.20.
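The count of small p-values quoted here can be read off the simulated pvalues object directly; a one-line sketch (a new simulation would give a somewhat different count):

> sum(pvalues <= 0.2)   # 109 in the simulation reported above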


10.9 Appendix: The Use of R for Two-Sample Comparisons by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004 This section shows a catalog of two-sample inference calculations using R. We begin with two-sample inference (testing and confidence intervals) under the usual assumptions. Then we show two-sample nonparametric tests, tests of proportions, and finally Levene’s test for unequal variance.

10.9.1 Two-sample Inference based on the T-test

To perform a paired T-test with R, we could subtract the values for one treatment from the other and use the procedure we described in the appendix to Chapter 6. We illustrate using the data from the corn-breeding experiment from Section 10.2.

> backcross = c(209, 193, 223, 212, 238, 211, 228)
> inbred = c(202, 182, 221, 197, 233, 214, 218)
> t.test(backcross - inbred)

        One Sample t-test

data:  backcross - inbred
t = 2.951, df = 6, p-value = 0.02558
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  1.146891 12.281680
sample estimates:
mean of x 
 6.714286 

The t.test command gives the same answers we obtained by hand. (We note that this command gives us a confidence interval for the difference.) Alternatively we can use both samples directly and the paired option to get the same result:

> t.test(backcross, inbred, paired = TRUE)

        Paired t-test

data:  backcross and inbred
t = 2.951, df = 6, p-value = 0.02558
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  1.146891 12.281680

sample estimates:
mean of the differences 
               6.714286 

The R command t.test can perform inference for two independent samples as well. In fact, the t.test command by default assumes you have independent samples (paired=FALSE). Note, however, that the t.test command will perform the test with the variances assumed unequal unless you specify otherwise. In order to perform the test with variance assumed equal, it is necessary to use the var.equal=TRUE option. To illustrate, we analyze the lizard data from Subsection 10.3.1, with Big Bend tail lengths in object bigbend and Box Canyon tail lengths in object boxcanyon. The output for both the test and a 95% confidence interval is as follows.

> bigbend = c(6.2, 6.6, 7.1, 7.4, 7.6, 7.9, 8, 8.3, 8.4, 8.5, 8.6,
+     8.8, 8.8, 9.1, 9.2, 9.4, 9.4, 9.7, 9.9, 10.2, 10.4, 10.8,
+     11.3, 11.9)
> boxcanyon = c(8.1, 8.8, 9, 9.5, 9.5, 9.8, 9.9, 10.3, 10.4, 10.6,
+     10.7, 10.9, 10.9, 11.1, 11.4, 12)
> t.test(bigbend, boxcanyon, var.equal = TRUE)

        Two Sample t-test

data:  bigbend and boxcanyon
t = -3.0933, df = 38, p-value = 0.003701
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.1266512 -0.4441821
sample estimates:
mean of x mean of y 
 8.895833 10.181250 

T-test using only sample mean and variance

Suppose you only had the sample sizes, means and variances for two groups. You could compute the pooled standard deviation, sdpool, as described in Section 10.3.1. Then you can compute the t-score and p-value in a similar manner to the one-sample test shown in Appendix 6. The t.test command does not actually show you the pooled standard deviation. You can compute it as follows given the sample sizes and variances:

> n1 = length(bigbend)
> n2 = length(boxcanyon)
> var1 = var(bigbend)
> var1
[1] 2.044764

> var2 = var(boxcanyon)
> var2
[1] 1.064292
> sdpool = sqrt(((n1 - 1) * var1 + (n2 - 1) * var2)/(n1 + n2 - 2))
> sdpool
[1] 1.287531

The t-value can be computed from the sample means, pooled SD, and sample sizes as follows:

> ybar1 = mean(bigbend)
> ybar1
[1] 8.895833
> ybar2 = mean(boxcanyon)
> ybar2
[1] 10.18125
> t.value = (ybar1 - ybar2)/(sdpool * sqrt((1/n1) + (1/n2)))
> t.value
[1] -3.093299

Again, as in Appendix 6, we find the p-value using the pt command:

> 2 * pt(abs(t.value), n1 + n2 - 2, lower.tail = F)
[1] 0.003701436

Note that this agrees with the calculation above using the t.test.

T-test with unequal variances

To perform independent two-sample inference with the variances assumed unequal, we omit the var.equal option (the default is FALSE). For example, consider the compost data from Subsection 10.3.2 on the percent germination for composts A and B. The resultant commands and output are as follows.

> compostA = c(24, 25, 26, 26, 27, 28, 28, 30, 33)
> compostB = c(22, 32, 37, 40, 44, 47, 49, 51, 52, 56, 67)
> t.test(compostA, compostB)

Welch Two Sample t-test data: compostA and compostB t = -4.6659, df = 11.214, p-value = 0.0006528 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -26.084991 -9.389757 sample estimates: mean of x mean of y 27.44444 45.18182

10.9.2 Two-sample Nonparametric Test

The Mann-Whitney, or Wilcoxon, test is available in R as wilcox.test when you are not willing to assume normality. It works in a similar fashion to the t.test, which we illustrate with the DMBA drug study of Section 10.5.1.

> drugA = c(14, 11, 37, 17, 21)
> drugB = c(8, 13, 9)
> wilcox.test(drugA, drugB)

        Wilcoxon rank sum test

data:  drugA and drugB
W = 14, p-value = 0.07143
alternative hypothesis: true mu is not equal to 0

The Wilcoxon "statistic" W = 14 is slightly different from the Mann-Whitney statistic. It is the sum of the ranks for drugA (29) minus the smallest possible sum of ranks (15 = 1+2+3+4+5). This command can also be used for paired samples, such as the corn breeding data. It can either be used on the difference or on the pairs:

> wilcox.test(backcross - inbred)

        Wilcoxon signed rank test

data:  backcross - inbred
V = 26, p-value = 0.04688
alternative hypothesis: true mu is not equal to 0

> wilcox.test(backcross, inbred, paired = TRUE)

        Wilcoxon signed rank test

data:  backcross and inbred
V = 26, p-value = 0.04688
alternative hypothesis: true mu is not equal to 0
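To see where W = 14 comes from in the drug example above, the rank sum can be computed by hand; a small sketch:

> ranks = rank(c(drugA, drugB))        # ranks of all 8 observations combined
> nA = length(drugA)
> sum(ranks[1:nA])                     # rank sum for drugA: 29
> sum(ranks[1:nA]) - nA * (nA + 1)/2   # 29 - 15 = 14, the W reported by wilcox.test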

10.9.3 Test of Two Proportions Two proportions can be compared with the R command prop.test already seen for one sample in Appendices for Chapters 6 and 7. For the fungicide experiment of Section 10.7.1, we can do > prop.test(c(54, 40), c(72, 67)) 2-sample test for equality of proportions with continuity correction data: c(54, 40) out of c(72, 67) X-squared = 3.0442, df = 1, p-value = 0.08103 alternative hypothesis: two.sided 95 percent confidence interval: -0.01568796 0.32165810 sample estimates: prop 1 prop 2 0.7500000 0.5970149 By default, this uses the continuity correction. We can avoid this using the correct=FALSE option, yielding a p-value of 0.054, which agrees with the value found in Section 10.7.1.
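For reference, the call without the continuity correction mentioned above looks like this (correct = FALSE is the relevant prop.test option):

> prop.test(c(54, 40), c(72, 67), correct = FALSE)   # p-value about 0.054, as quoted in Section 10.7.1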

10.9.4 Levene’s two-sample test of unequal variance Unfortunately, R does not currently have a single command for performing Levene’s test for unequal variance. As a result at least part of these computations of Section 10.4 must be done “by hand.” For the adventurous, here is a home-made command in R to do all the calculations for Levene’s test: > levene.test = function(data1, data2) { + levene.trans = function(data) { + ## find median for group of data + ## subtract median; take absolute value + a = sort(abs(data - median(data))) + ## if odd sample size, remove exactly one zero + if (length(a)%%2) + a[a != 0 | duplicated(a)] + else a + } + ## perform two-independent sample T-test on transformed data + t.test(levene.trans(data1), levene.trans(data2), var.equal = TRUE) + }


The calculations in this new command levene.test are compact but subtle, not for everyone. However, for those interested, here is an explanation. Inside the levene.test command definition is another definition, for levene.trans. This does steps 1-3 of Levene’s test. First, we take the absolute value of the difference between the data and its median. Then, if the sample size n is odd, we keep all data except the first zero. (The n%%2 ”command” reads ”n modulo 2”, which is 0 if n is even and 1 if n is odd.) We illustrate Levene’s test with the bean seed planter data of Section 10.4: > planterI = c(1.2, 1.6, 1.7, 2.4, 2.4, 2.5, 2.6, 2.7, 2.7, 2.9, + 3.1, 3.2, 3.3, 3.7, 3.9, 4) > planterII = c(2.2, 2.3, 2.3, 2.5, 2.5, 2.5, 2.6, 2.7, 2.7, 2.8, + 2.8, 2.8, 2.8, 2.9, 3, 3.1, 3.1, 3.3, 3.4) > levene.test(planterI, planterII) Two Sample t-test data: levene.trans(data1) and levene.trans(data2) t = 2.6083, df = 32, p-value = 0.01372 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.07316947 0.59488609 sample estimates: mean of x mean of y 0.6062500 0.2722222 The original calculations had 0.01 ≤ p-value ≤ 0.02, while we now have p-value = 0.014.


11.9 Appendix: An Example of R Output for ANOVA by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004 Here we briefly indicate how R can be used to perform the ANOVA analysis for several examples in this chapter.

11.9.1 One-way ANOVA

We begin with the oneway.test command. To use this command it is necessary to have the data entered in two columns, one column for the treatment number and another for the data. We have already prepared file drug.dat in this fashion (available in the course data area).

> drug = read.table("drug.dat")
> group = factor(drug$V1)
> y = drug$V2

Column V1 contains the treatment group number and column V2 contains the drug response. The factor command explicitly states that group is a categorical factor with discrete levels, not a number. We can use the oneway.test command to generate the desired output. This command is slightly different from the earlier test commands, as it requires a "formula", in this case y ~ group. The tilde (~) separates the response (y) from the treatment identifier (group).

> oneway.test(y ~ group, var.equal = TRUE)

        One-way analysis of means

data:  y and group
F = 7.3837, num df = 4, denom df = 33, p-value = 0.000231

This output is rather condensed, showing only what you "need to know" for the test. You can get a more standard-looking ANOVA table, but without the TOTAL line, by the following:

> drug.lm = lm(y ~ group)
> anova(drug.lm)

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)    
group      4 5.7028  1.4257  7.3837 0.000231 ***
Residuals 33 6.3719  0.1931                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This latter, more complicated method uses the linear model (lm) command. First, we explicitly need the treatment group to be categorical, a factor in R language (we created it this way above). Second, we use the formula in the lm command to calculate a linear model object. Finally, we print the ANOVA table using the anova command. This method has its advantages as it opens up many new avenues for data analysis.

11.9.2 Graphical Summaries for ANOVA

We can use R to produce stem and leaf displays or box plots and to plot the standard deviations versus the means to check assumptions. We can also use the lm model object created above to check the residuals. Here are some shorthand ways to get stem and leaf displays and boxplots:

> tmp = tapply(y, group, stem)

  The decimal point is at the |

  6 | 0
  6 | 5689
  7 | 3

  The decimal point is at the |

  5 | 79
  6 | 2234
  6 | 7
  7 | 0

  The decimal point is at the |

  4 | 9
  5 | 34
  5 | 5578
  6 | 03

  The decimal point is at the |

  5 | 699
  6 | 13
  6 | 69

  The decimal point is at the |

  5 | 034
  5 | 789
  6 | 2
  6 | 5

> boxplot(split(y, group))

[Figure: side-by-side boxplots of the drug response for groups 1 through 5, on a vertical scale from about 5.0 to 7.0.]

It can be revealing to plot the means against the SDs to see if there is any pattern of variability. This can suggest an appropriate transformation. > plot(tapply(y, group, mean), tapply(y, group, sd), xlab = "means", + ylab = "SDs", pch = levels(group))

[Figure: group standard deviations plotted against group means, with points labeled by group number; means run from about 5.6 to 6.6 and SDs from about 0.42 to 0.48.]

In addition, we can use the lm fit to the one-way ANOVA to check for patterns in the residuals. There are actually four possible plots, but we show only the first one here. > plot(drug.lm, which = 1)

[Figure: the Residuals vs Fitted plot for drug.lm, with fitted values from about 5.6 to 6.6 and observations 30, 31 and 38 labeled.]

11.9.3 Nonparametric one-way ANOVA

The nonparametric one-way ANOVA can be performed using the kruskal.test command. We illustrate it with the fungus data from Section 11.7.

> fungus = c(1.75, 1.25, 1, 2.75, 1.25, 2.5, 1.5, 3.75, 2, 1.75,
+     2.5, 3, 2.75, 4.25, 3, 3.5, 2.75, 2.25, 4)
> trt = c(rep(1, 5), rep(2, 8), rep(3, 6))
> kruskal.test(fungus, trt)

        Kruskal-Wallis rank sum test

data:  fungus and trt
Kruskal-Wallis chi-squared = 8.479, df = 2, p-value = 0.01441

We can also use a formula in this command. For instance, the nonparametric ANOVA for the drug problem is found by kruskal.test(V2 ~ V1, drug), yielding a p-value of 0.0015.

11.9.4 Levene's test of unequal variance for groups

We can compute Levene's test for multiple groups using the following home-made command. It uses some subtle R commands and is probably only for the adventurous reader. This is a more general version of the two-sample Levene's test shown in Appendix 10.

> levene.test = function(data, v1 = "V1", v2 = "V2") {
+     levene.trans = function(data) {
+         ## find median for group of data
+         ## subtract median; take absolute value
+         a = sort(abs(data - median(data)))
+         ## if odd sample size, remove exactly one zero
+         if (length(a)%%2)
+             a[a != 0 | duplicated(a)]
+         else a
+     }
+     ## set up data frame with transformed data for anova
+     V2 = lapply(split(data[[v2]], data[[v1]]), levene.trans)
+     V1 = rep(seq(length(V2)), lapply(V2, length))
+     levdata = data.frame(V1 = factor(V1), V2 = unsplit(V2, V1))
+     ## perform one-way anova on transformed data
+     cat("Overall ANOVA for Levene's Test\n")
+     print(anova(lm(V2 ~ V1, levdata)))
+     ## perform pairwise T-tests on transformed data
+     cat("\nPairwise Levene's Tests\n")
+     pairwise.t.test(levdata$V2, levdata$V1, p.adjust = "none")
+ }

We illustrate this ANOVA version of the Levene's test on the drug data:

> levene.test(drug)
Overall ANOVA for Levene's Test
Analysis of Variance Table

Response: V2
          Df  Sum Sq Mean Sq F value Pr(>F)
V1         4 0.04681 0.01170  0.1632 0.9554
Residuals 31 2.22208 0.07168               

Pairwise Levene's Tests
$method
[1] "t tests with pooled SD"

$data.name
[1] "levdata$V2 and levdata$V1"

$p.value
          1         2         3         4
2 0.9089767        NA        NA        NA
3 0.8863667 0.7812375        NA        NA
4 0.5936598 0.4943337 0.6685501        NA
5 0.6893972 0.5793268 0.7812375 0.8638533

$p.adjust.method
[1] "none"

attr(,"class")
[1] "pairwise.htest"


An Example of ANOVA using R by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004 In class we handed out ”An Example of ANOVA”. Below we redo the example using R. There are three groups with seven observations per group. We denote group i values by yi: > y1 = c(18.2, 20.1, 17.6, 16.8, 18.8, 19.7, 19.1) > y2 = c(17.4, 18.7, 19.1, 16.4, 15.9, 18.4, 17.7) > y3 = c(15.2, 18.8, 17.7, 16.5, 15.9, 17.1, 16.7) Now we combine them into one long vector, with a second vector, group, identifying group membership: > y = c(y1, y2, y3) > n = rep(7, 3) > n [1] 7 7 7 > group = rep(1:3, n) > group [1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3

Descriptive Summaries

Here are summaries by group and for the combined data. First we show stem-leaf diagrams.

> tmp = tapply(y, group, stem)

  The decimal point is at the |

  16 | 8
  17 | 6
  18 | 28
  19 | 17
  20 | 1

  The decimal point is at the |

  15 | 9
  16 | 4
  17 | 47
  18 | 47
  19 | 1

  The decimal point is at the |

  15 | 29
  16 | 57
  17 | 17
  18 | 8

> stem(y)

  The decimal point is at the |

  15 | 299
  16 | 4578
  17 | 14677
  18 | 24788
  19 | 117
  20 | 1

Now we show summary statistics by group and overall. We locally define a temporary function, tmpfn, to make this easier.

> tmpfn = function(x) c(sum = sum(x), mean = mean(x), var = var(x),
+     n = length(x))
> tapply(y, group, tmpfn)
$"1"
       sum       mean        var          n 
130.300000  18.614286   1.358095   7.000000 

$"2"
       sum       mean        var          n 
123.600000  17.657143   1.409524   7.000000 

$"3"
       sum       mean        var          n 
117.900000  16.842857   1.392857   7.000000 

> tmpfn(y)
       sum       mean        var          n 
371.800000  17.704762   1.798476  21.000000 


ANOVA Table

While we could show you how to use R to mimic the computation of SS by hand, it is more natural to go directly to the ANOVA table. See Appendix 11 for other examples of the use of R commands for ANOVA.

> data = data.frame(y = y, group = factor(group))
> fit = lm(y ~ group, data)
> anova(fit)

Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value  Pr(>F)  
group      2 11.0067  5.5033  3.9683 0.03735 *
Residuals 18 24.9629  1.3868                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova(fit) object can be used for other computations on the handout and in class. For instance, the tabled F values can be found by the following. First we extract the treatment and error degrees of freedom. Then we use qf to get the tabled F values.

> df = anova(fit)[, "Df"]
> names(df) = c("trt", "err")
> df
trt err 
  2  18 
> alpha = c(0.05, 0.01)
> qf(alpha, df["trt"], df["err"], lower.tail = FALSE)
[1] 3.554557 6.012905

Confidence Interval for Variance

A confidence interval on the pooled variance can be computed as well using the anova(fit) object. First we get the residual sum of squares, SSError, then we divide by the appropriate chi-square tabled values.

> anova(fit)["Residuals", "Sum Sq"]
[1] 24.96286
> anova(fit)["Residuals", "Sum Sq"]/qchisq(c(0.025, 0.975), 18,
+     lower.tail = FALSE)
[1] 0.7918086 3.0328790

Comparison of Means

Chapter 12 concerns comparing means after conducting an analysis of variance overall F-test. Here is a way to conduct pairwise t-tests.

> pairwise.t.test(y, group)
$method
[1] "t tests with pooled SD"

$data.name
[1] "y and group"

$p.value
           1         2
2 0.29149074        NA
3 0.03444923 0.2914907

$p.adjust.method
[1] "holm"

attr(,"class")
[1] "pairwise.htest"


13.9 Appendix: Example of the Use of R by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004 Here we briefly indicate how R can be used to perform the chi-squared analysis for the test for independence using Mendel’s data on garden peas. The data, as discussed in Section 13.5, should be entered as a matrix. > mendel = matrix(c(38, 60, 28, 65, 138, 68, 35, 67, 30), 3, 3) Then we can use the chisq.test command to calculate the expected values and the X 2 value. > mendel.chisq = chisq.test(mendel) > mendel.chisq Pearson’s Chi-squared test data: mendel X-squared = 1.8571, df = 4, p-value = 0.762 We can examine the object mendel.chisq that we just created to find the expected values and contributions to the chi-squared: > mendel.chisq$expect [,1] [,2] [,3] [1,] 32.86957 70.69565 34.43478 [2,] 63.11909 135.75614 66.12476 [3,] 30.01134 64.54820 31.44045 > mendel.chisq$resid^2 [,1] [,2] [,3] [1,] 0.8007821 0.45887481 0.009277558 [2,] 0.1541331 0.03708776 0.011584746 [3,] 0.1347989 0.18458909 0.065994812 The expected values listed above are all greater than 5, and so the approximation is appropriate. The p-value is not significant, and all of the contributions to chi-squared are below 1. The above illustrates the use of R for testing independence. As we have stressed throughout this chapter, the chi-squared test for homogeneity is identical, although the inference is somewhat different. Thus, the R chisq.test command can be used to test both independence and homogeneity. 1
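As a check on the pieces, the contributions to chi-squared shown above add back up to the X-squared statistic; a one-line sketch:

> sum(mendel.chisq$resid^2)   # about 1.857, the X-squared value reported by chisq.test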

14.8 Appendix: R Output for the Samara Example
by EV Nordheim, MK Clayton & BS Yandell, September 20, 2004

In this appendix we will briefly illustrate some of the regression commands available in R by using the samara data and the lm command. Note that lm allows for the possibility of having several predictors. This is important in multiple regression, a topic we will not pursue in this chapter. We have entered the data with x in column V1 and y in column V2.

> samara = read.table("samara.dat")
> x = samara$V1
> y = samara$V2

Alternatively, you can enter data as we have sometimes done:

> x = c(1.72, 1.72, 1.77, 1.78, 1.82, 1.85, 1.88, 1.93, 1.96, 1.96,
+     2, 2, 2.03, 2.06)
> y = c(0.85, 0.86, 0.72, 0.79, 0.82, 0.8, 0.99, 0.94, 0.82, 0.89,
+     0.95, 1, 0.98, 0.99)

The R command plot produces a scatterplot of y versus x with the line

> plot(x, y)

[Figure: scatterplot of y against x for the samara data.]

To regress y on x, we proceed as follows:


> samara.lm = lm(y ~ x)
> summary(samara.lm)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.10345 -0.03416  0.00803  0.04917  0.11057 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -0.1551     0.2992  -0.518  0.61355   
x             0.5503     0.1579   3.485  0.00451 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06605 on 12 degrees of freedom
Multiple R-Squared: 0.503,      Adjusted R-squared: 0.4616 
F-statistic: 12.14 on 1 and 12 DF,  p-value: 0.004506 

> anova(samara.lm)

Analysis of Variance Table

Response: y
          Df   Sum Sq  Mean Sq F value   Pr(>F)   
x          1 0.052985 0.052985  12.144 0.004506 **
Residuals 12 0.052357 0.004363                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The object samara.lm contains all the information about the regression fit. The summary command provides the form of the regression equation, the estimates of the intercept ("(Intercept)") and slope ("x"). Also provided are the estimated standard deviations ("Std. Error") of these quantities, T values corresponding to H0: b0 = 0 and H0: b1 = 0, and p-values for these tests. R then prints an estimate of σe and R². The notation Adjusted R-squared refers to an adjusted version of R² important in multiple regression. Finally, the anova command produces the ANOVA table for the regression, including the F value and p-value for H0: b1 = 0.

You can add the regression line to the data plot using the following command. However, we will not show the plot here.

> lines(x, predict(samara.lm))
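The Multiple R-Squared value reported by summary can be recovered from the ANOVA table, since R² is the regression sum of squares divided by the total sum of squares; a short sketch:

> ss = anova(samara.lm)[, "Sum Sq"]
> ss[1]/sum(ss)   # 0.052985/(0.052985 + 0.052357), about 0.503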

R has a number of definitions of residual suitable for different purposes. To produce a residual plot as we have defined it, we use the plot command. > plot(samara.lm, which = 1)

[Figure: the Residuals vs Fitted plot for samara.lm, with fitted values from about 0.80 to 0.95 and observations 3, 7 and 9 labeled.]

If we want to see all the predicted values and residuals, we can use the commands predict(samara.lm) and resid(samara.lm), respectively. Actually, there are four possible plots for lm objects. For instance, the second plot is the Q-Q plot: > plot(samara.lm, which = 2)

[Figure: the normal Q-Q plot of standardized residuals for samara.lm, with observations 3, 7 and 9 labeled.]

R can also be used to obtain Ŷest and its estimated standard error, and to obtain confidence intervals for Ŷest and Ŷpred. We do this below for the value x* = 1.80 by using the predict command. We use it twice to get confidence and prediction intervals.

> predict(samara.lm, data.frame(x = 1.8), se.fit = TRUE, interval = "confidence")
$fit
           fit       lwr      upr
[1,] 0.8354017 0.7857125 0.885091

$se.fit
[1] 0.02280564

$df
[1] 12

$residual.scale
[1] 0.0660539

> predict(samara.lm, data.frame(x = 1.8), se.fit = TRUE, interval = "prediction")
$fit
           fit       lwr       upr
[1,] 0.8354017 0.6831462 0.9876571

$se.fit
[1] 0.02280564

$df
[1] 12

$residual.scale
[1] 0.0660539
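The intervals reported above can be reconstructed from se.fit, residual.scale and the t distribution on 12 degrees of freedom, which shows how predict forms them; a brief sketch:

> pred = predict(samara.lm, data.frame(x = 1.8), se.fit = TRUE)
> pred$fit + c(-1, 1) * qt(0.975, pred$df) * pred$se.fit
# matches the lwr and upr of the confidence interval above
> pred$fit + c(-1, 1) * qt(0.975, pred$df) * sqrt(pred$se.fit^2 + pred$residual.scale^2)
# matches the wider prediction interval above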


15.4.1 Appendix: Using R to Calculate Correlations
by EV Nordheim, MK Clayton & BS Yandell, November 25, 2003

The correlation between x and y for the samara data (see Appendix 14.8) is 0.709, using the cor command in R. The cor.test command gives us the correlation along with a formal test. Both cor and cor.test can provide the classical Pearson correlation (method="pearson", the default), the Spearman nonparametric correlation (method="spearman") or Kendall's tau (method="kendall"). We illustrate cor.test below on the samara data.

> cor(x, y)
[1] 0.7092115
> cor.test(x, y)

        Pearson's product-moment correlation

data:  x and y
t = 3.4848, df = 12, p-value = 0.004506
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2864037 0.9008190
sample estimates:
      cor 
0.7092115 

> cor.test(x, y, method = "spearman")

        Spearman's rank correlation rho

data:  x and y
S = 142, p-value = 0.008113
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.6883982 

> cor.test(x, y, method = "kendall")

        Kendall's rank correlation tau

data:  x and y
z = 2.5894, p-value = 0.009613
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau 
0.5197823 
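As a tie-in with Appendix 14.8, the squared Pearson correlation equals the R² from the regression of y on x; a one-line check:

> cor(x, y)^2   # about 0.503, the Multiple R-Squared from the samara regression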