NCC501 Statistics for Management
Sample Size to Control Type II Error Probability

Hypothesis tests are usually designed with one goal: make sure that the Type I Error Probability is α or lower. This does not give any specific protection against a Type II Error, “accepting the null hypothesis when it is false.” Calculating the Type II Error probability is a source of great confusion, and for this reason most people treat it as an “after-thought” in hypothesis testing. As a consequence, if the null hypothesis is not rejected, we are left in the uncomfortable situation of not knowing whether H0 is true, or whether it is false but the sample was too small to give a reliable test. In the latter case, “accepting H0” would be a Type II error.

The key to controlling Type II error is the sample size. This note presents formulas that approximate how large the sample must be. To use these formulas you must make a judgment: You must specify a value of a population parameter (either µ or p) that represents the alternative hypothesis, one that is different enough from H0 to matter. This is a practical issue and has nothing to do with statistics. For example, no one cares if a bottle-filling machine is off by an average of 0.00002 ounces per sixteen-ounce bottle. However, if that gap were 0.1 ounces someone might be very concerned. It is your job to decide how large the gap must be to actually make a difference in the real situation.

The gap between H0 and Ha is the main determinant of sample size. If the gap is narrow, we need a large sample to tell us which hypothesis is true. If the gap is wide, a much smaller sample will suffice. Simply put, it takes more information to distinguish between very similar alternatives than very different ones.

The formulas for calculating the appropriate sample size depend on knowing how much variability there is in the population. This presents a chicken-and-egg dilemma: you can’t calculate the necessary sample size until you have an estimate of variability, but you can’t get an estimate until you collect a sample. An often-used solution to this dilemma is to do a small pilot study to estimate those values. Another method is to make an educated guess. In either case, the resulting sample-size calculation must be treated as an approximation.

The benefit derived from the sample-size calculation comes in the interpretation of the results: If the null hypothesis is not rejected, you have strong evidence that “either H0 is true, or it is false by an amount too small to matter from a practical standpoint.”

The sample-size formulas are based on the normal probability distribution, and certain other assumptions that may be approximations. Use them to determine the “order of magnitude” of the sample size, but do not treat them as exact. For example, if the recommended sample size is 34.5, do not worry about whether to use 34 or 35. “Round up” to an integer, and consider increasing the sample a bit more to be conservative.
One-Sample Test for a Mean (File: SampSize.xls, Sheet: 1-Mean)

µ is the population mean.
• µ0 is the value of µ under the null hypothesis.
• µa is a value of µ that makes Ha true and is “just different enough” from µ0 to matter.
• For a 2-tail test the value of µa could be on either side of µ0. Use either one in the n* formula.
• s is the standard deviation estimated from the sample. If the population value (σ) is available, use it instead of s.

$$ n^* \approx \frac{(z_\alpha + z_\beta)^2 \, s^2}{(\mu_a - \mu_0)^2} \tag{1} $$

Use the Normal Distribution for zα and zβ. For a two-tail test use α/2. This formula for n* assumes that σ has the same value under Ha as under H0.
Example: A bottle-filling machine has been set to produce an average of 16.3 ounces per bottle, in the long run. Legally, every bottle must contain at least 16 ounces. However, the filling process has a standard deviation of 0.06 ounces. If the machine were set at 16, half of the bottles would be underfilled. They set it at 16.3 to make sure that the rate of underfilled bottles is less than one in a million.

Sometimes the machine goes out of adjustment. When the long-run average drops to 16.27 ounces per bottle, the rate of underfilled bottles exceeds 3 per million, a situation that management considers serious enough to stop the filling process and readjust the machine, although the lost production is a substantial cost. Their current sampling plan is to test 36 bottles and stop the machine if the sample’s average is below a given value. However, they are not sure that this plan gives error probabilities that are low enough. Management wants to be 95% certain that the machine will be stopped when its long-run average is 16.27 or lower, and to be 95% certain of NOT stopping the machine when its long-run average is actually 16.3. Is the sample size of 36 sufficient?

This is a one-tail test because they stop the machine only when the sample average is below a specified level.
• H0 is that the machine is running properly, so µ0 = 16.3.
• Ha is represented by µa = 16.27, which is far enough below 16.3 to matter.
• The standard deviation is 0.06. Since it describes the process rather than a particular sample, it is a population value, σ.
• To achieve 95% certainty of avoiding both kinds of errors, α = 0.05 for a one-tail test, so use zα = 1.645, and β = 0.05, so use zβ = 1.645.
$$ n^* \approx \frac{(z_\alpha + z_\beta)^2 \, s^2}{(\mu_a - \mu_0)^2} = \frac{(1.645 + 1.645)^2 \, (0.06)^2}{(0.03)^2} = 43.3 $$
Conclusion: To achieve 5% probabilities for Type I and Type II errors when the difference between “in adjustment” and “out of adjustment” is 0.03, the samples should include at least 44 bottles rather than 36.
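The bottle-filling calculation can be reproduced outside the spreadsheet. The following is a minimal Python sketch of formula (1), not the contents of SampSize.xls; the function names `z_upper`, `n_one_mean`, and `beta_one_mean` are hypothetical, and `beta_one_mean` simply inverts formula (1) to estimate the Type II error probability of the current 36-bottle plan under the same normal approximation.

```python
import math
from statistics import NormalDist


def z_upper(tail_prob):
    """z value that leaves `tail_prob` in the upper tail of the standard normal."""
    return NormalDist().inv_cdf(1.0 - tail_prob)


def n_one_mean(mu0, mua, sd, alpha, beta, two_tail=False):
    """Approximate sample size from formula (1); use alpha/2 for a two-tail test."""
    a = alpha / 2 if two_tail else alpha
    return (z_upper(a) + z_upper(beta)) ** 2 * sd ** 2 / (mua - mu0) ** 2


def beta_one_mean(n, mu0, mua, sd, alpha):
    """Approximate Type II error probability of a one-tail test with n observations."""
    shift = abs(mua - mu0) / (sd / math.sqrt(n))  # gap in standard-error units
    return NormalDist().cdf(z_upper(alpha) - shift)


# Values from the bottle-filling example (one-tail test, alpha = beta = 0.05).
n_star = n_one_mean(mu0=16.3, mua=16.27, sd=0.06, alpha=0.05, beta=0.05)
beta_36 = beta_one_mean(n=36, mu0=16.3, mua=16.27, sd=0.06, alpha=0.05)

print(round(n_star, 1), math.ceil(n_star))
print(round(beta_36, 3))
```

Under these assumptions the recommended size rounds up to 44, and the current plan of 36 bottles carries a Type II error probability of roughly 9% rather than the desired 5%.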
Two-Sample Test for Means (File: SampSize.xls, Sheet: 2-Means)

µ1 − µ2 is the difference between two population means.
• d0 is the value of µ1 − µ2 under the null hypothesis (usually zero).
• da is a value of µ1 − µ2 that makes Ha true and is “just different enough” from d0 to matter.
• For a 2-tail test the value of da could be on either side of d0. Use either one in the n* formula.
• s1 and s2 are the standard deviations from each sample. If the population values (σ1 and σ2) are available, use them instead of s1 and s2.

$$ n^* \approx \frac{(z_\alpha + z_\beta)^2 \, (s_1^2 + s_2^2)}{(d_a - d_0)^2} \tag{2} $$

Use the Normal Distribution for zα and zβ. For a two-tail test use α/2. The recommended sample sizes are n1 = n* and n2 = n*.
Example: The customer service department of a large corporation has begun a program to benchmark their service quality against their competitors. Their first effort was to measure how long it takes to reach a customer service representative at their “800” number. A pilot study was carried out. First they looked at their own system. Based on a sample of 30 calls, the average time was 2.5 minutes and the standard deviation was 0.9 minute. Calling one of their competitors 30 times resulted in an average of 2.7 minutes with a standard deviation of 1.1 minutes.

After careful consideration, the benchmarking team decided that they should guard against making an error of more than 0.3 minute. That is, if the long-run difference in times were really 0.3 minute, they want to be 99% certain that their study will reach the correct conclusion, which would be that the companies really differ. However, they want to be equally careful to draw the correct conclusion if the long-run average times do not differ.

This is a two-tail test because they are only asking if they “differ” from their competitor.
• The null hypothesis is “no difference” so d0 = 0.
• For the alternative hypothesis, a difference of 0.3 or more is important, so da = 0.3.
• Since they want 99% certainty of no error, both error probabilities are to be 1%. Thus, α = 0.01 for a two-tail test, so zα/2 = 2.576. Also, β = 0.01, so zβ = 2.326.
• From the pilot study we have preliminary estimates to use in the sample size formula: s1 = 0.9 and s2 = 1.1.

$$ n^* \approx \frac{(z_\alpha + z_\beta)^2 \, (s_1^2 + s_2^2)}{(d_a - d_0)^2} = \frac{(2.576 + 2.326)^2 \, (0.9^2 + 1.1^2)}{(0.3 - 0.0)^2} = 539.4 $$
Conclusion: To achieve 1% for both error probabilities they should collect samples of at least 540 from each population.
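As a cross-check on the benchmarking example, here is a minimal Python sketch of formula (2). It is an illustration under the same normal approximation, not the spreadsheet itself; `z_upper` and `n_two_means` are hypothetical names introduced for this sketch.

```python
import math
from statistics import NormalDist


def z_upper(tail_prob):
    """z value that leaves `tail_prob` in the upper tail of the standard normal."""
    return NormalDist().inv_cdf(1.0 - tail_prob)


def n_two_means(d0, da, s1, s2, alpha, beta, two_tail=True):
    """Approximate per-group sample size from formula (2)."""
    a = alpha / 2 if two_tail else alpha  # two-tail tests split alpha
    numerator = (z_upper(a) + z_upper(beta)) ** 2 * (s1 ** 2 + s2 ** 2)
    return numerator / (da - d0) ** 2


# Pilot-study values from the call-time example (two-tail, alpha = beta = 0.01).
n_star = n_two_means(d0=0.0, da=0.3, s1=0.9, s2=1.1, alpha=0.01, beta=0.01)

print(round(n_star, 1), math.ceil(n_star))
```

The function uses unrounded z values, so the result agrees with the 539.4 above to one decimal place; rounding up gives 540 calls to each company.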
One-Sample Test for a Proportion (File: SampSize.xls, Sheet: 1-Prop’n)

p is the population proportion.
• p0 is the value of p under the null hypothesis.
• pa is a value of p that makes Ha true and is “just different enough” from p0 to matter.
• For a 2-tail test the value of pa could be on either side of p0. Choose the one that is closest to 0.5.
• The formula relies on the normal approximation to the binomial, so be sure to verify that n*p ≥ 5 and n*(1 − p) ≥ 5 for both p0 and pa after you use it.

$$ n^* \approx \frac{\left( z_\alpha \sqrt{p_0(1 - p_0)} + z_\beta \sqrt{p_a(1 - p_a)} \right)^2}{(p_a - p_0)^2} \tag{3} $$

Use the Normal Distribution for zα and zβ. For a two-tail test use α/2.
Example: The president wants to be informed when her “true” approval rating differs by more than 2 percentage points from 60%. She does not want to be notified of a shift unless the evidence is quite strong. However, she also does not want to be in the embarrassing situation of not having been notified when a real shift has occurred. The staff is about to commission a new survey of 100 voters.

This is a two-tail test because notification is requested whenever there is a change in either direction.
• Unless notified otherwise, she assumes a 60% approval, so p0 = 0.6.
• Because the test is two-tailed, the alternative hypothesis could have either pa = 0.62 or pa = 0.58; we use pa = 0.58, the one closer to 0.5.
• No values are given for error probabilities. We will use 5% for both, so that zα/2 = 1.960 and zβ = 1.645.
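$$ n^* \approx \frac{\left( z_\alpha \sqrt{p_0(1 - p_0)} + z_\beta \sqrt{p_a(1 - p_a)} \right)^2}{(p_a - p_0)^2} = \frac{\left( 1.960 \sqrt{0.6(0.4)} + 1.645 \sqrt{0.58(0.42)} \right)^2}{0.02^2} = 7850.1 $$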
Conclusion: To achieve 5% error probabilities when the difference between “current approval” and “new approval” is 0.02, a sample of at least 7851 observations is needed.
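The approval-rating calculation, together with the n*p ≥ 5 check on the normal approximation, can be sketched in Python as follows. This is an illustration of formula (3) only; the names `z_upper`, `n_one_proportion`, and `normal_approx_ok` are hypothetical and do not come from the spreadsheet.

```python
import math
from statistics import NormalDist


def z_upper(tail_prob):
    """z value that leaves `tail_prob` in the upper tail of the standard normal."""
    return NormalDist().inv_cdf(1.0 - tail_prob)


def n_one_proportion(p0, pa, alpha, beta, two_tail=True):
    """Approximate sample size from formula (3)."""
    a = alpha / 2 if two_tail else alpha
    root = (z_upper(a) * math.sqrt(p0 * (1 - p0))
            + z_upper(beta) * math.sqrt(pa * (1 - pa)))
    return root ** 2 / (pa - p0) ** 2


def normal_approx_ok(n, p):
    """Rule-of-thumb check: n*p >= 5 and n*(1 - p) >= 5."""
    return n * p >= 5 and n * (1 - p) >= 5


# Approval-rating example: two-tail test, alpha = beta = 0.05.
n_star = n_one_proportion(p0=0.60, pa=0.58, alpha=0.05, beta=0.05)
n_used = math.ceil(n_star)

print(round(n_star, 1), n_used)
print(normal_approx_ok(n_used, 0.60), normal_approx_ok(n_used, 0.58))
```

Because the function works with unrounded z values it reproduces the 7850.1 shown above; plugging the rounded values 1.960 and 1.645 into the formula by hand gives a slightly larger figure (about 7850.9), which still rounds up to 7851.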
Two-Sample Test for Proportions (File: SampSize.xls, Sheet: 2-Prop’ns)

p1 − p2 is the difference between two population proportions.
• d0, the value of p1 − p2 under the null hypothesis, is assumed to be zero.
• da is a value of p1 − p2 that makes Ha true and is “just different enough” from zero to matter.
• p1 and p2 are the proportions calculated from the samples.
• p̄ = (n1 p1 + n2 p2) / (n1 + n2) is the pooled estimate. If you have no estimate of p̄, use an educated guess.
• The formula relies on the normal approximation to the binomial, so be sure to verify that n* p̄ ≥ 5 and n*(1 − p̄) ≥ 5 after you use it.

$$ n^* \approx \frac{\left( z_\alpha \sqrt{2\bar{p}(1 - \bar{p})} + z_\beta \sqrt{2\bar{p}(1 - \bar{p}) - 0.5\, d_a^2} \right)^2}{d_a^2} \tag{4} $$

Use the Normal Distribution for zα and zβ. For a two-tail test use α/2. The recommended sample sizes are n1 = n* and n2 = n*.
Example: Consumers United is testing to see whether there is a difference in the results from two pollsters. Each pollster sampled 1000 voters, asking “would you vote for the current president if the election were held today?” One pollster’s result was 37% and the other was 43%. Are the samples large enough to maintain 5% or lower probabilities for both error types?

This is a two-tail test because they are only asking if the polling methods “differ”. The null hypothesis is “no difference”, so we can use the method above. CU has specified that a difference of 0.03 or more is important, so da = 0.03 is the value of p1 − p2 for Ha. For a two-tail test, zα = z0.025 = 1.96 and zβ = z0.05 = 1.645.

First calculate the “pooled” value of the sample proportion, p̄:

$$ \bar{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{1000(0.37) + 1000(0.43)}{2000} = 0.4 $$
Then calculate the recommended sample size:

$$ n^* \approx \frac{\left( z_\alpha \sqrt{2\bar{p}(1 - \bar{p})} + z_\beta \sqrt{2\bar{p}(1 - \bar{p}) - 0.5\, d_a^2} \right)^2}{d_a^2} = \frac{\left( 1.96 \sqrt{2(0.4)(0.6)} + 1.645 \sqrt{2(0.4)(0.6) - 0.5(0.03)^2} \right)^2}{0.03^2} = 6927.5 $$

Conclusion: Since this is much larger than the published samples, if Consumers United uses the samples of 1000 to test the difference between the two pollsters at α = 0.05, they face a Type II error probability that is much larger than 5%.
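The pollster example can be checked with a short Python sketch of formula (4), including the pooled proportion. The helper names `z_upper`, `pooled_proportion`, and `n_two_proportions` are hypothetical, and the code assumes d0 = 0 as the formula does.

```python
import math
from statistics import NormalDist


def z_upper(tail_prob):
    """z value that leaves `tail_prob` in the upper tail of the standard normal."""
    return NormalDist().inv_cdf(1.0 - tail_prob)


def pooled_proportion(n1, p1, n2, p2):
    """Pooled estimate (n1*p1 + n2*p2) / (n1 + n2)."""
    return (n1 * p1 + n2 * p2) / (n1 + n2)


def n_two_proportions(p_bar, da, alpha, beta, two_tail=True):
    """Approximate per-group sample size from formula (4), with d0 = 0."""
    a = alpha / 2 if two_tail else alpha
    var0 = 2 * p_bar * (1 - p_bar)   # variance term under H0
    vara = var0 - 0.5 * da ** 2      # variance term under Ha
    root = z_upper(a) * math.sqrt(var0) + z_upper(beta) * math.sqrt(vara)
    return root ** 2 / da ** 2


# Pollster example: 1000 voters each, 37% and 43%, important difference 0.03.
p_bar = pooled_proportion(1000, 0.37, 1000, 0.43)
n_star = n_two_proportions(p_bar, da=0.03, alpha=0.05, beta=0.05)

print(round(p_bar, 3), math.ceil(n_star))
```

With unrounded z values the result is close to the 6927.5 in the example, so each pollster would need roughly 6928 respondents, far more than the 1000 actually sampled.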