Determining Sample Size in Statistical Experiments—the Computational Approach

Abstract

With today's drive for Six Sigma DMAIC, the issue of measurement errors in experimentation is being widely debated. As a result, sample sizing suggestions ranging from the naive to the theoretically sophisticated have emerged. This has led to the formulation of Gauge R&R guidelines by AIAG and others for various schemes in the tactical use of gauges to measure product characteristics. This study extends that effort to expedite, and possibly even automate, sample sizing decisions, thus conveniently linking the overall power of ANOVA tests in statistically planned experiments to the number of treatments employed, the replications, and the measurements taken on each part. Patnaik's (1949) work is invoked here. It is found that sample sizes may be computed quickly with it and that the results closely match those determined using the classical OC curves. The computational approach is noted to be particularly helpful in what-if studies that assist in optimizing the different interacting sample sizes while planning DOE—to deliver a pre-stated treatment detection power.
Keywords: Design of Experiments; ANOVA; Sample Size; Measurement Errors; Noncentral F; Computational Methods
Introduction

Patnaik (1949) observed that in the Neyman-Pearson theory of testing statistical hypotheses, the efficiency of a statistical test should be judged by its power to detect departures from the null hypothesis. Patnaik then presented a methodology that provides approximate power functions for variance-ratio tests involving the noncentral F distribution. These results focused on deriving approximations to the noncentral F (denoted F′) distribution that permit one to calculate, for example, the probability of rejecting the null hypothesis when it is false in a variance-ratio test with a specified significance (α), deviation, and number of replicated observations.
Subsequently these approximations led to the development of the classical Pearson-Hartley tables and charts that relate the power of the variance-ratio test to a specified deviation or noncentrality (Pearson and Hartley 1972). The data are analyzed by ANOVA (Dean and Voss 1999; Montgomery 2007). The present work extends these results to help one expediently tackle situations where observations are imperfect, implying that the data are affected by measurement and other errors; determining the correct sample size here is then inescapable. Applications of statistically designed experiments currently abound, not only in R&D studies and academics, but also in process improvement practices and strategic interventions such as Six Sigma DMAIC (Pyzdek 2009). As Montgomery notes, experiments are performed as a test or series of tests in which purposeful changes are made to certain experimental factors to observe any changes in the output response. The observed data are statistically analyzed to lead to statistically valid and objective conclusions at the end of the investigation. Perhaps the only guidelines on judging the adequacy of measuring systems available to practitioners of quality control are those by AIAG. Still, to experimentalists the AIAG guidelines are grossly inadequate, for they do not say how much data should be collected to reach credible and statistically defensible conclusions. As a result, for example, many experiments use as few as two replications, while some clinical trials use too many (Donner 1984). At times the reporter is even silent about measurement errors and uses only a single measurement per sample.
The Research Questions:
1. How do we efficiently assess the effect of measurement errors on the results of ANOVA?
2. Can we expediently determine the correct sample size (the number of replications or repeated measurements required on each part experimentally produced) in DOE?
Background and Earlier Work

As Simanek (1996) said, no measurement is perfectly accurate or exact; hence even in conducting designed experiments we can never hope to measure true response values. The true value is what would be measured if we could somehow eliminate all errors from instruments and hold steady all factors other than those being experimentally manipulated. Observed data generally combine in them what is commonly called "experimental error", which includes the effect or influence of all factors other than those the investigator manipulates. Improperly designed experimental studies conducted in such an operating environment introduce unknown degrees of error into the conclusions drawn from them. Indeed, a measured or experimental observation is of little use if nothing is known about the size of its error content. The flaws of ramshackle ways of conducting multi-factor investigations were recognized formally by Fisher (1928), who introduced the principles behind planning experiments to guide empirical studies. Various enhancements followed Fisher's work. Process sample sizes for the ANOVA scheme under measurement error-free conditions (measurement errors being distinct from the experimental errors noted above) were studied by Fisher (1928), Tang (1938), Patnaik (1949), Tiku (1967) and several others. Sample size specification is one of the first questions an investigator faces in designed experimental studies (DOE). He needs to know, "How many observations do I need to take?" or "Given my limited budget, how can I gain as much information as possible?" Methods for sample size determination in DOE (which do not always consider measurement errors) abound. Pearson and Hartley (1972), Dean and Voss (1999) and Montgomery (2007) describe the theory behind these. They regard σ²—the variability caused by all factors (including measurement variability) not in the experimentalist's control—as a key determinant of sample size, and thus of detection power. We note here the distinction between errors in data introduced by an imperfect measurement system in use and those caused by uncontrolled experimental conditions.
A Quick Recall of One-way ANOVA

An investigator resorts to experimentation except in two situations. The first is when he is unable to identify the key factors that might be influencing the response, as in the domains of economics, meteorology or psychology. Here a DOE scheme cannot be set up; one must merely observe the phenomena and look for any notable relationships. The second is when knowledge has already progressed to the point that a theoretical model of the cause-effect relationship may be developed from theoretical considerations alone, common, for instance, in the electrical sciences. In the remaining situations, whose number is large, one approaches the query through scientifically planned empirical investigations. Therefore the use of DOE—a methodology that is based on sound statistical principles—is the usual framework today in metallurgy, chemistry, new product development, drug trials, process yield improvement and even in the optimization of computational algorithms. We use conventional notations in the rest of this paper. Let the study involve only one controllable factor "A" with a distinct treatments; trials are replicated at each treatment level n times. Such an investigation is called a "one-way" study. The observed data are {y_ij}, representing the value of the response noted at treatment i and replication j. We define
y_i. = Σ_{j=1}^{n} y_ij,   ȳ_i. = y_i./n,   i = 1, 2, …, a

y_.. = Σ_{i=1}^{a} Σ_{j=1}^{n} y_ij,   ȳ_.. = y_../(na)
The observed response data {y_ij} are analyzed using one-way ANOVA based on the assumption that the response y is possibly affected by the varying treatment levels of "A" and also by the uncontrolled factors. The relationship is modeled as

y_ij = μ_i + ε_ij,   i = 1, 2, …, a; j = 1, 2, …, n   (1)
where μ_i is the treatment mean at the ith treatment level and ε_ij is a random error incorporating all other sources of variability, including measurement errors and uncontrolled factors. Relationship (1) may also be written using the means model, μ_i = μ + τ_i, i = 1, 2, …, a, a substitution that converts (1) into the ith treatment model or effects model,

y_ij = μ + τ_i + ε_ij,   i = 1, 2, …, a; j = 1, 2, …, n   (2)

Treatment effects {τ_i} of factor A are evaluated by setting up a test of hypothesis, with

H0: μ_1 = μ_2 = … = μ_a
H1: μ_i ≠ μ_j for at least one pair (i, j) of treatment means.

Since μ is the overall average, it is easy to see that τ_1 + τ_2 + … + τ_a = 0. The hypotheses H0 and H1 may then be re-stated as

H0: τ_1 = τ_2 = … = τ_a = 0
H1: τ_i ≠ 0 for at least one i.

To complete the ANOVA one uses quantities defined and computed as follows.
SS_Treatments = n Σ_{i=1}^{a} (ȳ_i. − ȳ_..)²

MS_Treatments = SS_Treatments/(a − 1)

SS_Total = Σ_{i=1}^{a} Σ_{j=1}^{n} (y_ij − ȳ_..)²

SSE = SS_Total − SS_Treatments and MSE = SSE/(a(n − 1))

The test statistic F0 (when hypothesis H0 is true) is defined as MS_Treatments/MSE. F0 follows the F distribution with degrees of freedom (a − 1) and a(n − 1). Under the null hypothesis, this statistic exceeds the critical value F_{α, a−1, a(n−1)} with probability α, the significance of the test (the probability of rejecting H0 when it is true, a type I error). A type II error is committed in the one-way ANOVA procedure when H0 is false, i.e., when at least one treatment effect τ_i is not zero, yet due to randomness F0 ≤ F_{α, a−1, a(n−1)}. The probability of committing a type II error is denoted by β, and the quantity (1 − β) is called the detection power of the test. The power of a statistical test is its ability to reject the null hypothesis when the null hypothesis is false. The present study delves into the rapid and clean evaluation of power and sample size in statistical experiments when these quantities are related to replication (which attempts to capture process or part-to-part variability) and also to the number of measurements taken on each part experimentally produced.
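The quantities above can be computed directly; the following is a minimal sketch in Python, where the data values, and the use of numpy/scipy as the computing tool, are illustrative assumptions and not taken from the paper.

```python
# One-way ANOVA sums of squares and F0, on hypothetical data (a = 3, n = 4).
import numpy as np
from scipy import stats

data = np.array([[12.1, 11.8, 12.4, 12.0],
                 [12.9, 13.1, 12.7, 13.3],
                 [11.5, 11.9, 11.2, 11.7]])
a, n = data.shape

grand_mean = data.mean()
treatment_means = data.mean(axis=1)

ss_treat = n * ((treatment_means - grand_mean) ** 2).sum()
ss_total = ((data - grand_mean) ** 2).sum()
ss_error = ss_total - ss_treat

ms_treat = ss_treat / (a - 1)            # MS_Treatments
ms_error = ss_error / (a * (n - 1))      # MSE
F0 = ms_treat / ms_error

F_crit = stats.f.isf(0.05, a - 1, a * (n - 1))   # reject H0 if F0 > F_crit
print(F0, F_crit)
print(stats.f_oneway(*data))             # cross-check of F0 and its p-value
```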
The Issue of Sample Size in One-way DOE

To have predictive power, statistical experiments require one to pre-specify the sample size—how many observations must be obtained under a given set of conditions. In establishing sample size for experiments on cow milk yield, Gill (1968) confronted a problem typical of sample size determination in designed experiments. Gill's problem was a one-way test in which lactational records of Holstein, Brown Swiss, Ayrshire and Guernsey cows were used to determine how many cows should be milked to establish differences between the largest and smallest yield means. Two classes could be identified owing to the difference in the reported standard deviations of yield—Holsteins and Brown Swiss had a standard deviation of 4.5 kg, while 3.3 kg was the figure for Guernseys and Jerseys. The goal was to determine the number of cows to be milked to detect true mean differences between each "treatment level" (cow type) compared. The test should have 0.5 or 0.8 power (a 50% or 80% chance) of detecting the specified true mean difference in daily milk yield. Gill used the Pearson-Hartley tables to determine that, to achieve a power of 80%, 37 cows each of Holsteins and Brown Swiss would be required. The number for comparing Guernseys and Jerseys was 20 cows each, owing to the latter pair's lower standard deviation. Nevertheless, note that Gill did not have estimates of the measurement errors in milk yield when the same cow was milked on different days and the yield measured. He was thus forced to lump all sources of yield variation other than treatment, including measurement errors, into error variation. Could the yield estimates be compared more precisely? We do not have the information to answer that.

Sample size specification concerns investigators for some important reasons. The study must be big enough that an effect of scientific significance will also be statistically significant. If the sample size is too small, the investigative effort can be a waste of resources for not having the power or capability of producing useful results. If the size is too large, however, the study will use more resources (e.g., cows of each breed) than necessary. Hoenig and Heisey (2001) remark that in public health and regulation it is often more important to be protected against erroneously concluding that no difference exists when one does. Lenth (2001) comments on the relatively small amount of published literature on this issue, other than that applicable to specific tests. Finding the sample size explicitly attempts to link the power of the test with sample size as follows.
1. Specify a hypothesis test on the unknown parameter θ.
2. Specify the significance level α of the test.
3. Specify the effect size Δ whose detection is of scientific interest.
4. Obtain historical estimates of other parameters (such as the experimental error variance σ²) needed to compute the power function of the test.
5. Specify the target power (1 − β) that you would like to achieve when θ lies outside the ±Δ range.

Lenth provides an insightful discussion of the practical difficulties. He reminds us, for instance, that instead of asking directly "How big a difference would be important for you to be able to detect with 90% power using one-way experiments?", one should use concrete questions such as "What results would you expect to see?" The answer may lead to upper and lower values of the required sample size. For determining sample size, we first recall the steps given by Montgomery (2007) based on the methods of Pearson and Hartley (1972), and then the noncentral F approximation due to Patnaik (1949).
Determining Sample Size in One-way ANOVA

Montgomery (2007) determines sample size for the "fixed effects model" by invoking the Pearson-Hartley (1972) power functions as follows. The power of a statistical test is defined as (1 − β), where β is the probability of committing a type II error, given by

β = 1 − P[reject H0 | H0 is false] = 1 − P[F0 > F_{α, a−1, (n−1)a} | H0 is false]   (3)
Critical in evaluating (3) is knowing the distribution of the test statistic F0 when the null hypothesis H0 is false. Patnaik (1949) called the distribution of F0 (= MS_Treatments/MSE when H0 is false) a noncentral F random variable with (a − 1) and (n − 1)a degrees of freedom and a noncentrality parameter λ (Dean and Voss, 1999, page 51). When λ = 0, the noncentral F becomes the usual F distribution. Pearson and Hartley (1972) produced OC curves that relate β with another noncentrality parameter Φ (which is related to λ), defined as
Φ² = n Σ_{i=1}^{a} τ_i² / (a σ²)   (4)
The Pearson-Hartley curves help one determine β given α = 0.05 or 0.01 and a range of degrees of freedom for the noncentral F statistic. In determining sample sizes, the starting point is the specification of Φ. The curves also require the specification of the experimental variability σ². In operationalizing the procedure, several methods are used. If one is aware of the magnitude of the treatment means {μ_i} and has an estimate of σ² (as in Gill's milk yield tests), along with a and n, one can directly compute Φ² and, using the degrees of freedom, read off β (hence the power of the test) from the Pearson-Hartley curves. If, however, the treatment means are unknown but one is interested in detecting only an increase in the standard deviation of a randomly chosen experimental observation because of the effect of any treatment, to use the charts one uses

Φ = sqrt( Σ_{i=1}^{a} τ_i²/a ) / (σ/√n) = sqrt( n[(1 + 0.01P)² − 1] )   (5)
where P is the percent (%) increase in the standard deviation of an observation, due to treatment effects, beyond which one wishes to reject H0 (the hypothesis that all treatment effects are equal). Montgomery (2007) provides several numerical illustrations and summarizes methods usable also for two-factor factorial fixed-effect (constant sample size) designs. It would be straightforward to determine the number of cows required in Gill's (1968) problem using those curves. For that study, given α = 0.05, a target power of 0.8, a maximum difference detection capability of 3 kg, two treatment levels, and experimental variability σ = 4.5 kg, one finds that 38 cows of each breed would be required to compare Holsteins with Brown Swiss.
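A chart reading of this kind can also be checked numerically. The sketch below does so for Gill's two-breed comparison using scipy's exact noncentral F in place of the Pearson-Hartley curves; the inputs (σ = 4.5 kg, Dt = 3 kg, a = 2, α = 0.05) come from the text, while the scipy tooling and the least-favourable two-mean configuration are assumptions of this sketch.

```python
# Power of the one-way ANOVA for a given number of cows per breed, via the
# exact noncentral F (stands in for the Pearson-Hartley chart lookup).
from scipy.stats import f, ncf

def power_one_way(n, a, D, sigma, alpha=0.05):
    """Power to detect a difference D between two extreme treatment means,
    n replicates per treatment; least favourable case: two means D/2 away
    from the grand mean, all others at the grand mean."""
    lam = n * D ** 2 / (2 * sigma ** 2)        # lambda = n * sum(tau_i^2) / sigma^2
    nu1, nu2 = a - 1, a * (n - 1)
    return ncf.sf(f.isf(alpha, nu1, nu2), nu1, nu2, lam)

# Holstein vs Brown Swiss: sigma = 4.5 kg, D = 3 kg, a = 2
for n in (18, 19, 37, 38):
    print(n, round(power_one_way(n, 2, 3.0, 4.5), 3))
# Sample sizes near 19 and 38 cows per breed give powers of roughly 0.5 and 0.8.
```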
Measurement Errors and Repeat Measurements

A well-known assertion is often cited—increasing the sample size at the replication level and taking repeat measurements raise the power of treatment detection. Still, to actualize this, the experimental error variability σ² is rarely measured explicitly. It is the push by Juran and other gurus that has led to plants doing gauge R&R studies (Juran and Gryna 1993; AIAG 2002). Sample sizing theory is plentiful; Duncan (1974), for instance, discusses it vigorously chapter after chapter and demonstrates the use of the Pearson-Hartley power charts (Pearson and Hartley 1972). Still, restricted or even naive practice is common, possibly due to the somewhat cumbersome interpretations that the charts sometimes require. Suggestions merely to "keep their impact low" wherever measurements are used are clearly inadequate, for this says little about the size of the process or measurement samples to be used. AIAG guidelines state that measurement variances, quantified as σ²_Gauge R&R, of "up to a third of the variance of the process being measured" are "acceptable" (AIAG 2002; Wheeler 2009). Gryna, Chua and Defeo (2007) remark that when the measurement variability, and therefore the total standard deviation of repeatability and reproducibility, is known, one must judge the adequacy of the measurement process. Authors cite rules-of-thumb that practitioners commonly use here, such as: if 5.15 × (measurement standard deviation) ≤ the specification range of the quality characteristic being measured, the measurement process is acceptable, not otherwise. However, such rules beg to be made more precise. Note that even at this level of measurement system performance one has little idea about detectability (based on β) and the extent to which type II errors are committed by QC and R&D staff. The inadequacy of such guidelines is self-evident if one looks at the different ways the observed experimental data (the input)—even in a carefully completed one-way ANOVA—can be affected. This is shown in the simplified picture in Figure 1; the notation appears in Table 2.
σ²_total = σ²_treatment + σ²_observed process,   where   σ²_observed process = σ²_true part-to-part + σ²_measurement
Figure 1 Constituents of Observed Variance in Experimental Data
Table 2 Variability in Experimental Data

Variance                     Source of variability in data
σ²_total                     Total data variability resulting from all sources of variation
σ²_treatment                 Variability caused by treatment effects
σ²_true part-to-part         True process variability
σ²_observed process          Observed variability in experimental data; σ² in this paper
σ²_measurement               Variability caused by the measurement system—includes repeatability, bias, stability and reproducibility
Thus the raw data contain a total variability that, under certain assumptions, Fisher (1966) showed could be decomposed into several different sources of variability (see Montgomery, 2007, page 519). Such a breakdown of variability, and minimization of its impact by correct sample sizing, could lead one to optimally allocate the available resources to a DOE involving replication (Duncan, 1974, p. 619). The classical OC curves and tables by Pearson and Hartley remain today's anchor for sample size determination in statistical experiments. Their use by the majority of experimentalists remains limited, however. The present work appraises Patnaik's approximations as an alternative method for finding sample size. The goal is to test this method's efficacy in linking type II errors with (a) the number of treatments used in an experiment analyzed by one-way ANOVA, (b) the number of replicated process samples (parts) required at each treatment, and (c) the number of measurements to be made per part. The motivation is that Patnaik's method permits direct and explicit calculation of the β risk, rather than one's having to interpret charts that sometimes get difficult to read.
Audacious Extraction of Replication and Measurement Sample Sizes

A casual look at Figure 1 will suggest that the effects of the constituent variances on the overall (total) variance of the data produced by a measurement system are additive; this ignores interaction and higher-order effects. Interactions between operators, parts, and measurements do exist and can be estimated, even if small. Moreover, when the power of treatment detection (1 − β) is specified, other things being given, the number of replications and the number of measurements made per part cannot vary independently—their values also interact. Still, a practitioner may choose to ignore interactions and instead focus on boldly unmasking the two critical sample sizes (the replications, and the number of repeat measurements) that ensure a pre-stated level of type II risk protection. In the simplistic illustration that follows we assume that a single-factor experiment is being conducted and the data analysis uses one-way ANOVA. The overall protection this procedure would give, assuming operational isolation between the application of treatments to the units of the experiment and the measurements to which these units are later subjected (see Duncan (1974), page 731), and also no interaction between the value of the response observed and the measurement process, is

Overall power of the experiment = (1 − β_treatment)(1 − β_measurement)   (6)
Outline of the Procedure

We define and set values for certain decision parameters. Specifically, Dt is defined as the minimum detectability of the experimental design when the response being studied is subjected to different factor settings (treatment levels). If the observed process variability σ²_observed process is known, then for a given power (= 1 − β_treatment) one may use a standard procedure (e.g., Section 5.3.5, Montgomery (2007)) to find p—the required number of parts (process replications) needed in the experiments. The next step is to define Dm, the minimum detectability of part size (or response value) differences that the measurement system in use is able to distinguish. Again, if the measurement system variability σ²_measurement is known, then for a given power (= 1 − β_measurement) one may use a similar procedure to find m—the required number of repeat measurements to be taken on each part (unit process sample) in the experiments. Ideally, Dm should be about an order of magnitude less than Dt (page 371, Montgomery (2012)) to allow for rounding errors. In practice, as stated earlier, rules-of-thumb frequently set this near ¼ of Dt (AIAG (2002)). For the present, we define an acceptable quantity d such that Dm = Dt/d. Given p and m, the overall protection against falsely accepting that a treatment effect exists when in fact it does not, assuming that the two procedures—conducting the experiments, and taking measurements on the replicated process units produced under each treatment—are independent, is

Overall power of the experiment = (1 − β_treatment)(1 − β_measurement)

To operationalize this procedure one needs to know (1) how many parts to produce during a treatment, and (2) how many measurements to take per part or unit process sample. Figure 1 provides the basis for answering these two questions and the related sensitivities either by power charts—or computationally, if one invokes, for instance, Patnaik's approximation for the noncentral F, as we see later. We first use the chart-based method and set up a data table that incorporates repeated measurements. In this scheme each part produced will be measured m times. Thus, instead of a single observation y_ij in the (i, j) cell as shown in Table 1, we shall have m entries in it:

y_ijk,   i = 1, 2, …, a; j = 1, 2, …, p; k = 1, 2, …, m

Here p parts are produced in each treatment. Thus the two sample sizes of interest are p (the number of replicated parts produced in each treatment) and m (the number of measurements made per part). These two quantities are our decision variables in optimizing the overall type II error, or the overall power, of the experiment. The following procedure can estimate the required number of measurements per part when the measurement system variance σ²_measurement is not known. The goal here would be to have
sufficient power, through a minimum number (m) of measurements, to detect through measurements a change of P% in the average of the measured data due to variability in measurements. Using (5), which was obtained from Montgomery (2007, page 109) and is also given in Pearson and Hartley (1972, page 159), one finds

Φ = sqrt( m[(1 + 0.01P)² − 1] )

For a numerical problem in which the number of parts used in each treatment (p) and the desired significance α are known, for given m and P one can find the corresponding value of Φ and then the probability of committing a type II error (β) from the Pearson-Hartley charts given in Montgomery (2007, page 647 onward), or from Pearson and Hartley (1972, Table 30). For example, if p = 5, P = 10% and a trial value of m is assumed to be 5, one computes the degrees of freedom for the noncentral F to be ν1 = p − 1 = 4 and ν2 = m(p − 1) = 20. These quantities lead to Φ = sqrt(5[(1 + 0.1)² − 1]), or 1.025. Let α = 0.05. Then from the Pearson-Hartley chart one obtains 1 − β = 0.55, yielding a power of 0.55. Similarly, by assuming m = 10 and visually interpreting the curves one obtains a power of 0.72. For m = 20 one obtains a power of 0.96. This process of gradually adjusting the trial value of m can continue till one conservatively reaches the target power. Similarly, the number of parts required to detect a change in average part dimension due to treatment effects may be found. Suppose we wish to detect a Q% increase in the standard deviation of an observation due to treatment effects. The problem then becomes one of finding the required number of parts. In this case

Φ = sqrt( p[(1 + 0.01Q)² − 1] )

Again, given the number of treatments a, one finds ν1 = a − 1 and ν2 = a(p − 1). When the significance α is specified, for a starting value of the number of parts (p) one can calculate Φ, and then the corresponding power of detecting treatment effects (1 − β), from the Pearson-Hartley charts. The process may be repeated with different values of p till a value is located that delivers the required target power.
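A minimal sketch of this trial-and-error search for m is given below. It is a sketch under stated assumptions: scipy's noncentral F replaces the chart reading, the relation λ = Φ²(ν1 + 1) anticipates the Patnaik section that follows, the error degrees of freedom follow the text's m(p − 1) convention, and the target power of 0.90 is chosen only for illustration.

```python
# "Percentage increase" search for the number of measurements m per part.
from scipy.stats import f, ncf

def power_pct_increase(m, p, P, alpha=0.05):
    phi2 = m * ((1 + 0.01 * P) ** 2 - 1)      # Phi^2 from equation (5), m repeats
    nu1, nu2 = p - 1, m * (p - 1)             # degrees of freedom, as in the text
    lam = phi2 * (nu1 + 1)                    # lambda = Phi^2 (nu1 + 1)
    return ncf.sf(f.isf(alpha, nu1, nu2), nu1, nu2, lam)

p, P, target = 5, 10.0, 0.90                  # worked example: p = 5 parts, P = 10%
m = 2
while power_pct_increase(m, p, P) < target:   # raise m until the target power is met
    m += 1
print(m, round(power_pct_increase(m, p, P), 3))
```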
The foregoing "percentage increase" method has one important utility. It assures specified powers for the experiment's ability to detect average part size differences, and also, separately, to detect treatment effects on those parts—without knowledge of the measurement system variability or the process variability. The detection targets embedded in the forms of P or Q assure catching a relative increase of the standard deviation of observations, separately for measurements and for the needed replications. Frequently, however, it is necessary to determine the overall error one may commit, or the risk (β) one might be exposed to, while conducting designed experiments with repeated measurements. For this it is necessary to obtain prior estimates of the measurement variability σ²_measurement and the observed process or part-to-part variability σ²_observed process. The procedure would be as follows.
One begins with (4). Montgomery (2007, page 108) affirms a very useful result—that in one-way experiments the smallest value of Φ² corresponding to a specified difference Dt between any two treatment means is given by

Φ² = n Dt² / (2a σ²)   (7)
Dean and Voss (1999, page 52) provide the proof of (7). Recall that the distribution of F0 (= MS_Treatments/MSE when H0 is false) is noncentral F with (a − 1) and (n − 1)a degrees of freedom. Hence, given the number of treatments a, the power of the test depends on the sample size n and Dt and is evaluated using the noncentral F distribution, F′. Power itself is evaluated as

1 − β_treatment = P[MS_Treatments/MSE > F_{α, a−1, (n−1)a}]

where F_{α, a−1, (n−1)a} is the critical value for significance α and MS_Treatments/MSE follows the noncentral F (F′) distribution when H0 is false. Thus the steps to determine the power of the one-way ANOVA as a function of the number (n) of parts or process samples are:

1. Estimate σ²_observed process, the observed part-to-part variance, from historical data or by the method given in Duncan (1974, page 612).
2. Decide on the number of treatments a, the treatment effect detection criterion Dt and the significance α.
3. Select n. Hence, for the noncentral F distribution to be used, the degrees of freedom are (a − 1) and a(n − 1).
4. Using σ²_observed process for σ² and (7), calculate the noncentral quantity Φ², and hence Φ.
5. Use the appropriate Pearson-Hartley curves (Pearson and Hartley 1972, Table 30) to read off the power of the test—the probability of rejecting the hypothesis (that all a treatment effects are identical) at detection criterion Dt.

In finding the minimum sample size n that achieves a target power (1 − β) of the one-way ANOVA, steps 3 to 5 are exercised recursively. One begins with a trial value of n and from the power charts determines the corresponding power. If the result is lower than the target power desired, one increases n in successive iterations of the steps above till the resulting power ≥ the target power. Note from Figure 1 that the observed process (i.e., part-to-part) variance σ²_observed process includes the effect of measurement errors; hence the number of parts (the required replications) thus estimated is conservative (higher than what would result from using the true part-to-part variance). We point now to another method for estimating the risk β that is computational and does not require one to invoke the Pearson-Hartley power curves or tables.
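Before turning to that method, the recursive exercise of steps 3 to 5 can itself be scripted. The sketch below uses scipy's exact noncentral F as a stand-in for the chart lookup; the numerical inputs at the bottom are hypothetical and serve only to show the loop.

```python
# Iterate the number of parts n until the one-way ANOVA reaches a target power.
from scipy.stats import f, ncf

def treatment_power(n, a, Dt, sigma_obs, alpha=0.05):
    phi2 = n * Dt ** 2 / (2 * a * sigma_obs ** 2)   # equation (7)
    lam = phi2 * a                                   # lambda = Phi^2 (nu1 + 1), nu1 = a - 1
    nu1, nu2 = a - 1, a * (n - 1)
    return ncf.sf(f.isf(alpha, nu1, nu2), nu1, nu2, lam)

def parts_needed(a, Dt, sigma_obs, alpha=0.05, target=0.80, n_max=200):
    for n in range(2, n_max + 1):                    # steps 3-5, repeated
        if treatment_power(n, a, Dt, sigma_obs, alpha) >= target:
            return n
    return None                                      # target not reached within n_max

# Hypothetical inputs: a = 3 treatments, Dt = 1.5, sigma_observed = 2.0
print(parts_needed(3, 1.5, 2.0))
```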
Patnaik's (1949) Approximation for the Noncentral F Distribution

Patnaik noted that the power function (1 − β in our notation) of the F distribution may be used to determine in advance the size of an experiment (the number of samples required) to ensure that a worthwhile difference would be established as significant—if it exists. However, he particularly remarked that although the mathematical forms of these distributions had long been known, computing power tables based on them was not easy because of their complexity. To this end Patnaik developed approximations to the probability integrals involved and coined the terms noncentral χ² and noncentral F. The noncentral F is denoted by Patnaik as F′, which he approximated by fitting an F distribution whose first two moments are identical to those of F′, a procedure subsequently adopted by Pearson and Hartley (1972) to construct the power and operating characteristic curves of the ANOVA power functions. Only the parts of Patnaik's procedure relevant to the present context are extracted below. For a detailed description the reader is referred to Patnaik (1949) and to the text portions of Pearson and Hartley. As just stated, Patnaik provided an approximation to the noncentral F (i.e., F′) distribution that lets one directly evaluate the required critical value of F using built-in statistical functions for the standard F distribution in software tools such as Excel. This bypasses the need to visually interpret quantities from Table 30 of Pearson and Hartley (1972) or from the power charts. This method, described later, was used extensively in the present study to help estimate the different required sample sizes. Patnaik suggested that the noncentral F distribution (F′) can be approximated by a standard F distribution with some small adjustments such that the first two moments of the two distributions are identical. The great utility of this approximation is that it saves a great deal of computational effort (and even automates some steps) and interpretation of curves in charts, and it is almost identical to the exact quantities for most practical purposes (Pearson and Hartley 1972, page 67; Tiku 1966; Pearson and Tiku 1970). The approximation says
F′(ν1, ν2, λ) is nearly equal to kF(ν, ν2), where k and ν are determined so as to make the first two moments of the approximation identical:

k = (ν1 + λ)/ν1   and   ν = (ν1 + λ)²/(ν1 + 2λ).

In these expressions the quantity Φ is used. It is evaluated from the relationship Φ² = λ/(ν1 + 1), as shown in Patnaik (page 223) and Pearson and Hartley (1972, page 68). Thus we have λ = Φ²(ν1 + 1). kF(ν, ν2) yields 1 − β, while the sample sizes are linked to ν1 and ν2, as shown below.
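The approximation is compact enough to code in a few lines. The sketch below is one possible Python rendering, with scipy standing in for Excel's built-in F functions; the final two lines cross-check it against scipy's exact noncentral F for one arbitrary, assumed set of parameters, and the two results should agree closely.

```python
# Patnaik's approximation: F'(nu1, nu2, lambda) ~ k * F(nu, nu2), moments matched.
from scipy.stats import f, ncf

def patnaik_power(nu1, nu2, lam, alpha=0.05):
    k = (nu1 + lam) / nu1
    nu = (nu1 + lam) ** 2 / (nu1 + 2 * lam)
    f_crit = f.isf(alpha, nu1, nu2)          # F_(1-alpha)(nu1, nu2)
    return f.sf(f_crit / k, nu, nu2)         # P[F(nu, nu2) > F_crit / k]

nu1, nu2, lam = 2, 12, 8.0                   # arbitrary illustrative values
print(patnaik_power(nu1, nu2, lam),          # Patnaik approximation
      ncf.sf(f.isf(0.05, nu1, nu2), nu1, nu2, lam))   # exact noncentral F
```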
Application of Patnaik's Results to Sample Size Determination

Recall the conditions stated earlier under which the one-way ANOVA with replicated measurements is conducted—the number of treatments is a; the number of replications (process samples or parts observed in each treatment in the fixed effect scheme of running the experiments) is p (p > 1); and each part is measured m times (m ≥ 1) using the measurement system available. Our objective is to determine the minimum number of parts that must be produced to assure sufficient power (1 − β_treatment) at the treatment effect detection stage, as well as the number of measurements that must be taken on each part to ensure sufficient power (1 − β_measurement) in correctly determining the size of each part measured. As is often the practice, we assume that part sizes and their measurements do not interact. Therefore the two powers, each being independent of the other as they are statistically isolated, when multiplied give a simple estimate of the overall power of the empirical study, as stated by (6). Patnaik's results may be used in this computation as follows.

Finding the required number of process samples to be collected at each treatment to assure a power of (1 − β_treatment) proceeds as follows. To begin, one assumes a desired significance α, the number of treatments a, the variance σ²_observed process (estimated from historical process output data—this includes a small contribution of measurement errors), and the target treatment effect detection capability Dt. Next, a trial value of the part number p is assumed and the noncentral quantity Φ² (which is based on (7)) is calculated from the relationship

Φ² = p Dt² / (2a σ²_observed process)

Next, the degrees of freedom ν1 (= a − 1) and ν2 (= a(p − 1)), λ (= Φ²(ν1 + 1)), and then k and ν are determined using Patnaik's relationships

k = (ν1 + λ)/ν1   and   ν = (ν1 + λ)²/(ν1 + 2λ).
As the last step, the power delivered by p is calculated using the relationship (Patnaik 1949, equation (66))

Power = (1 − β_treatment) = P[ F(ν, ν2) > (ν1/(ν1 + λ)) F_{1−α}(ν1, ν2) ]

The quantity F_{1−α}(ν1, ν2) above is the one-sided critical value of the F random variable that has (1 − α) of the area under the F(ν1, ν2) density to its left. This quantity can be determined, for instance, by the built-in statistical function FINV(.., .., ..) in Excel. (We note that the present paper uses the conventional notation β for P[type II error] (Juran and Gryna 1993), while Patnaik (1949) and Pearson and Hartley (1972) use "β" for the power function, the ability of the test to reject the null hypothesis when it is false.) Thus, given p, the treatment effect detection criterion Dt, the significance α and the observed historical process variability represented by the variance σ²_observed process (obtained by the procedure given by Duncan (1974, page 612)), one can approximately calculate the power at the treatment effect detection level. Since measurement variability is included in the observed process variability, for a given p the power estimated as above is conservative (lower than what is possible if measurement variability were eliminated). The number of measurements (m) required to be taken on each part produced, to ensure that part sizes are measured with an acceptable level of precision, may be similarly found. The following changes would be necessary. The new required quantities are defined as:

p = number of parts produced per treatment
Dm = target maximum detectability, by measurement, of the difference between two part sizes
a = number of treatments
m = number of measurements to be taken on each part
σ²_measurement = variance observed in measured data, estimated from gauge R&R studies done separately

What we have done above is to substitute the step of physically looking up the Pearson-Hartley charts or related tables by a "built-in" computation readily available in today's tools such as Excel®. It is impressive that such tools now allow us to readily exploit powerful theoretical results stockpiled decades back. Given p and m, the above procedure allows us to compute the overall power of detecting treatment effects for given detectability criteria Dt and Dm, as sketched below. The serious weakness of this approach is that the effects of p and m are not independent, and their individual effects on the overall power of the one-way ANOVA are not additive. We introduce an innovative approach below to find both p and m in one step. For this we use the two-way factorial experiment layout in a modified way.
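Before moving to the one-step approach, the two-stage calculation of equation (6) can be sketched as follows. All numerical inputs are hypothetical, Patnaik's approximation is used for both stages, and the degrees of freedom follow the standard one-way layout at each stage; this is an illustrative sketch, not the paper's worked case.

```python
# Overall power per equation (6): treatment-level power (p parts) times
# measurement-level power (m repeats), assuming the two stages are independent.
from scipy.stats import f

def patnaik_power(nu1, nu2, lam, alpha=0.05):
    k = (nu1 + lam) / nu1
    nu = (nu1 + lam) ** 2 / (nu1 + 2 * lam)
    return f.sf(f.isf(alpha, nu1, nu2) / k, nu, nu2)

def overall_power(a, p, m, Dt, Dm, sigma_proc, sigma_meas, alpha=0.05):
    # Treatment detection: p parts per treatment, detectability Dt
    lam_t = p * Dt ** 2 / (2 * sigma_proc ** 2)       # lambda = a * Phi^2
    pow_t = patnaik_power(a - 1, a * (p - 1), lam_t, alpha)
    # Measurement detection: m repeats per part, detectability Dm
    lam_m = m * Dm ** 2 / (2 * sigma_meas ** 2)       # lambda = p * Phi^2
    pow_m = patnaik_power(p - 1, p * (m - 1), lam_m, alpha)
    return pow_t * pow_m

print(round(overall_power(a=2, p=5, m=3, Dt=2.0, Dm=0.5,
                          sigma_proc=0.9, sigma_meas=0.2), 3))
```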
Finding sample sizes p and m in one step

It is possible to obtain the "sample sizes" p and m in one step, and also to find the sensitivity of an experiment's overall treatment detection power to these two sample sizes, with a small procedural modification. Consider the two-way experimental layout shown in Table 3, involving factors A and B, with n (= m in the earlier notation) repeat measurements carried out on each replicated unit process sample. The key change made here to relate the required number of measurements per part to the power of detecting the effect of B is how the noncentral quantity Φ² is defined. In this case, to separate (a) the effect of B, (b) part-to-part variability, and (c) process variability due to measurement and other factors, we draw a mapping from the two-way ANOVA as follows.
Table 3 The Two-factor DOE Data Table (Table 5.2, Montgomery 2007)

                        Factor B
Factor A   1                        2                        …   b
1          y111, y112, …, y11n      y121, y122, …, y12n      …   y1b1, y1b2, …, y1bn
2          y211, y212, …, y21n      y221, y222, …, y22n      …   y2b1, y2b2, …, y2bn
…
a          ya11, ya12, …, ya1n      ya21, ya22, …, ya2n      …   yab1, yab2, …, yabn
For the case of the two-way factorial design, given DB (the difference in any two column means) and n replicates, the minimum value of Φ² is (Montgomery 2007, equation (5.18))

Φ² = n a DB² / (2b σ²)
where
a = number of treatments of factor A
b = number of treatments of factor B
DB = target maximum detectability difference between two column means
n = number of replications for each treatment A and treatment B combination
σ² = variance of model errors caused by factors other than A and B (Montgomery 2007, page 64)

We re-define the terms in the two-way data table as follows. The design of interest is a one-way factorial design with b treatments applied with factor B, p replicated parts produced per treatment, and n measurements taken per part. The treatment detectability desired for B is Dt, whereas σ² represents the variability attributable to all factors other than factor B and part-to-part or process variability. For this case, given Dt (the difference in any two row means) and p replicates, the minimum Φ² will be
Φ² = n p Dt² / (2b σ²)
Table 4 shows the corresponding data table. Thus we use here a two-way analog of the one-way factorial design in which b treatments are involved, p parts are made (by replication) at each treatment level of the factor, and n measurements are made on each part. In the above expression for Φ², the variability σ² is due to all factors including measurement effects, but it excludes the effect of B (variation chronicled between rows) and of parts production by replication (variation chronicled between columns).
Table 4 The One-way Experimental Data Table with n measurements made on each of the p replicated parts at each treatment

                        Parts
Treatments   1                        2                        …   p
1            y111, y112, …, y11n      y121, y122, …, y12n      …   y1p1, y1p2, …, y1pn
2            y211, y212, …, y21n      y221, y222, …, y22n      …   y2p1, y2p2, …, y2pn
…
b            yb11, yb12, …, yb1n      yb21, yb22, …, yb2n      …   ybp1, ybp2, …, ybpn
Given now specified values of Dt, b, p and n, one may determine the noncentral quantity Φ², hence Φ, and then λ, k, ν1, ν2 and ν, and lastly, by using Patnaik's approximation

Power = (1 − β_treatment) = P[ F(ν, ν2) > (ν1/(ν1 + λ)) F_{1−α}(ν1, ν2) ]
compute the expected power (1 − β_treatment) of detecting the treatment effect of factor B. Thus, this approach allows an investigator to computationally zero in on the required number of parts or replications p to be produced in each treatment and the required number of measurements n to be taken on each part. This aspect of designing experiments is not straightforwardly guided by the published literature, except for the broad directions given to invoke the Pearson-Hartley charts. As shown below, appropriate nomograms that display treatment power contours for suitably stated experimental situations can actually be produced. Figures 2 and 3 show two instances.
Figure 2 Contours of Iso-Detection Power (1 − β) as a function of part replication p and repeated measurements n (α = 0.05, Dt = 0.5, σ = 0.89)
Figure 3 Typical Escalation of Detection Power (1 − β) when n and p are raised (α = 0.05, Dt = 0.5, σ = 0.89)
We note that the treatment power (1 − β_treatment) determined in this procedure uses the observed error variability for σ², which may be estimated from Table 4 (see Duncan (1974), page 612). σ² includes measurement variability among other factors, but not the effect of B or of parts replication. Given an experiment's target detectability Dt, a large σ² would require higher values of n and/or p—an optimization issue to be resolved by cost or other considerations (see page 617, Duncan (1974)). Other approaches that could inspire comparable "mapping" of a two-way (or larger) data table onto the one-way (or correspondingly smaller) data table incorporating replication of parts and multiple measurement of each part may be found in Fisher (1928, page 654), Patnaik (1949, pages 202-232), Pearson and Hartley (1972, pages 66-74), Fisher (1966), Dean and Voss (1999, pages 33-56, 168) and Montgomery (2007, pages 175-196).
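As an illustration of how contour plots like Figures 2 and 3 can be generated, the following sketch computes a (p, n) grid of treatment-detection powers with Patnaik's approximation. The error degrees of freedom b·p(n − 1) used here is an assumption consistent with the worked example in Table 5, and the settings mirror those quoted for the two figures.

```python
# Grid of treatment-detection powers over part replication p and repeats n.
import numpy as np
from scipy.stats import f

def patnaik_power(nu1, nu2, lam, alpha=0.05):
    k = (nu1 + lam) / nu1
    nu = (nu1 + lam) ** 2 / (nu1 + 2 * lam)
    return f.sf(f.isf(alpha, nu1, nu2) / k, nu, nu2)

def one_step_power(b, p, n, Dt, sigma, alpha=0.05):
    phi2 = n * p * Dt ** 2 / (2 * b * sigma ** 2)
    lam = phi2 * b                                 # lambda = Phi^2 (nu1 + 1), nu1 = b - 1
    return patnaik_power(b - 1, b * p * (n - 1), lam, alpha)

b, Dt, sigma = 2, 0.5, 0.89                        # settings quoted for Figures 2 and 3
grid = np.array([[one_step_power(b, p, n, Dt, sigma)
                  for n in range(2, 11)] for p in range(2, 11)])
print(np.round(grid, 2))                           # rows: p = 2..10, columns: n = 2..10
```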
Validation of the Method

For a quick validation we compare sample size determination based on the Pearson-Hartley curves (Pearson and Hartley 1972) to the method using Patnaik's approximation for the noncentral F (i.e., F′), for a fixed effect one-factor experiment. Here the two approaches both attempt to determine the number of process samples (parts or experimental units) that must be used in each treatment to satisfy a pre-specified significance (α) and a target power (1 − β). The milk yield problem of Gill (1968) involved comparison of the daily milk yield of Holstein and Brown Swiss cows (two treatments, or a = 2), for whom the standard deviation of milk production (among cows of each breed) was 4.5 kg, while the test specified a detectable difference Dt equal to 3 kg. Significance was 0.05, while the number of cows to be tested had to meet target powers (1 − β) equal to 0.8 and 0.5 respectively. No mention was made in the problem of any imperfection in measurement. For the target power of 0.8, Patnaik's approximation for F′ produced a sample size of 38 to compare Holsteins with Brown Swiss, while Gill reported a size of 37 using plots of the power curves. For the target power of 0.5 the computed sample size was 19, while Gill reported 18. For another two types compared—Guernsey and Jersey—the standard deviation was 3.3 kg. For a target detectability (Dt) of 3 kg, α = 0.05 and power specified at 0.8, the sample size computed via F′ was 21, while Gill determined it to be about 20 using the power curves.
This author has also exactly reproduced three β risks shown on page 190 of Montgomery (1997). To be noted is that these are only three illustrative situations. Patnaik's approximation, using built-in statistical functions such as F.DIST.RT(Q22, Q17, Q11) available in Excel®, can deliver virtually any risk estimate at short notice, given n, a, b, D² and σ² (in Montgomery's notation). In the following section we illustrate a real-life application of the noncentral F approximation in an industrial statistical experiment involving sample size determination to compare two machining technologies.
A Practical Application of Sample Size Calculation by the Noncentral F

We use a problem of sample size selection for a manufacturing process improvement project that involved a single-factor fixed effects statistical design. Precision ball bearings (Figure 4), in order to deliver trouble-free operation for tens of thousands of hours unattended, require the execution of very precise manufacturing processes. Typical among these are the bearings made by SKF, Dynaroll, and others who make a wide variety of bearings classified as ABEC models 1 to 9. These models steadily increase in precision, spanning applications from roller skates to supporting rotating shafts in modern jets and spacecraft. A plant producing the ABEC 5 class of bearings was incurring about 1% internal failure losses and felt that it had poor control of its existing ring machining process. A Six Sigma DMAIC (Pyzdek 2009) intervention was planned to compare two different ring machining methods (a = 2) by measuring their relative impact on dimensional accuracy.
Figure 4 The parts of a ball bearing
The characteristic chosen for the designed experiments was the outer diameter (OD) of the bearing's inner ring, with a diameter of 18 mm and a tolerance of ± 0.004 mm, or ± 4 micron. The prevailing Cpk was 1.5 with the process centered, which led to an estimate of the observed standard deviation of the ring machining process of 0.89 micron. To conduct the experiment, the significance (α) was set at 0.05 and a power of 0.9 for detecting a mean shift (Dt) of 2 micron was targeted. The problem thus became that of determining how many rings must be sampled from the process to give the one-factor DOE the desired power. The observed ring size data were assumed to include measurement errors and the effect of all factors other than treatment and replications. The application of the methodology based on the noncentral F led to a replication number p = 3 with two (n = 2) repeat measurements, yielding a power of 0.90. A replication of 2 with n = 2 indicated a projected power of 0.66. At this point, since the required sample size turned out to be small, a gauge R&R study was initiated to check whether measurements introduced significant variability into the observed data.
Table 5 Power Calculations using the Noncentral F Approximation

Quantity                       Value       Remarks
Dt                             2           micron
σ                              0.89        micron
α                              0.05
1 − α                          0.95
a                              2           two technologies (treatments)
p                              3           process replicates (units)
ν1                             1           a − 1
ν2                             6           error degrees of freedom, ap(n − 1)
Φ²                             7.574801    noncentrality quantity, npDt²/(2aσ²)
Φ                              2.752236
λ                              15.1496     Φ²(ν1 + 1)
k                              16.1496     (ν1 + λ)/ν1
ν                              8.332789    (ν1 + λ)²/(ν1 + 2λ)
F1−α(ν1, ν2)                   5.987378    = F.INV.RT(α, ν1, ν2) from Excel®
(ν1/(ν1 + λ)) F1−α(ν1, ν2)     0.370745
1 − βtreatment                 0.902344    power of detecting the treatment effect
Power computation steps based on the noncentral F approximation method are shown in Table 5; all computations used Excel®. Figures 2 and 3 display the sensitivity of the detection power (1 − β) to the replication number (p) and the repeat measurements (n) for a case where Dt was set at 0.5 micron.
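The Table 5 computation can equally be scripted. The sketch below uses scipy in place of Excel's F.INV.RT and F.DIST.RT, with the same inputs (and ν2 = 6 taken directly from the table), and should reproduce the tabulated values to rounding.

```python
# Reproducing the Table 5 power calculation for the bearing-ring experiment.
from scipy.stats import f

Dt, sigma, alpha = 2.0, 0.89, 0.05
a, p, n = 2, 3, 2
nu1, nu2 = a - 1, 6                              # nu2 = 6 as listed in Table 5

phi2 = n * p * Dt ** 2 / (2 * a * sigma ** 2)    # 7.5748
lam = phi2 * (nu1 + 1)                           # 15.1496
k = (nu1 + lam) / nu1                            # 16.1496
nu = (nu1 + lam) ** 2 / (nu1 + 2 * lam)          # 8.3328

f_crit = f.isf(alpha, nu1, nu2)                  # 5.9874, as F.INV.RT(0.05, 1, 6)
power = f.sf(f_crit / k, nu, nu2)                # about 0.902
print(round(phi2, 4), round(lam, 4), round(k, 4), round(nu, 4),
      round(f_crit, 4), round(power, 4))
```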
Sensitivity of the 1-Way ANOVA's Detection Power to Design Parameters

A practical difficulty in using the power charts that depict the OC curves of ANOVA is conducting elaborate sensitivity studies, for the overall power of ANOVA depends on several different factors, parameters and conditions. These jointly control the extent to which an investigator may manipulate the operating variables and work out the most sensible economic and statistical tradeoffs in reaching a given detection power. But such a facility becomes immediately accessible when one uses Patnaik's approximation. This is illustrated as follows. We saw above that the computations delivered rather small sample sizes when Dt was set at 2 micron. The impact of tightening Dt can also be estimated in the bearings case, without invoking the power charts. The new candidate sample size (p, n) combinations may be evaluated computationally and thus quickly—Table 6 shows some selected what-if results using α = 0.05, σ = 0.89 micron and Dt = 0.5 micron. Such numbers permit optimization of the sample sizes based on the cost of replication vs. the cost of additional measurements.

Table 6 What-if Evaluation of Power (1 − β) based on the Noncentral F Approximation Formulas

                      Repeat measurements (n)
Replicates (p)   2     3     4     5     6     7     8     9     10
2                0.10  0.14  0.17  0.21  0.23  0.26  0.32  0.35  0.38
3                0.13  0.19  0.23  0.27  0.35  0.39  0.43  0.47  0.56
4                0.16  0.23  0.32  0.38  0.43  0.48  0.58  0.62  0.65
5                0.20  0.27  0.38  0.45  0.55  0.61  0.65  0.74  0.78
6                0.22  0.35  0.43  0.55  0.62  0.72  0.76  0.84  0.86
7                0.25  0.39  0.48  0.61  0.72  0.77  0.85  0.88  0.92
8                0.31  0.43  0.58  0.65  0.76  0.85  0.88  0.93  0.96
9                0.34  0.47  0.62  0.74  0.84  0.88  0.93  0.96  0.97
10               0.37  0.55  0.65  0.78  0.86  0.92  0.96  0.97  0.98
Factors that the investigator generally has influence over in order to raise the one-way ANOVA's detection power are α (the significance level of the ANOVA), σ² (the error variance), the number of treatments (a), the number of replicates (p), the number of measurements (n) taken per unit sample experimentally produced, and Dt—the desired shift detection target. These six factors are linked nonlinearly to each other and to power through the quantities and expressions shown in Table 5. To conduct a sensitivity study, we set up an L8 orthogonal array experiment (Phadke (1989), page 286) with the goal of discovering only the relative main effects of these factors on 1 − β. All six factor effects interact, hence we used Linear Graph (2) of the L8 to assign α to column 1, p to column 2, n to column 4, and Dt to column 7, and arbitrarily assigned a to column 5 and σ to column 3. The two-level working ranges were 0.01 and 0.10 for α, 0.5 and 2.0 for σ, 2 and 5 for a, 2 and 5 for p, 2 and 5 for n, and 0.5 and 2.0 for Dt. The power 1 − β was the "experimentally observed" (computed) response in each L8 row experiment. Results of the OA-aided sensitivity test (shown in Figure 5) indicate that the value of each of the six DOE design factors—α, σ, a, p, n, and the required detectability Dt—influences the treatment detection power (1 − β). What is apparent from Figure 5 is that (1 − β) is strongly impacted by the background process and measurement variability (σ)—the smaller the better. (We have excluded the interaction plots here.) This inference agrees with one's expectation. Power is also influenced by the number of parts replicated (p) at each treatment and the number of measurements made per part (n)—in both cases the higher the better. Thus the results highlight that a high error variance, few repeat measurements, and a small detectability requirement are all detrimental to detection power. The plots incidentally also show that when the significance level of the ANOVA (α) is reduced (0.05 to 0.01), the detection power goes down—a well-known phenomenon in tests of hypothesis, negating reservations about the construction of Table 5. Such a sensitivity evaluation would be quite challenging if one used only the power curves.
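A computational sketch of such a sensitivity sweep is given below. It departs from the text in two explicitly assumed ways: a full 2^6 grid is used instead of the L8 orthogonal array, and the error degrees of freedom are taken as a·p(n − 1), consistent with Table 5; the factor levels are those quoted above.

```python
# Main effects of the six design factors on treatment-detection power.
from itertools import product
from scipy.stats import f

def patnaik_power(nu1, nu2, lam, alpha):
    k = (nu1 + lam) / nu1
    nu = (nu1 + lam) ** 2 / (nu1 + 2 * lam)
    return f.sf(f.isf(alpha, nu1, nu2) / k, nu, nu2)

def det_power(alpha, sigma, a, p, n, Dt):
    phi2 = n * p * Dt ** 2 / (2 * a * sigma ** 2)
    return patnaik_power(a - 1, a * p * (n - 1), phi2 * a, alpha)

levels = {"alpha": (0.01, 0.10), "sigma": (0.5, 2.0), "a": (2, 5),
          "p": (2, 5), "n": (2, 5), "Dt": (0.5, 2.0)}
runs = [dict(zip(levels, combo)) for combo in product(*levels.values())]
powers = [det_power(**r) for r in runs]

for name in levels:   # main effect: mean power at the high level minus at the low level
    hi = [pw for r, pw in zip(runs, powers) if r[name] == levels[name][1]]
    lo = [pw for r, pw in zip(runs, powers) if r[name] == levels[name][0]]
    print(name, round(sum(hi) / len(hi) - sum(lo) / len(lo), 3))
```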
Figure 5 Effects of DOE and ANOVA settings on Overall Power (1 − β)
Contributions and Limitations of this work

Fisher's (1928) and Tang's (1938) contributions in the form of analytical results for the noncentral χ² and F distributions led Patnaik (1949) to develop approximations for some of these distributions so that power estimation in ANOVA would become computationally tractable. The same results inspired the development of OC curves for ANOVA, the most celebrated being the Pearson-Hartley (1972) charts and tables, which remain the mainstay in planning statistical experiments today. The charts are graphical and, given the relevant parameter values, help one estimate the detection power of the experiments being contemplated. The literature has also gained much from the development of useful noncentrality parameters. One finds these in most leading texts on DOE, including Duncan (1974) and Montgomery (2007). The challenge, nonetheless, remains in physically invoking the power curves in the charts, interpreting some of which frustrates even a sharp eye. As a result, many shortcuts are taken in practice. In particular, sensitivity studies with DOE decision variables are often bypassed or compromised.
This study has attempted to test the efficacy of some classical results in the statistical literature, primarily the work of Patnaik (1949) on the noncentral F distribution and the Pearson-Hartley tables, in aiding the sound conduct of statistical experiments. The study used the single-factor one-way ANOVA procedure to demonstrate a method for speedily determining the replicates required and the need for multiple measurements of each part produced, so as to minimize the chances of drawing wrong conclusions from such experiments. It uses treatment detection power to assess the need to establish a process's natural variability and the capability of the measurement system applied to measure the response (the outcome) of the experiments. This work has explored the computational suitability, in sample sizing, of some very rich research results, some over six decades old. In particular, Patnaik's approximations have been appraised here for their value and efficacy. The experience with these approximations is impressively expedient. In fact, certain tasks have been attempted here for which the direct use of power curves is exasperating. Admittedly, this study used modern computing power, making many of the manipulations practicable. In this light, this author asserts that there may be many such gems tucked away in our dusty libraries, the tendency today being "reduce your strains…search the internet to locate some 'acceptable method' to do the job." We suggest that we should pause and use the same zeal to attack problems of the kind that, for instance, led Fisher and Hartley to develop their original power curve charts. Given a handy tool that can be sharpened and kept in good use, one would be impelled to stretch the application of innovative results to areas that remain challenging. This work addressed one such case. Converted into computer code, Patnaik's approximation to the noncentral F was hugely successful in very quickly yielding power numbers that matched published data in texts, and it also helped in elaborate parametric sensitivity studies. This work has also hinted (see Figures 2 and 3) at the possible production of some new and useful nomograms to help practitioners pick sample sizes in some reasonable scenarios. A crude example is sketched in Figure 6. This work has also extended the utility of accessible analytical work where one's tendency would otherwise be to accept conservative answers or, worse, ignore the issue.
As for its limitations, we ignored the interaction of factor effects in several places. Incorporating interaction could improve the results. Also, we worked only with the one-way ANOVA problem. This work, therefore, is perhaps a beginning.
Figure 6 Overall Power (1 − β) Contours with α = 0.05 and D/σ = 1 (number of process samples or parts vs. number of measurements per process sample)
References

AIAG (2002). Measurement System Analysis, 3rd ed., Automotive Industry Action Group, Automotive Division, ASQ, Milwaukee.
Dean, Angela and Voss, Daniel (1999). Design and Analysis of Experiments, Springer.
Donner, A (1984). Approaches to Sample Size Estimation in the Design of Clinical Studies, Statistics in Medicine, Vol 3, 199-214.
Duncan, A J (1974). Quality Control and Industrial Statistics, R D Irwin Inc.
Fisher, R A (1928). Proceedings of the Royal Statistical Society A, 121, 654.
Fisher, R A (1966). The Design of Experiments, 8th ed., Hafner.
Gill, J L (1960). Sample Size for Experiments on Milk Yield, J. Dairy Science, 984-988.
Gryna, F M, Chua, R C H and Defeo, J (2007). Juran's Quality Planning & Analysis for Enterprise Quality, 5th ed., Tata McGraw-Hill.
Hoenig, J M and Heisey, D M (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis, The American Statistician, Vol 55, 1, 19-24.
Juran, J M and Gryna, F M (1993). Quality Planning and Analysis, 3rd ed., McGraw-Hill.
Lenth, R V (2001). Some Practical Guidelines for Effective Sample Size Determination, The American Statistician, Vol 55, 3, 187-193.
Montgomery, D C (2007). Design and Analysis of Experiments, 5th ed., Wiley.
Montgomery, D C (2012). Introduction to Statistical Quality Control, 4th ed., Wiley.
Patnaik, P B (1949). The Noncentral Chi-Square and F-Distributions and their Applications, Biometrika, 36, 202-232.
Pearson, E S and Hartley, H O (1972). Biometrika Tables for Statisticians, Vol 2, Cambridge University Press.
Pearson, E S and Tiku, M L (1970). Some Notes on the Relationship between the Distributions of Central and Noncentral F, Biometrika, 57, 175-179.
Phadke, Madhav S (1989). Quality Engineering using Robust Design, Prentice-Hall.
Pyzdek, Thomas (2009). The Six Sigma Handbook, 3rd ed., McGraw-Hill.
Simanek, D E (1996). Error Analysis (Non-Calculus), http://www.lhup.edu/~dsimanek/errors.htm
Taguchi, Genichi, Wu, Y and Chowdhury, S (2004). Taguchi's Quality Engineering Handbook, John Wiley.
Tang, P C (1938). Statistical Research Memoirs, 2, 126.
Tiku, M L (1966). A Note on Approximating the Noncentral F Distribution, Biometrika, 53, 415-427.
Tiku, M L (1967). Tables of the Power of the F-test, Journal of the American Statistical Association, 62, 525.
Wheeler, D J (2009). An Honest Gauge R&R Study, Manuscript No. 189, 2006 ASQ/ASA Fall Technical Conference.