ECOTOXICOLOGY AND ENVIRONMENTAL SAFETY 39, 78-97 (1998)
ENVIRONMENTAL RESEARCH, SECTION B. ARTICLE NO. ES971619

REVIEW

Applications of Computer-Intensive Statistical Methods to Environmental Research

Douglas G. Pitt(1) and David P. Kreutzweiser

Canadian Forest Service, 1219 Queen Street East, Sault Ste. Marie, Ontario, Canada P6A 5M7

Received September 12, 1997

Table of Contents

Abstract
1. Introduction
2. Review of Conventional Statistical Approaches
3. Computer-Intensive Methods
   3.1. Randomization or Permutation
   3.2. Resampling or Bootstrapping
   3.3. Simulation
4. Illustration of Methods
   4.1. Simple Treatment Comparisons
   4.2. Count Data
   4.3. Design Complexities
5. Discussion
   5.1. Random Sampling
   5.2. Normality
   5.3. Homogeneity of Variance
   5.4. Test Statistics and Experimental Designs
   5.5. Caveats
6. Conclusions
7. Appendix: Resampling Stats, Programs 1-7
References

Abstract

Conventional statistical approaches rely heavily on the properties of the central limit theorem to bridge the gap between the characteristics of a sample and some theoretical sampling distribution. Problems associated with nonrandom sampling, unknown population distributions, heterogeneous variances, small sample sizes, and missing data jeopardize the assumptions of such approaches and cast skepticism on conclusions. Conventional nonparametric alternatives offer freedom from distribution assumptions, but design limitations and loss of power can be serious drawbacks. With the data-processing capacity of today's computers, a new dimension of distribution-free statistical methods has evolved that addresses many of the limitations of conventional parametric and nonparametric methods. Computer-intensive statistical methods involve reshuffling, resampling, or simulating a data set thousands of times to empirically define a sampling distribution for a chosen test statistic. The only assumption necessary for valid results is the random assignment of experimental units to the test groups or treatments. Application to a real data set illustrates the advantages of these methods, including freedom from distribution assumptions without loss of power, complete choice over test statistics, easy adaptation to design complexities and missing data, and considerable intuitive appeal. The illustrations also reveal that computer-intensive methods can be more time consuming than conventional methods, and the amount of computer code required to orchestrate reshuffling, resampling, or simulation procedures can be appreciable. © 1998 Academic Press

(1) To whom correspondence should be addressed. Fax: 705-759-5700. E-mail: [email protected].



1. INTRODUCTION

Experimental data and their statistical properties provide the foundation for virtually all inferences in environmental research. The benthologist counts aquatic insects to evaluate the effects of a pollutant; the wildlife biologist counts woody debris to characterize habitat structure; the silviculturist measures the growth of crop trees to quantify the effects of an experimental site preparation method. Statistical methods are then used to objectively assign probabilities to hypotheses of interest.

Conventional parametric statistical methods involve the comparison of an observed test statistic to some known distribution, such as F, t, or χ², under the strong assumption that the statistic follows such a distribution. However, data associated with field experiments can be plagued by one or more problems that seriously complicate their interpretation by such statistical methods. These problems, rarely mutually exclusive, include nonrandom sampling, samples from populations of unknown distributions, heterogeneous variances, small sample sizes, and missing data. With the sample sizes typically available, underlying statistical assumptions are frequently difficult to test, and the inferential errors that follow from their violation can rarely be identified.

Nonparametric statistical methods free the researcher from assumptions relating to the distribution of a test statistic (Conover, 1980). Conventional nonparametric methods include alternatives to the parametric t test (Mann-Whitney or Wilcoxon rank-sum test), analysis of variance (ANOVA) (Kruskal-Wallis test for completely randomized designs and Friedman test for randomized block designs), and regression (rank correlation and nonparametric linear regression), as well as methods for dealing with binomial and count data (binomial test and χ² goodness-of-fit test or test for independence). The freedom from distribution assumptions offered by these tests does, however, come at a price. Of common concern is the loss of power (ability to detect treatment differences) and information that necessarily results from reduction of the data to a set of ranks (Potvin and Roff, 1993). Further, textbook methods and computer packages for nonparametric tests are available only for relatively simple designs. More complex designs, such as split-plot, nested, repeated-measures, and complex regression designs, do not have obvious nonparametric equivalents. Procedures for coping with missing data are also not clearly defined.

Methods for binomial and count (discrete) data are generally quite powerful, but they require subtle and critical assumptions that are often not considered or understood. Most importantly, each element being counted is assumed to have been independently and randomly selected from a given population. This assumption is easily met in biomedical experiments, for example, where a drug is administered to a random selection of individual subjects. However, in


environmental experiments, treatments are often applied to clusters of organisms (e.g., cages of insects or fish, plots of trees). Counts of organisms exhibiting a response characteristic within such a cluster are not independent, and direct application of a goodness-of-fit test or test of independence to such data can seriously distort the interpretation of results (Hurlbert, 1984). A further complication is that experimental treatments are usually applied to replicate clusters. This arrangement leads to a series of "replicate" contingency tables for which there is no obvious method of analysis. Conventional analysis methods also do not accommodate small expected cell frequencies (i.e., values less than 1, or 20% of the cells less than 5). This necessitates combining classes, at the sacrifice of information and statistical power.

Fisher (1935) first suggested that a null hypothesis be tested by determining how frequently an observed test statistic is equaled or exceeded in a list of all possible orderings of a data set. The concept evolved into Fisher's exact test, but it was never really extended to more complex designs and applications until recent growth in the power and availability of the desktop computer made tedious reorderings and repeat calculations practical. With the modern computer, Fisher's ideas have evolved into a new dimension of nonparametric, distribution-free statistical techniques often referred to as computer-intensive methods. These methods involve reshuffling, resampling, or simulating a data set thousands of times to empirically obtain a sampling distribution for a test statistic. The empirical sampling distribution can then be used to generate confidence intervals and/or determine how far an observed statistic deviates from its expected value under a null hypothesis.

Computer-intensive methods offer several distinct advantages over conventional nonparametric statistical methods. These advantages include freedom from distribution assumptions with little or no loss of power, use of conventional test statistics (such as t and F) or any more intuitive test statistic that the researcher may contrive, adaptation to virtually any experimental design, and straightforward accounting of missing data. Also, computer-intensive methods are based on a logical foundation that is easily understood by nonstatisticians, suggesting their value as teaching and communication aids. Despite these advantages, computer-intensive methods are not presented in many of the popular statistical reference texts (e.g., Conover, 1980; Draper and Smith, 1981; Hicks, 1982; Johnson and Wichern, 1988; Neter et al., 1990; Winer et al., 1991; Milliken and Johnson, 1992), and they are only briefly mentioned in a few (e.g., Steel and Torrie, 1980; Sokal and Rohlf, 1981; Montgomery, 1991). Moreover, these methods are rarely covered in statistics courses. It is perhaps not surprising, therefore, that computer-intensive methods are seldom used as a means of determining significance in the literature.


The objective of this review is to increase awareness of these methods by providing an overview, illustration, and appraisal of the steps involved in their application to experimental data. Experimental situations and data taken from forestry-related environmental research projects are illustrated, complete with Resampling Stats (Resampling Stats Inc.) code. The authors hope that this paper will serve as a computer-intensive methods "primer" for environmental researchers, many of whom lack the time and formal statistical training to consult some of the original works on this subject (e.g., Fisher, 1935; Pitman, 1937; Kempthorne, 1952; Bradley, 1968; Efron, 1979a; Edgington, 1980; Manly, 1991).

2. REVIEW OF CONVENTIONAL STATISTICAL APPROACHES

Before detailing computer-intensive methods of data analysis, it is useful to briefly review conventional statistical approaches through a simple example. Consider the objective of estimating the mean of some population with unknown characteristics [in this case, an exponential distribution with mean (μ) and standard deviation (σ) equal to 10]. A random sample of size 30 is drawn; the resulting classified data (rounded to the nearest whole number) are illustrated in Fig. 1A. The sample mean (x̄) is 9.3047; the sample standard deviation (s) is 6.3445. Despite a relatively large sample size, the nonnormality of the underlying population could not be detected in this case (P = 0.1096; Shapiro-Wilk statistic; Shapiro and Wilk, 1965).

The sample mean provides an unbiased estimate of the population mean, but it is not definitive, and different samples would yield different estimates. Similarly, the sample standard deviation approximates the population standard deviation. The uncertainty associated with x̄ depends on the population variation, estimated by s², and the size of the sample on which it is based (n). Combining both of these factors, the standard error of the sample (s(x̄) = s/√n = 1.16) estimates the standard deviation of the distribution of sample means that might be expected if additional samples of size 30 could be drawn.

To illustrate, 5000 additional samples of size 30 drawn from the test population yield the distribution of sample means displayed in Fig. 1B. This distribution closely approximates a normal distribution, with 68.26% of the sample means falling within ±1.839 units (±1 standard deviation) of the overall mean (10.003). Both the overall mean and standard deviation of this sampling distribution are very close to their expected values: μ = 10 and σ/√n = 10/√30 = 1.826. That this holds despite a nonnormal underlying distribution illustrates a statistical property known as the central limit theorem: the distribution of sample means will tend to be normal regardless of the form of the distribution from which the samples were drawn (Solomon, 1987). The larger the sample size, n, the better the approximation.

However, without the benefit of additional samples, one must rely on the theoretical normality of the distribution of sample means to express uncertainty in the single estimate (9.30). The assumption is that approximately 95% of future samples drawn from the test population would fall within ±1.96 × 1.16 (i.e., the 2.5 and 97.5 percentiles of the standard normal distribution × the SE of the sample) = ±2.27 units of this single estimate. However, because both μ and σ have been estimated to make this statement, allowance is typically made by referencing the more conservative t distribution. With a sample size of 30, 95% of future sample means can be expected to fall within ±2.045 × 1.16 [i.e., the 2.5 and 97.5 percentiles of the t distribution (n − 1 = 29) × the SE of the sample] = ±2.37 units of the estimated mean. In the case of the 5000 future samples illustrated in Fig. 1B, the range of values between 6.9 and 11.7 (x̄ ± 2.4) encompasses only 3957/5000, or 79%, of the distribution of sample means. This is less than the expected 95% because both the population mean and variance were slightly underestimated by this particular sample.

FIG. 1. (A) Random sample of size 30 drawn from an exponentially distributed population with mean and standard deviation equal to 10. The sample mean and standard deviation are 9.3047 and 6.3445, respectively. (B) The means of 5000 additional random samples of size 30 are approximately normally distributed.
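The interval arithmetic above is easy to reproduce. The short Python sketch below is an illustrative translation (not part of the original article); SciPy's t quantile stands in for the tabulated percentile:

import numpy as np
from scipy import stats

n = 30
xbar, s = 9.3047, 6.3445           # sample mean and SD reported above
se = s / np.sqrt(n)                # standard error of the mean, ~1.16

# 97.5th percentile of t with n - 1 = 29 df, ~2.045
t_crit = stats.t.ppf(0.975, df=n - 1)
half_width = t_crit * se           # ~2.37 units

print(f"SE = {se:.2f}; 95% CI = {xbar:.2f} +/- {half_width:.2f}")

The same calculation with the normal quantile 1.96 in place of t_crit gives the ±2.27 figure.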


In summary, most conventional statistical approaches rely heavily on the properties of the central limit theorem to bridge the gap between the characteristics of an observed sample and a theoretical sampling distribution. Small sample sizes drawn from nonnormal populations may generate misleading results. In contrast, the following computer-intensive methods allow sampling distributions to be generated empirically from sample data. Significance levels for the chosen test statistics are then interpreted directly from these distributions.

3. COMPUTER-INTENSIVE METHODS

Computer-intensive methods can generally be divided into three main categories based on the way in which samples are generated to derive an empirical sampling distribution. Methods that involve reshuffling or reassigning data (without replacement) to a set of experimental treatments are called randomization or permutation tests. Without replacement, each observation in the original data set occurs only once in the randomized data set. When reassignment is done with replacement, the method is referred to as resampling or bootstrapping. With replacement, each observation in the original data set has equal probability of occurring in each position of the bootstrapped data set. If a model exists for the experimental situation, then future samples may be generated or simulated using the model (sometimes called Monte Carlo methods).

3.1. Randomization or Permutation

Under a typical null hypothesis, samples drawn from the population(s) of interest should not reveal any pattern, except purely by chance. In other words, reordering or "reshuffling" the data should yield results similar to those observed, if the null hypothesis is true. If all possible permutations of the data are listed, the "extremeness" of the observed result may be determined by inspection (i.e., its likelihood of occurring by random chance alone, if the null hypothesis is true). Such randomization or permutation methods are suitable for testing hypotheses and determining confidence limits for parameters. When the entire randomization distribution is enumerated, the method may be referred to as an "exact randomization test" (Sokal and Rohlf, 1981).

As an example, consider samples A and B (Table 1), assumed to be drawn from the same population under the null hypothesis. The original sample, A = (1, 2, 3) and B = (4, 5), gives x̄A = 2.0 and x̄B = 4.5, a difference of 2.5. There are nine additional ways that these five observations can be reassigned to the two samples without replacement (ignoring repeat combinations), all equally likely under the null hypothesis of no difference between A and B. Using the absolute difference between sample means as a test statistic, 2.50 is the most extreme value, occurring with the observed sample and with one other randomization. At


a 2 in 10 chance (20% probability) of being wrong, one would probably not favor an alternative hypothesis that the two samples were drawn from different populations. However, if it were initially hypothesized that B is greater than A, the significance level would fall to 10%, and one might begin to question the validity of the null hypothesis.

Although computers can be relied on to perform tedious randomizations, it may not be possible or practical to enumerate all possible data orders in a randomization test, since the number of required permutations increases dramatically with sample size. Rather, it is usually preferable to "sample" the randomization distribution by repeating the random assignment of observations to the samples several thousand times (N). This method is sometimes referred to as a "sampled randomization test" (Sokal and Rohlf, 1981). Throughout the process, the number of cases (n) in which the difference between means equals or exceeds the observed value can be counted. Including the observed result in n is justified, since the observed data arrangement represents just one of the equally likely possibilities contained in the randomization distribution. The probability, P, that the observed value was obtained by chance is then computed as n/N. Generally, 5000 randomizations are sufficient to arrive at a conclusion similar to what would be obtained if all permutations had been considered (Potvin and Roff, 1993; Manly, 1997). In 5000 random assignments of the data in Table 1, 985 produced an absolute difference greater than or equal to 2.5 (P = 0.197 ≈ 0.2) and 485 produced a difference between B and A greater than or equal to 2.5 (P = 0.097 ≈ 0.1) (Appendix, Program 1).

According to Simon (1992), the randomization method is generally appropriate in situations involving finite populations (i.e., populations containing a limited number of elements, such that the selection of one element changes the probability of selecting another). However, other texts on the subject (e.g., Edgington, 1980; Manly, 1991, 1997) do not make this distinction.

A modification of the randomization test, the "jackknife," involves computing a test statistic many successive times, each with a different observation or group of observations removed from the data set. Like the randomization method, the average and variability of the resulting estimates from the jackknife can be used to draw inferences about a statistic with unknown distribution. For further discussion of the jackknife procedure, see Sokal and Rohlf (1981), Potvin and Roff (1993), or Manly (1997).
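A sampled randomization test of this kind takes only a few lines in a general-purpose language. The following Python sketch parallels Program 1 in the Appendix (illustrative code, not the authors'; the seed is arbitrary), and run-to-run counts will vary slightly around the values quoted above:

import numpy as np

rng = np.random.default_rng(1)
data = np.array([1, 2, 3, 4, 5])          # pooled samples A = (1,2,3), B = (4,5)
obs = data[3:].mean() - data[:3].mean()   # observed difference, 4.5 - 2.0 = 2.5

n_two, n_one, N = 0, 0, 5000
for _ in range(N):
    rng.shuffle(data)                     # reassign observations without replacement
    diff = data[3:].mean() - data[:3].mean()
    n_two += abs(diff) >= obs             # two-sided: |difference| >= 2.5
    n_one += diff >= obs                  # one-sided: B - A >= 2.5

print(n_two / N, n_one / N)               # ~0.2 and ~0.1, as in the text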


TABLE 1
10 Possible Randomizations of Two Samples of Sizes 3 and 2

                   Sample A    Sample B    x̄A      x̄B      x̄B − x̄A
Original sample    1  2  3     4  5        2.00     4.50      2.50
Permutation 1      1  2  4     3  5        2.33     4.00      1.67
Permutation 2      1  2  5     3  4        2.67     3.50      0.83
Permutation 3      1  3  4     2  5        2.67     3.50      0.83
Permutation 4      1  3  5     2  4        3.00     3.00      0.00
Permutation 5      1  4  5     2  3        3.33     2.50     −0.83
Permutation 6      2  3  4     1  5        3.00     3.00      0.00
Permutation 7      2  3  5     1  4        3.33     2.50     −0.83
Permutation 8      2  4  5     1  3        3.67     2.00     −1.67
Permutation 9      3  4  5     1  2        4.00     1.50     −2.50

In most environmental experiments, the underlying population(s) can be considered infinite (i.e., containing an unlimited number of elements, such that the selection of one element does not change the probability of selecting another). In such cases, the random assignment of data to the samples in a randomization test may be made with replacement. This technique has been coined the "bootstrap" (Efron, 1979b). The philosophy behind the bootstrap is that, in the absence of additional samples, the existing sample (which is the best proxy for the population) can be expanded to provide a base from which to draw additional samples. Returning to the example in Table 1, sampling with replacement from the five original sample elements might yield the assignment A = (5, 1, 1) and B = (3, 5). In contrast to randomization, sampling with replacement may result in a given observation occurring in more than one position in the bootstrapped sample. A bootstrap test would involve generating, say, 5000 such samples and evaluating P from the results. Applied to the data in Table 1, 5000 bootstrap randomizations resulted in P = 0.055 for an absolute difference greater than or equal to 2.5 and 0.029 for B > A (replacing the "take" commands with "sample" in Program 1, Appendix, achieves this analysis).

Notably, these P values are approximately half the size of those previously obtained with the randomization procedure. The reason is that the small number of observations in the sample generates only 10 possible outcomes when the data are randomly assigned to the treatments without replacement (Table 1). This means that the P values obtained by the randomization method for the above tests cannot be smaller than 2/10 and 1/10, respectively. Resampling the data with replacement leads to many more possibilities and, consequently, a more precise estimate of the P values. Such differences between the two methods are exaggerated by small data sets, particularly if there are several repeat observations present. Of course, bootstrapping offers a solution to this problem only if the underlying populations can be considered infinite. The theory supports the bootstrap method with large sample sizes (Efron, 1982; Diaconis and Efron, 1983), but the definition of "large" is usually not clear, and some empirical verification may be necessary (Manly, 1991).

As a further example, the means of 5000 samples of size 30, bootstrapped from the original sample depicted in Fig. 1A, are displayed in Fig. 2. The 2.5 and 97.5 percentiles of this distribution are 7.01 and 11.59, which conform closely to the limits of the 95% confidence interval previously calculated using the t distribution (6.93 and 11.67).

With bootstrapping, it is important that resampled data sets conform to the underlying experimental design. In the above example (data, Table 1), the bootstrap test was based on three observations for A and two for B, as per the original sample. Generating larger sample sizes for A and/or B would be analogous to pseudoreplication.

FIG. 2. Frequency of sample means in 5000 bootstrap samples of size 30 taken from the original sample depicted in Fig. 1A. The 95% confidence interval (CI) for the mean is 9.30 ± 2.29, determined from the bootstrap sampling distribution. For comparison, the 95% CI obtained by reference to the t distribution is 9.30 ± 2.37. The approximate position of the true sampling distribution of means (Fig. 1B) is shown by the solid gray line.
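The percentile interval in Fig. 2 can be sketched as follows in Python (illustrative only; because the original 30 observations are not tabulated, a stand-in sample from the same exponential population is drawn first):

import numpy as np

rng = np.random.default_rng(7)
# Stand-in for the Fig. 1A sample (the original 30 values are not tabulated):
sample = rng.exponential(scale=10, size=30)

# 5000 resamples of size 30, drawn with replacement from the sample itself
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(5000)])

lo, hi = np.percentile(boot_means, [2.5, 97.5])   # percentile bootstrap 95% CI
print(f"mean = {sample.mean():.2f}; bootstrap 95% CI = ({lo:.2f}, {hi:.2f})")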

3.3. Simulation

Occasionally, it may be possible to construct a model for how the observed data arose, under the premise of the null hypothesis. The model can then be used to "simulate" or generate a distribution of the chosen statistic. Testing can proceed by comparing the observed statistic with values of the simulated distribution. A simple example of this was illustrated above when the sampling distribution of means displayed in Fig. 1B was obtained by "simulating" random sampling from an exponentially distributed population with μ and σ = 10. As with bootstrapping, it is critical that simulated data conform to the underlying experimental design. Generating larger data sets than the original experiment will typically lead to tighter confidence intervals and higher rejection rates than are warranted.
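As a sketch of the simulation approach (again illustrative Python, assuming the exponential model described above), the Fig. 1B distribution can be regenerated and an observed statistic referred to it:

import numpy as np

rng = np.random.default_rng(11)
# Model under the null: exponential population with mean = SD = 10
sim_means = rng.exponential(scale=10, size=(5000, 30)).mean(axis=1)

print(sim_means.mean(), sim_means.std(ddof=1))  # ~10 and ~1.826 (= 10/sqrt(30))

# An observed statistic can then be referred to this simulated distribution,
# e.g. the fraction of simulated means at or below the observed 9.3047:
print(np.mean(sim_means <= 9.3047))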


4. ILLUSTRATION OF METHODS

To more thoroughly illustrate the advantages, limitations, and application of these methods, consider the data summarized in Table 2. These data are subsets of actual responses taken from aquatic microcosm experiments to determine the effects of candidate forest insecticides on aquatic insect survival. Experimental configuration and operation are described by Kreutzweiser (1997). Briefly, 5-liter microcosms containing natural surface water and bottom substrates were randomly assigned one of three treatments: (1) control (no insecticide added), (2) insecticide A (Treat-A), (3) insecticide B (Treat-B). Each treatment was replicated five times (a total of 15 microcosms), and each microcosm contained 10 test insects. At the end of a 28-day posttreatment observation period, the microcosms were disassembled and the number of surviving insects in each microcosm was recorded.

Assume, at this point, that the data originate from a completely randomized experimental design. The null hypothesis is that the treatments have no effect on insect survival (i.e., the three samples have been drawn from the same population). If this null hypothesis is rejected, then comparisons of each of the two insecticide treatments with the control are also of interest to the experimenter. This data set reflects problems frequently encountered in environmental and ecotoxicological experiments: small sample sizes, questionable normality and homogeneity of variance, and noncontinuous data (i.e., discrete counts or "quantal" data).

TABLE 2
Data from Aquatic Microcosms Treated with Two Experimental Forest Insecticides and from Control Microcosms (Number of Aquatic Insects Surviving out of 10)

Replicate             Control    Treatment A    Treatment B
1                        10          10              9
2                         9           9              7
3                         8           8              5
4                         8           8              5
5                         7           8              1
Total                    42          43             27
Mean                    8.4         8.6            5.4
Standard deviation     1.14        0.89           2.97

4.1. Simple Treatment Comparisons

Parametric ANOVA results for the data in Table 2 are summarized in Table 3. Bartlett's test (Montgomery, 1991) suggests that the assumption of homogeneity of variance may be violated with the untransformed data (P = 0.0526). However, given the small sample sizes, the noncontinuity of the data, and the fact that the residuals from the parametric ANOVA may not follow a normal distribution (P = 0.0926; Shapiro-Wilk statistic), the results from Bartlett's test are inconclusive. A common approach to analyzing toxicity screening data such as these is to perform ANOVA on the arcsin-transformed square-root percent survival data (e.g., Kemble et al., 1994; Henry et al., 1994). This transformation results in a marginal loss of statistical power (Table 3), but the assumptions of homogeneity of variance (P = 0.5585) and normality (P = 0.2168) appear to have been satisfied. Nevertheless, the noncontinuity of the data should still cast doubt on the validity of this analysis (i.e., an observation can take on only one of 11 possible values). A rank transformation of the data offers the greatest freedom from the assumptions of conventional methods, but results in further loss of statistical power (Table 3).

If the F statistic is still used as the basis for comparing treatments, then 5000 random assignments of the 15 data points to the three treatments (without replacement) lead to the sampling distributions depicted in Fig. 3 (Appendix, Program 2). An overall treatment F value as large as or larger than that observed (4.42) was encountered 139 of the 5000 times, providing a P value of 0.0278 (Fig. 3A). Similarly, P values for the contrasts of interest were 0.8634 (4317/5000) for Treat-A versus control and 0.0212 (106/5000) for Treat-B versus control (Figs. 3B and 3C, respectively).

TABLE 3
ANOVA Results for Experimental Data Summarized in Table 2 (a)

Method:              Conventional       Conventional         Conventional       Randomization (b)   Bootstrap
Transformation:      None               Arcsin (%)^(1/2)     Ranks              None                None

Source               df    F      P          F      P            F      P            P                  P
Treatments            2   4.42   0.0364     3.69   0.0562       3.22   0.0759       0.0278             0.0246
Treat-A vs control    1   0.03   0.8710     0.02   0.8911       0.08   0.7759       0.8634             0.8870
Treat-B vs control    1   6.19   0.0285     5.20   0.0416       4.15   0.0642       0.0212             0.0230

(a) Error df = 12 in each case.
(b) See Appendix, Program 2.


FIG. 3. Frequency of F values in 5000 randomizations of the data in Table 2: (A) for overall treatment effects, (B) Treat-A versus control, and (C) Treat-B versus control. The original sample had F values of 4.42, 0.03, and 6.19 for (A), (B), and (C), respectively.

These significance levels are very close to those obtained with parametric ANOVA of the untransformed values (Table 3). In this particular case, power equivalent to the parametric test was achieved without having to make any assumptions about data form or distribution.
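A Python sketch of this randomization ANOVA, paralleling Program 2 in the Appendix (illustrative code; the bootstrap version discussed next would replace the shuffle with sampling with replacement), follows:

import numpy as np

rng = np.random.default_rng(3)
groups = [np.array([10, 9, 8, 8, 7]),   # control
          np.array([10, 9, 8, 8, 8]),   # Treat-A
          np.array([9, 7, 5, 5, 1])]    # Treat-B

def f_stat(gs):
    """One-way ANOVA F statistic."""
    all_ = np.concatenate(gs)
    grand = all_.mean()
    ss_tr = sum(len(g) * (g.mean() - grand) ** 2 for g in gs)
    ss_e = sum(((g - g.mean()) ** 2).sum() for g in gs)
    df_tr, df_e = len(gs) - 1, len(all_) - len(gs)
    return (ss_tr / df_tr) / (ss_e / df_e)

obs = f_stat(groups)                    # 4.42 for the Table 2 data
pooled = np.concatenate(groups)
count = 0
for _ in range(5000):
    rng.shuffle(pooled)                 # reassign all 15 values to treatments
    count += f_stat([pooled[:5], pooled[5:10], pooled[10:]]) >= obs

print(obs, count / 5000)                # P ~ 0.028 (cf. Table 3)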

Application of the bootstrap method with an F statistic involves replacing the "take" commands in Program 2 (Appendix) with the "sample" command, which invokes sampling with replacement from the pooled data. The results of 5000 bootstrap samples are virtually identical to those obtained with the randomization method: P = 0.0246 for overall treatment differences, 0.8870 for Treat-A versus control,


and 0.0230 for Treat-B versus control. In contrast to the example data set in Table 1, the mesocosm data offer a sufficient number of randomization outcomes to permit precise P-value estimation.

In studies involving more than two treatments, establishing planned contrasts within the framework of the computer-intensive method is straightforward, as this example illustrates. When these contrasts are not orthogonal, a desired experimentwise error rate (e.g., 0.05) may be maintained by adjusting the critical value of individual comparisons downward, according to the Dunn-Šidák correction (Ury, 1976). In the above example, one would use

α′ = 1 − (1 − α)^(1/k) = 1 − (1 − 0.05)^(1/2) = 0.0253    (1)

for k = 2 comparisons, and declare Treat-B different from the control. With a little extra programming effort, computer-intensive unplanned comparisons, or multiple range tests, may still be conducted (see Edgington, 1980; Petrondas and Gabriel, 1983; Edwards and Berry, 1987).

With computer-intensive methods, an ANOVA-type analysis need not be limited to use of the conventional F statistic. The simple difference between treatment means or sums can be used instead. If treatment-induced variation is the focus of attention rather than differences in location (means), response variables such as ln(S) or a signal-to-noise ratio [e.g., ln(ȳ²/S²)] can be used to form a statistic that may be testable only by computer-intensive methods.

Similarly, flexibility over the choice of test statistic can be used to advantage in testing for homogeneity of variance. The randomization or bootstrap method may be used with Bartlett's test statistic (Montgomery, 1991),

Q = (N − a) ln Sp² − Σ (ni − 1) ln Si²,    (2)

where the sum runs over the i = 1, . . . , a treatment groups, N = the total sample size (15), a = the number of treatments being compared (3), and Sp² and Si² are the pooled and individual sample variances, respectively. In 5000 randomizations of the data in Table 2, the observed Q value of 6.6259 was equaled or exceeded 1274 times (P = 0.2548). The same number of bootstrap samples resulted in a P value of 0.2410 (Fig. 4; Appendix, Program 3). These significance levels deviate considerably from the parametric P value of 0.0526 presented earlier. In light of the sensitivity of Bartlett's test to normality, continuity, and sample size, the randomization or bootstrap significance level can be considered more reliable when deviant conditions prevail (Manly, 1991).

FIG. 4. Frequency of Bartlett’s test statistic in 5000 bootstrap samples of the data in Table 2. The observed value of 6.6259 is equaled or exceeded 1205 times.
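Equation (2) translates directly into code; the sketch below (illustrative Python paralleling Program 3, bootstrap version) adds a small constant to the variances to guard against zero-variance resamples, in the spirit of the 0.0000001 adjustments used in Programs 4 and 5:

import numpy as np

rng = np.random.default_rng(5)
groups = [np.array([10, 9, 8, 8, 7]),
          np.array([10, 9, 8, 8, 8]),
          np.array([9, 7, 5, 5, 1])]

def bartlett_q(gs):
    """Bartlett's statistic, Eq. (2)."""
    n = np.array([len(g) for g in gs])
    s2 = np.array([g.var(ddof=1) for g in gs]) + 1e-7   # guard for zero variances
    N, a = n.sum(), len(gs)
    s2_pooled = ((n - 1) * s2).sum() / (N - a)
    return (N - a) * np.log(s2_pooled) - ((n - 1) * np.log(s2)).sum()

obs = bartlett_q(groups)                 # 6.6259 for the Table 2 data
pooled = np.concatenate(groups)
count = 0
for _ in range(5000):
    resampled = rng.choice(pooled, size=15, replace=True)   # with replacement
    count += bartlett_q([resampled[:5], resampled[5:10], resampled[10:]]) >= obs

print(obs, count / 5000)                 # P ~ 0.24, far from the parametric 0.05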

4.2. Count Data

Since the counts provided in Table 2 represent replicate treatments, one might be tempted to add the replicates together to arrive at the 3 × 2 contingency table outlined in Table 4. A test of independence leads to an overall G statistic (Sokal and Rohlf, 1981) of 16.333. Compared with a χ² distribution with 2 df, this result is extremely unlikely (P = 0.0003) by random chance alone, and one would reject the hypothesis that survival is independent of treatment. However, the 10 insects counted in each microcosm are not independent, as this test assumes, and treating the data in this manner is pseudoreplication (Hurlbert, 1984). In fact, unless the treatments are assigned at random to individual insects (something that may be logistically difficult), these data cannot be analyzed using the method just applied.

TABLE 4
Data from Aquatic Microcosms (Table 2) Arranged in a 3 × 2 Contingency Table

Treatment    Alive    Dead    Total
Control        42       8       50
Treat-A        43       7       50
Treat-B        27      23       50
Total         112      38      150

However, the researcher can continue to employ the contingency table arrangement (Table 4) by taking advantage of the flexibility over test statistics offered by computer-intensive methods. The question becomes: How frequently is the observed G statistic equaled or exceeded when the data are randomly assigned to the treatments? This differs from the question asked under conventional analysis: How frequently is the observed G statistic equaled or exceeded in a χ² distribution with 2 df? In a randomization test, the G statistic is being used no differently from F, the difference between treatment sums, or any other test statistic that is


some function of the sample data. Focus is on the distribution of G values rather than their magnitude. When the magnitude of the G statistic is compared with a χ² distribution, the assumption of independent observations at the insect level must be made. With a randomization test, 146 of 5000 arrangements led to a G ≥ 16.333 (P = 0.0292; Appendix, Program 4), a result that compares with that of the F statistic obtained previously (0.0278, Table 3). A similar analysis with the bootstrap approach led to P = 0.0454. Contrasts of interest could be conducted by computing G statistics for various partitions of the original contingency table (Everitt, 1977) and then adjusting the critical value downward using Eq. (1).

To illustrate the simulation method, assume that the data in Table 4 were derived by randomly assigning treatments to individual insects. Conventional contingency table analysis would then be correct, and the significance level would be 0.0003 when referenced to the χ² distribution. Under the null hypothesis that survival is independent of treatment, the probability of any one insect surviving is 112/150, or 74.67%, based on the observed sample (Table 4). Using this very simple model for survival [P(survival) = 0.7467], one could repeatedly generate three samples of 50 insects (some live and some dead) and compute a G statistic for each set. The proportion of times these G statistics are ≥ 16.333 estimates the significance level for the null hypothesis. In 5000 such simulations (Appendix, Program 5), only 2 gave a result as extreme as that observed (P = 0.0004), which concurs with the previous comparison with the χ² distribution. Although the computer-intensive method offers no particular advantage in this case, it would have been the only alternative if one or more of the cells in Table 4 had expected values of less than 5.
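The G statistic and the simulation test just described can be sketched as follows (illustrative Python paralleling Programs 4 and 5; a randomization version would instead reshuffle the 15 microcosm counts, as in the earlier examples):

import numpy as np

rng = np.random.default_rng(9)

def g_stat(table):
    """G statistic for a two-way contingency table (Sokal and Rohlf, 1981)."""
    t = np.asarray(table, dtype=float) + 1e-7   # guard for zero cells
    n = t.sum()
    return 2 * ((t * np.log(t)).sum()
                - (t.sum(axis=1) * np.log(t.sum(axis=1))).sum()
                - (t.sum(axis=0) * np.log(t.sum(axis=0))).sum()
                + n * np.log(n))

obs = g_stat([[42, 8], [43, 7], [27, 23]])       # 16.333 for Table 4

# Simulation under H0 (Program 5): each of 3 x 50 insects survives
# independently with probability 112/150.
count = 0
for _ in range(5000):
    alive = rng.binomial(50, 112 / 150, size=3)  # live counts per treatment
    count += g_stat(np.column_stack([alive, 50 - alive])) >= obs

print(obs, count / 5000)                         # cf. the 2/5000 reported above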

4.3. Design Complexities

Frequently, the randomization of experimental treatments is restricted to some "blocking" factor. In the example, the replicates could have been arranged spatially or temporally (e.g., each at a different location on a laboratory bench, or each run on a different day). Increased precision may then be realized by removing variation attributable to blocks from the error term. Complexities such as this and other restrictions on randomization are easily accounted for in computer-intensive methods, particularly when conventional test statistics are used. For example, assume that the replicates in Table 2 represent a blocking factor. An F statistic leads to the significance levels summarized in Table 5.

TABLE 5
Significance Levels for Aquatic Microcosm Data (Table 2), Assuming a Randomized Complete Block Design, as Determined via Conventional Parametric ANOVA and Computer-Intensive Methods (a)

                         Conventional          Randomization (b)   Bootstrap
Source                 df     F        P             P                 P
Treatments              2   10.83    0.0053        0.0006            0.0028
Blocks                  4    5.35    0.0215        0.0134            0.0184
Treat-A vs control      1    0.07    0.8017        0.8154            0.8096
Treat-B vs control      1   15.17    0.0046        0.0014            0.0032

(a) Error df = 8 in all cases.
(b) See Appendix, Program 6.

In this case, an appreciable amount of the variation among microcosms treated alike is explained by blocking, and the test for treatments is much more powerful than it was under a completely randomized design (Table 3). Again, the computer-intensive methods (randomization and bootstrapping) offer the same or greater power than the conventional parametric ANOVA, without the assumptions that accompany the latter. Program 6 (Appendix) is a modification of Program 2 that accommodates a randomized complete block design.

Since these are quantal data, there may be a need to maintain the form of a contingency table, in which case the example leads to a three-way table consisting of treatments × survival × blocks. The conventional approach to analyzing data in three- and higher-order contingency tables involves log-linear models (Sokal and Rohlf, 1981). Such a model for the example would consist of

ln f̂ijk = μ + Ti + Sj + Bk + TSij + SBjk + TBik + TSBijk,    (3)

where f̂ijk is the expected frequency in treatment (T) i, survival level (S) j, and block (B) k. Although conceptually analogous to the parametric linear model for continuous data, focus in the log-linear model is on specific interactions, namely, TS and SB (just as focus was on the TS interaction in the two-way analysis, Table 4). The remaining one-, two-, and three-way tables are not particularly useful. Like the two-way test of independence discussed previously, however, each count (ijk) is assumed to comprise independently and randomly selected individuals from the test population. Strict application of a log-linear model to the data in Table 2 would violate this assumption and would not be correct.

Once again, though, one can appeal to the flexibility over test statistics offered by computer-intensive methods to solve this problem. Essentially, computation of G statistics for both the T × S and S × B tables through random assignment of the data (with or without replacement) can indicate how often the observed values of 16.333 and 19.229 are simultaneously exceeded. In other words, it was previously determined that G(T × S) ≥ 16.333 occurred 146 of 5000


times when blocking was not accounted for in the model; how many times is this value exceeded when blocking accounts for as much variation as it did [G(T × S) ≥ 16.333 | G(S × B) ≥ 19.229]? In 5000 randomizations (Appendix, Program 7), these two values were exceeded simultaneously only once (P = 0.0002). In 5000 bootstrap samples, 36 produced values as extreme as those observed (P = 0.0072). Both results generally concur with those listed in Table 5.
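The blocked F analysis of Table 5 can be sketched in the same style (illustrative Python paralleling Program 6, which reassigns all 15 observations across the treatment-by-block cells):

import numpy as np

rng = np.random.default_rng(13)
# Rows = blocks (replicates 1-5); columns = control, Treat-A, Treat-B
data = np.array([[10, 10, 9],
                 [ 9,  9, 7],
                 [ 8,  8, 5],
                 [ 8,  8, 5],
                 [ 7,  8, 1]], dtype=float)

def f_treat(d):
    """Treatment F from a randomized complete block ANOVA."""
    grand = d.mean()
    ss_tr = d.shape[0] * ((d.mean(axis=0) - grand) ** 2).sum()
    ss_bl = d.shape[1] * ((d.mean(axis=1) - grand) ** 2).sum()
    ss_e = ((d - grand) ** 2).sum() - ss_tr - ss_bl
    df_tr = d.shape[1] - 1
    df_e = df_tr * (d.shape[0] - 1)
    return (ss_tr / df_tr) / (ss_e / df_e)

obs = f_treat(data)                     # 10.83, as in Table 5
flat = data.ravel().copy()
count = 0
for _ in range(5000):
    rng.shuffle(flat)                   # reassign all 15 values, as in Program 6
    count += f_treat(flat.reshape(5, 3)) >= obs

print(obs, count / 5000)                # P ~ 0.0006 (cf. Table 5)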

5. DISCUSSION

The random assignment of experimental units to test treatments is the only assumption necessary for the validity of statistical significance acquired through computer-intensive methods; assumptions regarding random sampling, normality, homogeneity of variance, and equal sample sizes are not needed (Edgington, 1980). However, when these assumptions are satisfied, computer-intensive methods perform similarly to conventional tests (Manly et al., 1986). It has therefore been argued that the statistical significance levels derived through computer-intensive methods are always correct and that corresponding parametric alternatives are valid only to the extent that they arrive at similar statistical conclusions (Bradley, 1968).

5.1. Random Sampling

Random sampling is a prerequisite to the validity of virtually all conventional parametric and nonparametric statistical tests, but it has been argued that true random sampling is actually absent from most experimental work (e.g., Keppel, 1973; Kirk, 1978). For example, researchers typically choose a "representative" stand of trees, stream, or set of surrogate test units with which to conduct an experiment, often because it is the most available. The ultimate intention is to extrapolate experimental results to broader areas of interest, but this must be justified by nonstatistical arguments, because true random sampling from the population of interest is difficult or impossible. Strictly speaking, conventional parametric statistical tables are not valid for nonrandom samples (Edgington, 1980). However, statistical inference about the experimental units can be made with computer-intensive methods, provided there has been a random assignment of the units to the treatments.

When sampling is not random, assumptions about normality or other characteristics of the population are meaningless, and a distribution-free test is warranted. Conventional nonparametric approaches achieve distribution-free status through a rank transformation of the data, which discards information and results in a commensurate loss of statistical power. Computer-intensive methods, on the other hand, free any statistical test (simple or complex) of distribution assumptions. The statistical power of conventional test


statistics (e.g., t or F) can often be preserved, without the need for data transformations (Simon, 1992).

5.2. Normality

It is widely accepted that conventional parametric t and F tests are quite robust to departures from normality. In fact, Kempthorne (1952) and Montgomery (1991) justify the robustness of t and F tests to nonnormal distributions by virtue of the fact that they provide good approximations to the randomization test! However, the negative effects of such violations can be strongly exacerbated by small sample sizes and/or missing data. With conventional parametric analyses, small sample sizes prevent the adequate testing of model assumptions, and unequal sample sizes can distort results and create serious computational complexities (Milliken and Johnson, 1992). In contrast, small sample sizes and/or missing data do not affect the validity of computer-intensive methods. This is not to condone the use of small samples, since larger samples will always be more informative and lead to greater statistical power, regardless of the method used to determine significance. Computer-intensive methods simply permit valid analysis of a small data set, without regard to assumptions that may be difficult or impossible to test. Similarly, missing data are simply accounted for in the shuffling, resampling, or simulation algorithm by creating samples equivalent to their original size (e.g., data, Table 1).

5.3. Homogeneity of Variance

Homogeneous treatment variances are essential for the validity of conventional parametric and nonparametric tests. However, nonnormality and/or small sample sizes often preclude the adequate testing of this assumption. Bartlett's test, commonly used for testing homogeneity of variance, is particularly sensitive to these deficiencies (Montgomery, 1991). Shortfalls in the detection of heterogeneous variances may be overcome through the use of a computer-intensive method and a test statistic such as Bartlett's (e.g., Manly, 1991) or simple variance ratios (e.g., Sokal and Rohlf, 1981). Although homogeneity of variance is not a requirement for the validity of results generated by computer-intensive methods (Edgington, 1980), significance may be altered through data transformation. In other words, treatment differences may be evident at some scales of measurement and not at others. Therefore, interpretation of an analysis must reference the scale of measurement used (Manly, 1991).

5.4. Test Statistics and Experimental Designs

As suggested by the illustrations in this review, one of the main advantages of computer-intensive methods is


complete choice over the test statistic used. In the simplest cases, the difference between sample means or medians can be used. Sums of squares and t, F, r, or χ² statistics can also be computed to obtain tests that are analogous to the conventional parametric methods [see Edgington (1980) for equivalent test statistics]. Standard deviations or variances can be used as statistics to test homogeneity of variance assumptions (as described above) or to study treatment-related effects on variance. Using computer-intensive methods, researchers have the freedom to devise statistical tests that better suit their situation than conventional statistics (Manly, 1997). The same freedom can be extended to experimental designs as well, since computer-intensive methods can be tailored to accommodate highly specialized randomization procedures. The methods presented here can be extended to applications in repeated-measures analysis (Moulton and Zeger, 1989; Manly, 1997), regression and correlation (Edgington, 1980; Sparks and Rothery, 1996; Magnussen and Burgess, 1997; Manly, 1997), multivariate statistics (Edgington, 1980; Manly et al., 1986; Manly, 1997), time series (Manly, 1997), and sampling (Schreuder et al., 1993).

5.5. Caveats

On the other hand, computer-intensive methods generate valid results only if properly applied and conducted. Selection of a meaningful test statistic and a full understanding of the underlying experimental design (randomization process) are critical first steps in determining statistical significance by reshuffling, resampling, or simulating a data set. One should not rely on computer-intensive methods to mitigate problems associated with poor data (e.g., those containing bias and/or lacking independence) (Sparks and Rothery, 1996), or to artificially inflate sample sizes and, consequently, experimental precision.

Unlike classical methods, there are no established steps or "recipes" to follow with computer-intensive methods. Whereas reference to "PROC ANOVA" in SAS is virtually all that is required to conduct a parametric one-way ANOVA, computation of an equivalent test statistic within a computer-intensive algorithm requires programming to compute all of the necessary mean squares and can be quite time consuming. Existing software, such as Resampling Stats and StatXact, "packages" many of the necessary routines, such as comparing and counting values, shuffling, randomizing, and graphing, but randomization within the framework of the underlying experimental design and the calculation of the chosen test statistic are typically the user's responsibilities (although this may change with future software development). Complex test statistics and experimental designs can complicate the formulation of the reshuffling, resampling, or simulation algorithms, and computational error checking is essential. The algorithm

developed to perform the computations should always be tested on the observed data for which the value of the test statistic is known.

6. CONCLUSIONS

Conventional statistical theory is founded largely on an understanding of a family of key distributions, centered on the bell-shaped or normal curve. At the root of this theory, statistical problems are simplified by making certain assumptions (often unverifiable) about the sampling distribution of a data set and comparing observed results with what is expected under the assumed distribution. Fisher's early foray into randomization testing suggested that the empirical definition of a sampling distribution for a test statistic would obviate assumptions about the underlying distribution. Unfortunately, the sheer labor intensity of his approach made it impractical in its time. His efforts are, however, evidence that the fathers of conventional statistical doctrine might have chosen a very different foundation on which to build their theories, had computers been available. As Efron (1979a) suggests, computers have redefined the manner in which a statistical problem can be "simplified." Computer-intensive methods are thus influencing the development of current statistical theory.

While virtually any statistical problem can be solved using computer-intensive methods, there are currently no advantages (time or otherwise) in applying them to problems meeting the assumptions of conventional parametric statistical analysis. It is likely to be a number of years before computer-intensive methods are packaged in software that is as efficient and easy to use as that available for conventional parametric analysis. However, the limitations of conventional nonparametric approaches (e.g., loss of power, design limitations) may outweigh any investment required to tackle the problem with computer-intensive methods. Thus, whenever data problems or design complexities jeopardize the validity or application of conventional parametric or nonparametric tests, computer-intensive methods are strongly recommended. The advantages of these methods over conventional approaches have been illustrated herein with real data. The Resampling Stats programs offered in the Appendix are designed to help the reader begin to apply these methods in practice.

As Potvin and Roff (1993) admonish, computer-intensive methods are "not a panacea." However, given their considerable intuitive appeal and our rapid advancement into the computer age, dramatic increases in the popularity of these methods are likely in the near future. In environmental studies involving parameter estimation and hypothesis testing, researchers will likely benefit from having computer-intensive methods among their kit of statistical tools.


7. APPENDIX

Program 1

The following Resampling Stats program conducts a simple randomization test on two samples of unequal size.

maxsize default 5000     'Create space for vectors of 5000 elements.
data (1 2 3) A           'Read in data for sample A.
data (4 5) B             'Read in data for sample B.
concat A B C             'Concatenate samples A and B, place in C.
repeat 5000              'Run 5000 randomizations.
shuffle C D              'Shuffle the elements of C, place them in D.
take D 1,3 E             'Take the first 3 elements of D and assign them to E.
take D 4,5 F             'Take the last 2 elements of D and assign them to F.
mean E EE                'Compute means of the two vectors.(2)
mean F FF
subtract FF EE G         'Compute the difference between means, store in G.
score G H                'Keep track of the differences, store them in H.
end                      'Loop.
histogram H              'Construct a histogram of results.
count H >= 2.5 I1        'Count all results >= 2.5 or <= -2.5.(2)
count H <= -2.5 I2
add I1 I2 I
divide I 5000 P          'Compute P-value.
print P                  'Print results.

(2) When working with discrete data, always check the vector of results to ensure that rounding errors are not causing critical values to be overlooked.

Program 2

The following Resampling Stats program conducts a simple one-way ANOVA on the mesocosm data listed in Table 2; significance is determined by a randomization test.

maxsize default 5000          'Set maximum vector size to 5000.
data (10 9 8 8 7) T1_R        'Input the survival values for each treatment;
data (10 9 8 8 8) T2_R        'T1 = control, T2 = Treat-A, and T3 = Treat-B (Table 2).
data (9 7 5 5 1) T3_R
data (2) DFTR                 'Input the d.f. treatments (3 - 1).
data (12) DFE                 'Input the d.f. error (3(5 - 1)).
size T1_R N_T1                'Compute the sample size for Treatment 1.
size T2_R N_T2                'Compute the sample size for Treatment 2.
size T3_R N_T3                'Compute the sample size for Treatment 3.
add N_T1 N_T2 N_T3 N          'Compute the total sample size.
concat T1_R T2_R T3_R ALL_R   'Then concatenate the raw data, place in ALL_R.
repeat 5000                   'Run 5000 randomizations.
shuffle ALL_R RAND            'Shuffle the data.
take RAND 1,5 T1              'Take the first 5 and assign to TRT1.(2)
take RAND 6,10 T2
take RAND 11,15 T3
concat T1 T2 T3 ALL           'Concatenate the data.
sum ALL GTOT                  'Compute the correction term, CT.
square GTOT GTOTsq
divide GTOTsq N CT
square ALL ALLsq              'Compute the total sum of squares, SST.
sum ALLsq UCSST
subtract UCSST CT SST
sum T1 sumT1                  'Compute the treatment sum of squares, SSTR.
sum T2 sumT2
sum T3 sumT3
square sumT1 sumT1sq
square sumT2 sumT2sq
square sumT3 sumT3sq
divide sumT1sq N_T1 SS1
divide sumT2sq N_T2 SS2
divide sumT3sq N_T3 SS3
add SS1 SS2 SS3 UCTSS
subtract UCTSS CT SSTR
concat T2 T1 T12              'Contrast T1 vs T2, SSC1.
sum T12 C1
add N_T1 N_T2 N_T12
square C1 C1sq
divide C1sq N_T12 CTC1
add SS1 SS2 UCC1SS
subtract UCC1SS CTC1 SSC1
concat T1 T3 T13              'Contrast T1 vs T3, SSC2.
sum T13 C2
add N_T1 N_T3 N_T13
square C2 C2sq
divide C2sq N_T13 CTC2
add SS1 SS3 UCC2SS
subtract UCC2SS CTC2 SSC2
subtract SST SSTR SSE         'Compute the error sum of squares, SSE.
divide SSTR DFTR MSTR         'Compute mean squares, MSTR and MSE.
divide SSE DFE MSE
divide MSTR MSE F_TR          'Compute F values.
divide SSC1 MSE F_C1
divide SSC2 MSE F_C2
score F_TR DIST_TR            'Keep running tally of F values.
score F_C1 DIST_FC1
score F_C2 DIST_FC2
end
count DIST_TR >= 4.42 C_TR    'Count F values as extreme as those observed (Table 3).
divide C_TR 5000 PTR          'Compute P-value.
print C_TR PTR
count DIST_FC1 >= 0.03 C_FC1  'Do the same for each contrast.
divide C_FC1 5000 PC1
print C_FC1 PC1
count DIST_FC2 >= 6.19 C_FC2
divide C_FC2 5000 PC2
print C_FC2 PC2
histogram DIST_TR DIST_FC1 DIST_FC2   'Plot frequency distributions of F values.

Program 3

The following Resampling Stats program conducts a test for homogeneity of variance of the mesocosm data listed in Table 2; significance is determined by the bootstrap method.

maxsize default 5000
data (10 9 8 8 7) TRT1_R      'Input the survival values for each treatment (Table 2).
data (10 9 8 8 8) TRT2_R
data (9 7 5 5 1) TRT3_R
data (4) Q1                   'Input value for (ni - 1).
                              'Note: this program assumes equal sample sizes.
data (3) Q2                   'Input a, the number of treatments.
data (12) Q3                  'Input value of (N - a).
concat TRT1_R TRT2_R TRT3_R ALL_R   'Concatenate the raw data.
repeat 5000                   'Run 5000 randomizations.
shuffle ALL_R ALL             'Shuffle the data.
sample 5 ALL TRT1             'Sample 5 observations, with replacement, and assign to TRT1.
sample 5 ALL TRT2
sample 5 ALL TRT3
variance TRT1 S1sq            'Compute sample variances for each treatment group.
variance TRT2 S2sq
variance TRT3 S3sq
log S1sq LNS1sq               'Compute the natural log of the sample variances.
log S2sq LNS2sq
log S3sq LNS3sq
add LNS1sq LNS2sq LNS3sq LNSIsq
multiply LNSIsq Q1 B          'Multiply by (ni - 1).
add S1sq S2sq S3sq UCSPsq     'Add sample variances.
divide UCSPsq Q2 SPsq         'Divide by a to obtain the pooled variance (equal n).
log SPsq LNSPsq
multiply Q3 LNSPsq A          'Multiply by (N - a).
subtract A B C                'Compute the Bartlett statistic (Eq. (2)).
score C Z                     'Keep track of the values.
end
histogram Z                   'Plot histogram of values.
count Z >= 6.6259 Q           'Count the number equaling or exceeding the observed value.
divide Q 5000 P               'Calculate P-value.
print Q P

Program 4

The following Resampling Stats program conducts a test of independence on the mesocosm data listed in Table 4; significance is determined by a randomization test. See Sokal and Rohlf (1981, p. 696) for a discussion of the G statistic used in this test.

maxsize default 5000
data (10 9 8 8 7) LIVE1_R             'Input the survival values for each treatment (Table 2).
data (10 9 8 8 8) LIVE2_R
data (9 7 5 5 1) LIVE3_R
data (50) ROWT                        'Input row totals (i.e., 5 × 10).
concat LIVE1_R LIVE2_R LIVE3_R ALL_R  'Concatenate survival counts.
repeat 5000                           'Run 5000 randomizations.
shuffle ALL_R ALL                     'Shuffle the data.
take ALL 1,5 LIVE1                    'Take the first 5 for TRT1,
take ALL 6,10 LIVE2                   'the second 5 for TRT2, etc.
take ALL 11,15 LIVE3
sum LIVE1 L1                          'Sum the number of live in each TRT.
sum LIVE2 L2
sum LIVE3 L3
concat L1 L2 L3 LIVER                 'Vectorize the total live counts.
subtract ROWT LIVER DEADR             'Compute vector of dead counts.
add 0.0000001 DEADR DEAD              'Adjust 0 values so logarithms can be taken.
add 0.0000001 LIVER LIVE
sum LIVE TOTL                         'Compute column totals.
sum DEAD TOTD
add TOTL TOTD N                       'Compute grand total.
log LIVE lnLIVE                       'Compute logs of all values.
log DEAD lnDEAD
log TOTL lnTOTL
log TOTD lnTOTD
log ROWT lnROWT
log N lnN
multiply LIVE lnLIVE LlnL             'Compute f × ln(f) values for all.
sum LlnL Q1
multiply DEAD lnDEAD DlnD
sum DlnD Q2
multiply TOTL lnTOTL Q3
multiply TOTD lnTOTD Q4
multiply ROWT lnROWT Q5
multiply N lnN Q6
add Q1 Q2 Q7
add Q3 Q4 Q5 Q5 Q5 Q8
subtract Q7 Q8 Q9
add Q9 Q6 Q10
multiply 2 Q10 G                      'Compute G statistic.
score G Z
end
histogram Z                           'Plot frequency distribution of Gs.
count Z >= 16.333 C                   'Count G values as extreme as that observed.
divide C 5000 P                       'Compute p-value.
print P
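A compact Python sketch (ours, not part of the original appendix) of the same randomization G test; the small constant added before taking logarithms mirrors the program's own zero adjustment:

import numpy as np

counts = np.array([10, 9, 8, 8, 7, 10, 9, 8, 8, 8, 9, 7, 5, 5, 1], float)
ROW = 50.0                                # each treatment row totals 5 × 10 insects

def g_stat(live):                         # live: total survivors per treatment
    dead = ROW - live
    cells = np.concatenate([live, dead]) + 1e-7   # same log(0) guard as the program
    cols = np.array([live.sum(), dead.sum()]) + 1e-7
    n = 3 * ROW
    return 2.0 * (np.sum(cells * np.log(cells))
                  - 3 * ROW * np.log(ROW)         # three identical row totals
                  - np.sum(cols * np.log(cols))
                  + n * np.log(n))

rng = np.random.default_rng(1)
g_obs = 16.333                            # observed G from the text
hits = 0
for _ in range(5000):
    s = rng.permutation(counts)
    live = np.array([s[:5].sum(), s[5:10].sum(), s[10:].sum()])
    hits += g_stat(live) >= g_obs
print(hits / 5000)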

Program 5

The following Resampling Stats program uses simulation and the data in Table 4 to generate a sampling distribution for the G statistic (see Sokal and Rohlf, 1981, p. 696).

maxsize default 5000
data (50) ROWT                        'Input row totals (i.e., 5 × 10).
data (112) LIVEO                      'Input no. of live in all samples.
data (150) N                          'Input total sample size.
repeat 5000                           'Run 5000 randomizations.
generate ROWT 1,N TRT1                'Generate a sample of 50 from values 1 through 150 and assign to each of the EUs.
generate ROWT 1,N TRT2
generate ROWT 1,N TRT3
count TRT1 between 1 LIVEO L1         'Count the number of live bugs in each EU; i.e., 1 to 112 = live, >112 = dead.
count TRT2 between 1 LIVEO L2
count TRT3 between 1 LIVEO L3
concat L1 L2 L3 LIVER                 'Vectorize the total live counts.
subtract ROWT LIVER DEADR             'Compute vector of dead counts.
add 0.0000001 DEADR DEAD              'Adjust 0 values so that logarithms can be taken.
add 0.0000001 LIVER LIVE
sum LIVE TOTL                         'Compute column totals.
sum DEAD TOTD
add TOTL TOTD N                       'Compute grand total.
log LIVE lnLIVE                       'Compute logs of all values.
log DEAD lnDEAD
log TOTL lnTOTL
log TOTD lnTOTD
log ROWT lnROWT
log N lnN
multiply LIVE lnLIVE LlnL             'Compute f × ln(f) values for all.
sum LlnL Q1
multiply DEAD lnDEAD DlnD
sum DlnD Q2
multiply TOTL lnTOTL Q3
multiply TOTD lnTOTD Q4
multiply ROWT lnROWT Q5
multiply N lnN Q6
add Q1 Q2 Q7
add Q3 Q4 Q5 Q5 Q5 Q8
subtract Q7 Q8 Q9
add Q9 Q6 Q10
multiply 2 Q10 G                      'Compute G statistic.
score G Z
end
histogram Z                           'Plot frequency distribution of Gs.
count Z >= 16.333 C                   'Count G values as extreme as that observed.
divide C 5000 P                       'Compute p-value.
print P
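The simulation variant differs from Program 4 only in how the null tables are produced: each treatment's 50 individuals are drawn independently from a pool of 150 in which 112 survive. A Python sketch (ours, not part of the original appendix) of that step:

import numpy as np

N_TOTAL, N_LIVE, PER_TRT = 150, 112, 50   # totals from Table 4

def g_stat(live):                         # same statistic as the Program 4 sketch
    dead = PER_TRT - live
    cells = np.concatenate([live, dead]) + 1e-7
    cols = np.array([live.sum(), dead.sum()]) + 1e-7
    n = 3.0 * PER_TRT
    return 2.0 * (np.sum(cells * np.log(cells)) - 3 * PER_TRT * np.log(PER_TRT)
                  - np.sum(cols * np.log(cols)) + n * np.log(n))

rng = np.random.default_rng(1)
z = []
for _ in range(5000):
    draws = rng.integers(1, N_TOTAL + 1, size=(3, PER_TRT))   # labels 1..150
    z.append(g_stat((draws <= N_LIVE).sum(axis=1)))           # labels 1..112 = live
print(np.mean(np.array(z) >= 16.333))     # observed G from the text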


Program 6

The following Resampling Stats program is a modification of Program 2 that accommodates a randomized complete block design. Note that input of the data by treatment and block is necessary only to verify that the program is initially generating the correct F values for the observed data.

maxsize default 5000
data (10) T1B1r                       'Input the survival values for each treatment.
data (10) T2B1r
data (9) T3B1r
data (9) T1B2r
data (9) T2B2r
data (7) T3B2r
data (8) T1B3r
data (8) T2B3r
data (5) T3B3r
data (8) T1B4r
data (8) T2B4r
data (5) T3B4r
data (7) T1B5r
data (8) T2B5r
data (1) T3B5r
data (5) N_T1                         'Input the no. of obs. in Treatment 1, etc.
data (5) N_T2
data (5) N_T3
data (3) N_B1                         'Input the no. of obs. in Block 1, etc.
data (3) N_B2
data (3) N_B3
data (3) N_B4
data (3) N_B5
add N_T1 N_T2 N_T3 N
data (2) DFTR                         'Input the d.f. for treatments.
data (4) DFBL                         'Input the d.f. for blocks.
multiply DFTR DFBL DFE                'Compute the error d.f.
concat T1B1r T1B2r T1B3r T1B4r T1B5r T2B1r T2B2r T2B3r T2B4r T2B5r T3B1r T3B2r T3B3r T3B4r T3B5r ALLr
repeat 5000                           'Run 5000 randomizations.
shuffle ALLr RAND                     'Shuffle the data.
take RAND 1 T1B1                      'Assign each of the 15 observations to a treatment-block combination.
take RAND 2 T2B1
take RAND 3 T3B1
take RAND 4 T1B2
take RAND 5 T2B2
take RAND 6 T3B2
take RAND 7 T1B3
take RAND 8 T2B3
take RAND 9 T3B3
take RAND 10 T1B4
take RAND 11 T2B4
take RAND 12 T3B4
take RAND 13 T1B5
take RAND 14 T2B5
take RAND 15 T3B5
concat T1B1 T1B2 T1B3 T1B4 T1B5 T1    'Group by Treatment.
concat T2B1 T2B2 T2B3 T2B4 T2B5 T2
concat T3B1 T3B2 T3B3 T3B4 T3B5 T3
concat T1B1 T2B1 T3B1 B1              'Group by Block.
concat T1B2 T2B2 T3B2 B2
concat T1B3 T2B3 T3B3 B3
concat T1B4 T2B4 T3B4 B4
concat T1B5 T2B5 T3B5 B5
concat T1 T2 T3 ALL                   'Pool all data.
sum ALL GTOT                          'Compute correction term, CT.
square GTOT GTOTsq
divide GTOTsq N CT
square ALL ALLsq                      'Compute total sum of squares, SST.
sum ALLsq UCSST
subtract UCSST CT SST
sum T1 sumT1                          'Compute treatment sum of squares, SSTR.
sum T2 sumT2
sum T3 sumT3
square sumT1 sumT1sq
square sumT2 sumT2sq
square sumT3 sumT3sq
divide sumT1sq N_T1 SS1
divide sumT2sq N_T2 SS2
divide sumT3sq N_T3 SS3
add SS1 SS2 SS3 UCTSS
subtract UCTSS CT SSTR
concat T1 T2 T12                      'Compute contrast T1 vs T2, SSC1.
sum T12 C1
add N_T1 N_T2 N_T12
square C1 C1sq
divide C1sq N_T12 CTC1
add SS1 SS2 UCC1SS
subtract UCC1SS CTC1 SSC1
concat T1 T3 T13                      'Compute contrast T1 vs T3, SSC2.
sum T13 C2
add N_T1 N_T3 N_T13
square C2 C2sq
divide C2sq N_T13 CTC2
add SS1 SS3 UCC2SS
subtract UCC2SS CTC2 SSC2
sum B1 sumB1                          'Compute block sum of squares, SSB.
sum B2 sumB2
sum B3 sumB3
sum B4 sumB4
sum B5 sumB5
square sumB1 sumB1sq
square sumB2 sumB2sq
square sumB3 sumB3sq
square sumB4 sumB4sq
square sumB5 sumB5sq
divide sumB1sq N_B1 SSB1
divide sumB2sq N_B2 SSB2
divide sumB3sq N_B3 SSB3
divide sumB4sq N_B4 SSB4
divide sumB5sq N_B5 SSB5
add SSB1 SSB2 SSB3 SSB4 SSB5 UCBSS
subtract UCBSS CT SSB
subtract SST SSTR SSTB                'Compute error sum of squares, SSE.
subtract SSTB SSB SSE
divide SSTR DFTR MSTR                 'Compute mean squares, MSTR, MSB, and MSE.
divide SSB DFBL MSB
divide SSE DFE MSE
divide MSTR MSE F_TR                  'Compute F values.
divide MSB MSE F_BL
divide SSC1 MSE F_C1
divide SSC2 MSE F_C2
score F_TR DIST_TR                    'Keep a running tally of F values.
score F_BL DIST_BL
score F_C1 DIST_C1
score F_C2 DIST_C2
end
histogram DIST_TR                     'Plot frequency distribution of F.
count DIST_TR >= 10.83 C_TR           'Count F values as extreme as those observed.
divide C_TR 5000 PTR                  'Compute p-value.
print PTR
histogram DIST_BL
count DIST_BL >= 5.35 C_BL
divide C_BL 5000 PBL
print PBL
histogram DIST_C1
count DIST_C1 >= 0.07 C_C1
divide C_C1 5000 PC1
print PC1
histogram DIST_C2
count DIST_C2 >= 15.17 C_C2
divide C_C2 5000 PC2
print PC2
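In any language, the blocked analysis reduces to reshuffling all 15 observations across the treatment × block grid and recomputing the two-way F ratios. A Python sketch (ours, not part of the original appendix) that reproduces the observed F values of 10.83 for treatments and 5.35 for blocks before shuffling:

import numpy as np

# Rows are treatments, columns are blocks (survival counts, Table 2).
y = np.array([[10, 9, 8, 8, 7],
              [10, 9, 8, 8, 8],
              [ 9, 7, 5, 5, 1]], float)

def f_ratios(m):
    a, b = m.shape
    ct = m.sum() ** 2 / m.size                       # correction term
    ss_tot = (m ** 2).sum() - ct
    ss_tr = (m.sum(axis=1) ** 2).sum() / b - ct      # treatment SS
    ss_bl = (m.sum(axis=0) ** 2).sum() / a - ct      # block SS
    ms_e = (ss_tot - ss_tr - ss_bl) / ((a - 1) * (b - 1))
    return ss_tr / (a - 1) / ms_e, ss_bl / (b - 1) / ms_e

rng = np.random.default_rng(1)
f_tr_obs, f_bl_obs = f_ratios(y)                     # about 10.83 and 5.35
hits = sum(f_ratios(rng.permutation(y.ravel()).reshape(3, 5))[0] >= f_tr_obs
           for _ in range(5000))
print(f_tr_obs, f_bl_obs, hits / 5000)

The block F distribution is scored the same way by keeping the second element of the returned pair.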

Program 7

The following Resampling Stats program conducts a test of independence on the mesocosm data listed in Table 4, accounting for blocking in the experimental design. Significance is determined by a randomization test. See Sokal and Rohlf (1981, p. 696) for a discussion of the G statistic used in this test.

maxsize default 5000
data (10) T1B1r                       'Input the survival values for each treatment-block combination.
data (10) T2B1r
data (9) T3B1r
data (9) T1B2r
data (9) T2B2r
data (7) T3B2r
data (8) T1B3r
data (8) T2B3r
data (5) T3B3r
data (8) T1B4r
data (8) T2B4r
data (5) T3B4r
data (7) T1B5r
data (8) T2B5r
data (1) T3B5r
data (50) ROWTR                       'Input row totals for the T×S table.
data (30) ROWBL                       'Input row totals for the B×S table.
concat T1B1r T1B2r T1B3r T1B4r T1B5r T2B1r T2B2r T2B3r T2B4r T2B5r T3B1r T3B2r T3B3r T3B4r T3B5r ALLr
repeat 5000                           'Run 5000 randomizations.
shuffle ALLr RAND                     'Shuffle the data.
take RAND 1 T1B1                      'Assign each of the 15 observations to a treatment-block combination.
take RAND 2 T2B1
take RAND 3 T3B1
take RAND 4 T1B2
take RAND 5 T2B2
take RAND 6 T3B2
take RAND 7 T1B3
take RAND 8 T2B3
take RAND 9 T3B3
take RAND 10 T1B4
take RAND 11 T2B4
take RAND 12 T3B4
take RAND 13 T1B5
take RAND 14 T2B5
take RAND 15 T3B5
concat T1B1 T1B2 T1B3 T1B4 T1B5 LT1   'Arrange by Treatment.
concat T2B1 T2B2 T2B3 T2B4 T2B5 LT2
concat T3B1 T3B2 T3B3 T3B4 T3B5 LT3
concat T1B1 T2B1 T3B1 LB1             'Arrange by Block.
concat T1B2 T2B2 T3B2 LB2
concat T1B3 T2B3 T3B3 LB3
concat T1B4 T2B4 T3B4 LB4
concat T1B5 T2B5 T3B5 LB5
concat LT1 LT2 LT3 ALL                'Pool all data.

'Treatment × Survival: generate the T×S table.
sum LT1 sumLT1                        'Sum the number of live in each Treatment.
sum LT2 sumLT2
sum LT3 sumLT3
concat sumLT1 sumLT2 sumLT3 LTr       'Vectorize the total live counts.
subtract ROWTR LTr DTr                'Compute vector of dead counts.
add 0.0000001 DTr DT                  'Adjust 0 values so that logarithms can be taken.
add 0.0000001 LTr LT
sum LT sumLT                          'Compute column totals.
sum DT sumDT
add sumLT sumDT N                     'Compute grand total.
log LT lnLT                           'Compute logs of all values.
log DT lnDT
log sumLT lnsumLT
log sumDT lnsumDT
log ROWTR lnROWTR
log N lnN
multiply LT lnLT LlnLT                'Compute f × ln(f) values for all.
sum LlnLT Q1
multiply DT lnDT DlnDT
sum DlnDT Q2
multiply sumLT lnsumLT Q3
multiply sumDT lnsumDT Q4
multiply ROWTR lnROWTR Q5
multiply N lnN Q6
add Q1 Q2 Q7
add Q3 Q4 Q5 Q5 Q5 Q8
subtract Q7 Q8 Q9
add Q9 Q6 Q99
multiply 2 Q99 GTS                    'Compute G statistic for T×S.

'Block × Survival: generate the B×S table.
sum LB1 sumLB1                        'Sum the number of live in each Block.
sum LB2 sumLB2
sum LB3 sumLB3
sum LB4 sumLB4
sum LB5 sumLB5
concat sumLB1 sumLB2 sumLB3 sumLB4 sumLB5 LBr   'Vectorize the total live counts.
subtract ROWBL LBr DBr                'Compute vector of dead counts.
add 0.0000001 DBr DB                  'Adjust 0 values so that logs can be taken.
add 0.0000001 LBr LB
sum LB sumLB                          'Compute column totals.
sum DB sumDB
log LB lnLB                           'Compute logs of all values.
log DB lnDB
log sumLB lnsumLB
log sumDB lnsumDB
log ROWBL lnROWBL
multiply LB lnLB LlnLB                'Compute f × ln(f) values for all.
sum LlnLB Q10
multiply DB lnDB DlnDB
sum DlnDB Q11
multiply sumLB lnsumLB Q12
multiply sumDB lnsumDB Q13
multiply ROWBL lnROWBL Q14
add Q10 Q11 Q15
add Q12 Q13 Q14 Q14 Q14 Q14 Q14 Q16
subtract Q15 Q16 Q17
add Q17 Q6 Q18
multiply 2 Q18 GBS                    'Compute G statistic for B×S.
score GTS GTR                         'Keep the 5000 pairs of G statistics for T×S and B×S.
score GBS GBL
end
write file "c:\data\resample\mesocosm\contin3.out" GTR GBL
'Output the 5000 pairs of G statistics to a file. Read them into SAS and count the number
'of times where G(T×S) is greater than or equal to 16.333 AND G(B×S) is greater than or
'equal to 19.229.
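The final counting step, which the program above delegates to SAS, amounts to a few lines in Python (ours, not part of the original appendix). It assumes the output file holds two whitespace-separated columns per shuffle, G(T×S) then G(B×S), as the write statement suggests; the exact format Resampling Stats writes should be verified before relying on this:

import numpy as np

# Hypothetical reading of the file written above.
gts, gbs = np.loadtxt(r"c:\data\resample\mesocosm\contin3.out", unpack=True)
p_joint = np.mean((gts >= 16.333) & (gbs >= 19.229))
print(p_joint)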

REFERENCES

Bradley, J. V. (1968). Distribution-Free Statistical Tests. Prentice-Hall, Englewood Cliffs, NJ.
Conover, W. J. (1980). Practical Nonparametric Statistics. Wiley, New York.
Diaconis, P., and Efron, B. (1983). Computer-intensive methods in statistics. Sci. Am. 248, 116-128.
Draper, N. R., and Smith, H. (1981). Applied Regression Analysis. Wiley, New York.
Edgington, E. S. (1980). Randomization tests. In Statistics: Textbooks and Monographs. Marcel Dekker, New York.
Edwards, D., and Berry, J. J. (1987). The efficiency of simulation-based multiple comparisons. Biometrics 43, 913-928.
Efron, B. (1979a). Computers and the theory of statistics: Thinking the unthinkable. Soc. Ind. Appl. Math. 21, 460-480.
Efron, B. (1979b). Bootstrap methods: Another look at the jackknife. Ann. Stat. 7, 1-26.
Efron, B. (1982). The Jackknife, the Bootstrap, and Other Resampling Methods, CBMS-NSF Monograph 38. Society for Industrial and Applied Mathematics.
Everitt, B. S. (1977). The Analysis of Contingency Tables. Chapman & Hall, New York.
Fisher, R. A. (1935). The Design of Experiments. Oliver & Boyd, Edinburgh, U.K.
Henry, C. J., Higgins, K. F., and Buhl, K. J. (1994). Acute toxicity and hazard assessment of Rodeo, X-77 Spreader, and Chem-Trol to aquatic invertebrates. Arch. Environ. Contam. Toxicol. 27, 392-399.
Hicks, C. R. (1982). Fundamental Concepts in the Design of Experiments. Saunders College, Fort Worth, TX.
Hurlbert, S. H. (1984). Pseudoreplication and the design of ecological field experiments. Ecol. Monogr. 54(2), 187-211.
Johnson, R. A., and Wichern, D. W. (1988). Applied Multivariate Statistical Analysis. Prentice-Hall, Englewood Cliffs, NJ.
Kemble, N. E., Brumbaugh, W. G., Brunson, E. L., Dwyer, F. J., Ingersoll, C. G., Monda, D. P., and Woodward, D. F. (1994). Toxicity of metal-contaminated sediments from the Upper Clark Fork River, Montana, to aquatic invertebrates and fish in laboratory exposures. Environ. Toxicol. Chem. 13, 1985-1997.
Kempthorne, O. (1952). The randomization theory of experimental inference. J. Am. Stat. Assoc. 50, 946-967.
Keppel, G. (1973). Design and Analysis: A Researcher's Handbook. Prentice-Hall, Englewood Cliffs, NJ.
Kirk, R. E. (1978). Introductory Statistics. Brooks-Cole, Belmont, CA.
Kreutzweiser, D. P. (1997). Nontarget effects of neem-based insecticides on aquatic invertebrates. Ecotoxicol. Environ. Saf. 36, 109-117.
Magnussen, S., and Burgess, D. (1997). Stochastic resampling techniques for quantifying error propagations in forest field experiments. Can. J. For. Res. 27, 630-637.
Manly, B. F., McAlevey, J. L., and Stevens, D. (1986). A randomization procedure for comparing group means on multiple measurements. Br. J. Math. Stat. Psychol. 39, 183-189.
Manly, B. F. J. (1991). Randomization and Monte Carlo Methods in Biology. Chapman & Hall, New York.
Manly, B. F. J. (1997). Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall, New York.
Milliken, G. A., and Johnson, D. E. (1992). Analysis of Messy Data, Vol. 1: Designed Experiments. Van Nostrand Reinhold, New York.
Montgomery, D. C. (1991). Design and Analysis of Experiments. Wiley, New York.
Moulton, L. H., and Zeger, S. L. (1989). Analyzing repeated measures on generalized linear models via the bootstrap. Biometrics 45, 381-394.
Neter, J., Wasserman, W., and Kutner, M. H. (1990). Applied Linear Models. Richard D. Irwin, Homewood, IL.
Petrondas, D. A., and Gabriel, K. R. (1983). Multiple comparisons by rerandomization tests. J. Am. Stat. Assoc. 78, 949-957.
Pitman, E. J. G. (1937). Significance tests which may be applied to samples from any populations: III. The analysis of variance test. Biometrika 29, 322-335.
Potvin, C., and Roff, D. A. (1993). Distribution-free and robust statistical methods: Viable alternatives to parametric statistics? Ecology 74, 1617-1628.
Schreuder, H. T., Gregoire, T. G., and Wood, G. B. (1993). Sampling Methods for Multiresource Forest Inventory. Wiley, New York.
Shapiro, S. S., and Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika 52, 591-611.
Simon, J. L. (1992). Resampling: The New Statistics. Resampling Stats, Inc., Arlington, VA.
Sokal, R. R., and Rohlf, F. J. (1981). Biometry: The Principles and Practices of Statistics in Biological Research. Freeman, New York.
Solomon, F. (1987). Probability and Stochastic Processes. Prentice-Hall, Englewood Cliffs, NJ.
Sparks, T. H., and Rothery, P. (1996). Resampling methods for ecotoxicological data. Ecotoxicology 5, 197-207.
Steel, R. G. D., and Torrie, J. H. (1980). Principles and Procedures of Statistics. McGraw-Hill, New York.
Ury, H. K. (1976). A comparison of four procedures for multiple comparisons among means (pairwise contrasts) for arbitrary sample sizes. Technometrics 18, 89-97.
Winer, B. J., Brown, D. R., and Michels, K. M. (1991). Statistical Principles in Experimental Design. McGraw-Hill, New York.