Randomization Tests: Example Using Morphological Differences in ...

FORUM

Randomization Tests: Example Using Morphological Differences in Aphis gossypii (Homoptera: Aphididae) T. A. EBERT, 1 W. S. FARGO, B. CARTWRIGHT, AND F. R. HALL 1 Department of Entomology, 127 Noble Research Center, Oklahoma State University, Stillwater, OK 74078

Ann. Entomol. Soc. Am. 91(6): 761-770 (1998)

ABSTRACT Morphometric data of Aphis gossypii Glover are used as a case study to illustrate the use of randomization tests. The application of randomization tests in morphological evaluation and identification of species is a powerful tool for characterizing populations and species. It offers the advantage of reducing our reliance on the robustness of more classical approaches to overcome problems of small sample size, unequal sample size, and departures from normality. We review randomization test methodology and address a few errors that have appeared in the literature. One question is how many randomizations to use. As a generic starting point, the number of randomizations should be 2 orders of magnitude larger than the inverse of the significant P value, but in critical cases an exact figure can be determined. A new methodology is introduced for using randomization tests to determine if the average of several observations is different from a constant. An extension of the method is used when the null hypothesis states that there are differences. This is important where there is reason to suspect that one is dealing with different populations (e.g., morphological measurements were taken from several distinct populations of an insect) and one needs to identify which populations are the same. This test should not be confused with the typical case where it is simply impossible to identify differences between different sets of observations. We present a SAS program to perform 2-tailed tests for differences between means.

KEY WORDS Aphis gossypii, randomization tests, morphometric analysis

RANDOMIZATION TESTS USE the power of computers to overcome some of the limitations in data analysis imposed by classical statistics. The advantage of randomization methods stems from the fact that they are more flexible than standard tests and they use the distribution of the data as opposed to the hypothetical distribution of the population. In many cases, parametric tests (e.g., the Student t-test) are sufficient, and deviations from the assumptions of the model are too small to affect conclusions. However, violations of the assumptions in parametric models are important in cases where the exact significance level is an important element in the analysis and conclusions. Randomization tests are also useful in cases where conventional tests are inappropriate because of small sample size or experimental design (Crowley 1992). There are a few errors in the published literature that are an integral part of the literature review, but which we discuss and correct in subsequent sections.

Three assumptions are necessary for making a valid randomization test. First, observations should be independent. Second, samples should be a random subset from the population of interest (Manly 1991). Third, observations come from the same (but unknown) distribution. This assumption is necessary because these tests are sensitive to differences in variance and other moments (e.g., skewness and kurtosis) (Crowley 1992).

1 LPCAT-OARDC-OSU, 1680 Madison Avenue, Wooster, OH 44691.

For clarity, we provide a brief outline of the article. To make the methodology accessible to everyone, and as a reference for expanding the methodology, we begin by looking at how to design and implement a randomization test. Next we describe the data for the example, and how the data are analyzed. For comparison, a brief introductory analysis is presented which uses conventional statistical methodology. The discussion of the randomization methodology begins by dealing with the number of randomizations required to achieve accurate results. Because different authors recommend different numbers of randomizations with no method for determining which recommendation is appropriate, we felt that some technique should be developed to provide the data analyst with a strategy for determining the appropriate number of randomizations for any situation. We finish this effort with a method for determining the accuracy of the analysis. Next, 2 sections expand on the randomization methodology. The 1st shows how to compare a sample mean to a constant. The 2nd describes a method to extract additional information from data by setting up and testing a null-hypothesis which states that there are differences. The discussion section summarizes the biological results, and provides some concluding remarks on the methodology.

Although this article is mostly a worked example, the example itself should not be entirely eclipsed. Aphis gossypii Glover (melon aphid) is an important pest of > 100 crops worldwide. It causes economic loss


to plants in Cucurbitaceae (squash, melons) and Malvaceae (okra, cotton) through direct feeding. It also transmits viral pathogens in many crops (Ebert and Cartwright 1997). With emphasis on pest control shifting toward ecologically sound methods, it becomes increasingly important to have quick methods for accurately analyzing the current and future population structure of a pest. This aphid undergoes a morphological transformation in response to environmental conditions. In the summer it is yellow and small, and in the cooler months it is large and green (Ebert and Cartwright 1997). Because it is well known that size and reproductive potential are highly correlated, color may be useful in monitoring this aphid's population growth potential. Therefore, the example we chose for illustrating the implementation of a randomization test was to look for differences and similarities between yellow and green color morphs of the melon aphid.

Design and Implementation. Randomization tests are used to examine differences in some statistic (e.g., mean, Mahalanobis distance, slope of a regression line) between 2 or more sets of observations. The sets of observations can be from 2 populations of organisms or from different treatments in an experiment. The logic behind these tests is simple. The standard null-hypothesis states that there are no differences between sets of observations. If one assumes that the null-hypothesis is true, then it makes no difference where an observation comes from. By treating all data as if they were collected as random samples from a single population, one can directly determine the probability that one would find a difference equal to or greater than the observed difference given that the null-hypothesis is true. As with more conventional tests, one rejects the null-hypothesis if this probability is small.
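The logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions and is not the SAS program of Appendix 1: the function names, the seed, and the two small data sets are hypothetical (they are not the aphid measurements). The sketch pools the observations, repeatedly re-divides them at random without replacement while keeping the group sizes, and counts how often the absolute difference in means is at least as large as the observed one. It uses the P value convention nge / (n + 1), the standard deviation of P, and the count of possible arrangements NP that are given later in this section.

```python
import math
import random


def count_arrangements(group_sizes):
    """Number of distinct ways to divide the pooled data into groups of these sizes."""
    total = math.factorial(sum(group_sizes))
    for n_i in group_sizes:
        total //= math.factorial(n_i)
    return total


def mean(values):
    return sum(values) / len(values)


def randomization_test(group_a, group_b, n_rand=10_000, seed=1):
    """Two-tailed approximate randomization test for a difference in means.

    Returns the observed absolute difference, the estimated P value
    nge / (n_rand + 1), and the standard deviation of that P value.
    """
    rng = random.Random(seed)  # Python's built-in generator, not SAS RANUNI
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    nge = 0  # randomized differences >= the observed difference
    for _ in range(n_rand):
        rng.shuffle(pooled)  # random reassignment without replacement
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            nge += 1
    p = nge / (n_rand + 1)
    sd_p = math.sqrt(p * (1 - p) / n_rand)
    return observed, p, sd_p


# Hypothetical body lengths in mm, for illustration only.
green = [1.30, 1.41, 1.25, 1.38, 1.33, 1.45, 1.29, 1.36]
yellow = [1.10, 1.18, 1.05, 1.22, 1.15, 1.09, 1.20, 1.12]

observed, p, sd_p = randomization_test(green, yellow)
print("arrangements for 20 + 20 observations:", count_arrangements([20, 20]))
print("observed difference:", round(observed, 4), "P:", p, "SD of P:", sd_p)
```

The observed arrangement is deliberately not counted among the randomizations, which follows the argument made later in this section about bias.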
The 1st step in performing the test is to pool all the observations and then randomly and without replacement reassign each observation to each treatment level such that all treatments have the same number of observations as they had in the original data set (Edgington 1987, Sokal and Rohlf 1995). The statistic of interest (e.g., mean) is then computed, the difference between treatments calculated, and the process repeated. If the total number of possible arrangements of the data is small, then all arrangements are used and it is called a permutation test. If the number of arrangements is large, then a random subset of all possible arrangements is used and the test is called a "sampled randomization test" (Crowley 1992) or an "approximate randomization test" (Noreen 1989). The total number of possible permutations of the data is calculated as

$$NP = \frac{\left(\sum_{i} n_i\right)!}{\prod_{i} n_i!}$$

where NP is the number of permutations, and $n_i$ is the number of observations in the ith group. Suggestions for the number of randomizations typically range from


1,000 to 5,000 (Noreen 1989, Manly 1991, Potvin and Roff 1993), but Jackson and Somers (1989) recommend using 10,000 to 50,000. Although this may appear to be a large number of randomizations, consider our worked example: 20 yellow and 20 green aphids were measured, therefore $NP = 1.3785 \times 10^{11}$, so 50,000 randomizations is only about 0.00004% of NP. In approximate randomization tests, one wants first to randomize the order of the data. The alternative approach of using the observed distribution as 1 of the randomizations produces a bias in the results. Consider: if one uses only 2 randomizations, and 1 of them is the observed distribution, then one will find at least 1 value equal to or larger than the observed difference. This bias decreases as the number of randomizations increases, but it will always persist in approximate randomization tests where the calculations begin by using the data in the observed order, or some other nonrandomly determined order. The P value is calculated as $nge / (n + 1)$, where nge is the number of differences greater than or equal to the observed difference, and n is the total number of randomizations used in the analysis (Manly 1991). Because the randomization procedure is approximate rather than exact, there is some error expected in the estimate of the P value. The standard deviation for the P value is estimated as $\sqrt{p(1 - p)/n}$ (Potvin and Roff 1993), where p is the calculated probability from the frequency distribution of differences, and n is the number of randomizations of the data. Finally, randomization tests have an unusual limitation. They are restricted to comparing 2 or more groups. To compare 1 group to a constant, randomization tests are not useful for the simple reason that there is nothing to randomize (Manly 1991). As Manly points out, Fisher's randomization test is one way around the problem.
However, that test is not a randomization test as the term is used in this text (or in Manly's book) because the observations themselves are not reordered randomly. There is 1 aspect of approximate randomization tests that has nothing to do with the test, but is critical in the implementation. Because these tests depend on a large number of random numbers, the quality of the random number generator (RNG) is critical to the accuracy of the test. Most RNGs have a periodic bias in their random numbers, better RNGs having longer periods (Ferrenberg et al. 1992, Grassberger 1993). Currently, no RNG is universally accepted. The problem is complex and is made more difficult because there may be interactions between the kind of random number generator used and the statistical procedure used for analysis (Ferrenberg et al. 1992). A reasonable solution to the problem is to use a well-known RNG and report exactly what type was used along with any modifications used in improving the RNG. The RANUNI(0) function in SAS (SAS Institute 1989) was used for the randomizations in the analyses for this article. This random number generator is a linear congruential RNG using modulo $2^{31} - 1$ with a multiplier of 397204094, and is described further in Fishman and Moore (1982, 1986). Persons trying this analysis using


a standard computer programming language are encouraged to write their own random number generator or to write a subroutine that reduces problems associated with many personal computer random number generators (see Knuth 1981; Press et al. 1986, 1988, and references therein). To illustrate the method, an approximate randomization procedure is used to examine differences in morphological characters between 2 color morphs (yellow and green) of the melon aphid A. gossypii. This aphid undergoes a seasonal color change. During cooler months in the spring and fall, the entire population is dark green. During warmer summer months, the aphid is smaller and yellow. However, even during the summer months some fraction of the population remains green. Biologically it could be useful to know how accurately a difference in color reflects a difference in aphid size—it is far easier to group aphids by color than it is to measure all of them. If color and size are highly correlated, this relationship might make following an aphid population in the field easier because size is highly correlated with reproductive potential in aphids (Llewellyn and Brown 1985). It would also be useful to know which measurements are most sensitive to the color difference, and to see if any of the measurements are characteristic of A. gossypii as a species.
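For readers who follow the suggestion above and write their own generator, the following is a minimal Python sketch of a multiplicative congruential generator using the modulus and multiplier quoted earlier in this section. It is an assumption-laden illustration: the function names and seed handling are hypothetical, and the sketch is not claimed to reproduce the number stream of the SAS RANUNI function. A Fisher-Yates shuffle shows how such a stream could drive the random reordering of pooled observations.

```python
MODULUS = 2**31 - 1      # modulus quoted in the text
MULTIPLIER = 397204094   # multiplier quoted in the text


def uniform_stream(seed):
    """Yield pseudo-random numbers in (0, 1) from x -> MULTIPLIER * x mod MODULUS."""
    state = seed % MODULUS
    if state == 0:
        raise ValueError("seed must not be a multiple of the modulus")
    while True:
        state = (MULTIPLIER * state) % MODULUS
        yield state / MODULUS


def shuffle_in_place(items, stream):
    """Fisher-Yates shuffle driven by the supplied stream of uniform numbers."""
    for i in range(len(items) - 1, 0, -1):
        j = int(next(stream) * (i + 1))  # index in 0..i
        items[i], items[j] = items[j], items[i]


stream = uniform_stream(seed=12345)  # hypothetical seed
pooled = list(range(10))
shuffle_in_place(pooled, stream)
print(pooled)
```

Whatever generator is used, reporting its type and parameters, as recommended above, lets others repeat the analysis.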


Materials and Methods

All aphids were reared on a single watermelon plant [Citrullus lanatus (Thunb.) Matsum. & Nakai 'Jubilee'] in a walk-in growth chamber held at 25 ± 0.4°C:23 ± 0.4°C corresponding to a photoperiod of 16:8 (L:D) h. The relative humidity was 58 ± 10%. Light was provided by both fluorescent and incandescent light bulbs with a light intensity of 4.09 μmol s⁻¹ m⁻² at 660 nm, and 0.853 μmol s⁻¹ m⁻² at 730 nm at the level of the pot (chlorophyll is most sensitive to wavelengths at 660 nm and 730 nm, and melon aphids are sensitive to different wavelengths [Wyatt and Brown 1977]). The plant was infested with 5 adult apterous aphids from a parent colony reared on the same cultivar of watermelon. The colony was allowed to develop for 2 wk and then adult apterous aphids were removed, sorted by color into yellow or green, and frozen. Voucher specimens were retained by M. Stoetzel, and additional specimens have been deposited in the K. C. Emerson Entomology Museum (Oklahoma State University, Stillwater, OK).

Morphological characters from 20 adult apterous aphids of each color morph were measured using an Olympus Stereoscopic microscope with an ocular micrometer calibrated to 0.0167 mm. Body length was measured from the tip of the cauda to the extreme frontal part of the head as suggested by Ilharco and van Harten (1987). Length of metathoracic tibia, length of cornicles (=siphunculi), and the greatest distance between the outer margins of the compound eyes were also measured.

Approximate randomization tests were performed using the program in Appendix 1, with the methodology as described in the introduction. Where necessary, additional detail is provided in each section. An approximate randomization procedure is required because there are too many permutations of 40 observations for a personal computer to perform a permutation test in any reasonable length of time. Programs in FORTRAN and Basic to perform randomization tests have already been published (Edgington 1987, Noreen 1989, Manly 1991). Such programs are good for experienced programmers, but all phases of the analysis must be written and checked for programming errors. This includes the parametric or nonparametric test statistic used in comparing the sets of observations, the random number generator, and the randomization routine. The advantage of the program in Appendix 1 is that standard SAS routines do most of the work, which makes the program easier to use and easier to adapt for other situations.

It has been stated before that randomization tests are sensitive to the distribution of the data about the expected value of the statistic (in this case the mean). In the spirit of the randomization test, one would like to examine all possible distributions of the data about this statistic. To empirically study this effect we observe that there are 2 occurrences which could produce 2 groups with identical means but different distributions. First, the dispersion of numbers (variance) about the mean could increase. This effect will rapidly result in rejection of any difference between the 2 groups. Second, the data could be skewed such that a large number of observations close to the mean balance a few extreme observations, or 1 subgroup of similar observations could balance another subgroup such that the observations within subgroups are closer than between subgroups but they balance to form the mean. To carefully look at the effects of skewness (and other moments), we created different data sets using random numbers from the SAS random number generator RANNOR(0). Eight calls to this generator were made to produce each data set. All data sets have a mean of 0.00 ± 0.05 with a standard deviation of 0.377 ± 0.06. Sets of 8 numbers that did not conform to these limits were not used. Additional constraints were imposed on the generator to mimic the effects of subgroups (or clusters of observations). This was done by setting 1 or more of the 8 data values to equal one another, and then letting the random number generator sifter find sets of numbers which fit the selection criteria. All possible combinations of identical observations of 8 numbers were examined. A 95% CI was produced using a t-test and a permutation test. The terminology used in Table 5 is similar to that used in card games like poker: a pair is 2 data points which are the same plus 6 other numbers to keep the mean and standard deviation within the defined limits (e.g., 0.23, 0.23, 0.11, -0.12, -0.11, 0.20, 0.10, -0.87). Likewise, a pair and 3 of a kind might be something like 4.1, 4.1, -0.2, -0.2, -0.2, and 3 other numbers to keep the mean and standard deviation within the limits.

In examining the data in different ways (as in Tables 2-4), we needed to estimate the error associated with doing approximate randomization tests. This was done


by repeating the test 10 times at the specified number of randomizations. This was used to examine the variability in the test results and how the number of randomizations affected the error.

Introductory Results. A preliminary examination of the data showed that the green color morph of A. gossypii was larger than the yellow morph under the conditions in the growth chamber (Table 1). This difference was significant for all characters at P ≤ 0.0001 for univariate models with color as the dependent variable and morphological characters as independent variables. Ratios of the different morphological measurements also showed significant differences between the 2 color morphs with the exception of the ratio of body/eye. Of the models examined, the size of the cornicles provided the clearest separation between yellow and green color morphs. With the exception of the ratio of body/eye, all measures show a significant difference between yellow and green color morphs.

Table 1. Morphometric characters of different color morphs from adult apterous aphids reared on a single watermelon plant (mean ± SD)

                   Morph                                Model
Parameter, mm      Green (n = 20)    Yellow (n = 20)    r²      P > F
Body               1.335 ± 0.142     1.149 ± 0.129      0.33    0.0001
Tibia              0.715 ± 0.056     0.573 ± 0.082      0.52    0.0001
Cornicle           0.267 ± 0.037     0.186 ± 0.030      0.60    0.0001
Eye                0.334 ± 0.017     0.301 ± 0.014      0.54    0.0001
Body/Tibia         1.868 ± 0.148     2.025 ± 0.210      0.16    0.0097
Body/Cornicle      5.082 ± 0.761     6.246 ± 0.695      0.40    0.0001
Body/Eye           3.994 ± 0.363     3.810 ± 0.340      0.07    0.1062
Tibia/Cornicle     2.713 ± 0.286     3.086 ± 0.149      0.41    0.0001
Tibia/Eye          2.139 ± 0.117     1.893 ± 0.191      0.39    0.0001
Cornicle/Eye       0.798 ± 0.104     0.616 ± 0.074      0.52    0.0001

The program in Appendix 1 produces a frequency distribution of differences between the means of 2 groups (Fig. 1). Two features illustrated in Fig. 1 are important. First, these frequency distributions usually show that the difference occurring with the greatest frequency is slightly larger than zero. This result is intuitively obvious because one expects that the number of ways to arrange any set of numbers into 2 groups such that their difference is exactly zero should be fewer than the number of ways to arrange them such that there is a small difference. Second, the lines connecting dots do nothing more than clarify the arrangement of data points. A randomization test does not result in a continuous function no matter how large the data set or how many iterations. Because of the randomness in successive data points shown in Fig. 1, it would be inappropriate to interpolate between successive points to achieve an "exact" distance. If there are no randomizations resulting in the target value, then one selects the next most restrictive level. This applies to both significance levels and confidence intervals.

Fig. 1. Frequency distribution for absolute difference in mean body length (x-axis: absolute difference in mm).

Number of Randomizations. In choosing the number of randomizations, one should remember that these tests produce estimates of the true difference between sets of observations and that the number of randomizations reduces the level of uncertainty in that estimation. However, the beauty of these tests is that one can estimate the error and estimate the optimal number of randomizations necessary for any level of accuracy. For example, if one needs to know what difference in body length between yellow and green aphids would have been significant at the α = 0.10 level, one enters the data into the program in Appendix 1. One initially does 1,000 randomizations, but then there are questions about the accuracy of the results. One then creates a table similar to Table 2 and does a regression analysis predicting the standard deviation based on the number of randomizations. For body length, the equation is log(standard deviation) = -2.9887 - (0.4442 * log[number of randomizations]) (r² = 0.998, P > F = 0.0008).

This approach is useful if the specific level of accuracy is important, or if controversy develops over the specific number of randomizations. However, most applications are not so demanding, and a rule of thumb would be useful. Accuracy is measured by how many randomizations are necessary to eliminate any change in a specific decimal place after adding twice the standard deviation to the mean. For the variable cornicle at α = 0.10, the answer is 10 for the 1st decimal place if one adds twice the standard deviation to the average (Table 2). It is 1,000 for the 2nd decimal place and 10,000 for the 3rd. In other words, it is about an order of magnitude greater than the inverse of the significant digit (2nd decimal place is 10 * [1/0.01], 3rd is 10 * [1/0.001]). At α = 0.01, a similar pattern holds, but it now takes 10 times as many randomizations to achieve the desired level of accuracy. The pattern is approximately the same for all of the variables, which would lead one to conclude that one needs at least (1/α) × (1/significant decimal place) number of randomizations. Although it is usually sufficient to be certain of significance (or accuracy) at the hundredths place, a further simplified rule of thumb might be to use 2 orders of magnitude more randomizations than the inverse of alpha. Thus, a claim of having an alpha of 0.01 should be backed up by at least 10,000 randomizations.
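The regression of standard deviation on number of randomizations can be sketched in Python using the body-length standard deviations reported for alpha = 0.10 in Table 2. Two points are assumptions of this sketch rather than statements from the text: natural logarithms are used (the base is not stated, but natural logs appear to give coefficients close to the published equation), and the helper names fit_line and randomizations_for_sd are hypothetical.

```python
import math

# Body length, alpha = 0.10: number of randomizations and the SD of the
# estimated significant difference (values taken from Table 2).
N_RANDOMIZATIONS = [10, 100, 1_000, 10_000]
BODY_SD = [0.01900, 0.00618, 0.00225, 0.00088]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; also returns r squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy ** 2 / (sxx * syy)


log_n = [math.log(n) for n in N_RANDOMIZATIONS]   # natural log: an assumption
log_sd = [math.log(sd) for sd in BODY_SD]
slope, intercept, r_squared = fit_line(log_n, log_sd)


def randomizations_for_sd(target_sd):
    """Invert the fitted line to estimate the randomizations needed for a target SD."""
    return math.ceil(math.exp((math.log(target_sd) - intercept) / slope))


print("slope:", round(slope, 4), "intercept:", round(intercept, 4))
print("r squared:", round(r_squared, 3))
print("randomizations for SD = 0.001 mm:", randomizations_for_sd(0.001))
```

A slope near -0.5 is what one would expect if the error shrinks roughly with the square root of the number of randomizations, which is consistent with the standard deviation formula for the P value given earlier.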

Table 2. Error in estimating a significant mean difference (in mm) for variables cornicle, body, and eye given alpha = 0.10 and alpha = 0.01

                                Cornicle (CV^a = 23.21)       Body (CV = 13.18)             Eye (CV = 7.16)
Alpha   No. of randomizations   Difference ± SD       CV      Difference ± SD       CV      Difference ± SD       CV
0.10    10                      0.03842 ± 0.01265     32.92   0.10783 ± 0.01900     17.62   0.01200 ± 0.00154     12.87
0.10    100                     0.02725 ± 0.00218     8.00    0.08117 ± 0.00618     7.61    0.01150 ± 0.00097     8.40
0.10    1,000                   0.02733 ± 0.00088     3.23    0.08517 ± 0.00225     2.64    0.01183 ± 0.00040     3.40
0.10    10,000                  0.02717 ± 0.00026     0.97    0.08500 ± 0.00088     1.03    0.01208 ± 0.00000     0.00
0.01    100                     0.04617 ± 0.00648     14.05   0.13867 ± 0.01451     10.46   0.01883 ± 0.00169     8.96
0.01    1,000                   0.04242 ± 0.00181     4.26    0.12933 ± 0.00512     3.96    0.01792 ± 0.00056     3.10
0.01    10,000                  0.04217 ± 0.00061     1.46    0.13083 ± 0.00136     1.04    0.01842 ± 0.00043     2.34

a The CVs listed across the top are for the data set (both color morphs). The CVs listed in columns are for the differences.

A similar exercise can be done for assessing the accuracy in the P value of a specific difference (Table 3). However, in this table an objective rule of thumb is harder to derive. Using the same criteria as for Table 2, there is no change in the tenths place at 10-100 randomizations. However, it takes 10,000 randomizations to achieve accuracy in the hundredths place at a difference of 0.1392 mm and more than that for the smaller difference. It would appear that for a specific distance between sets of observations, one needs at least 10 * (1/[significant decimal place]) randomizations, but that it may take >10 times this value. Trying to be as generic as possible, this suggests that one should use at least 100 * (1/[significant decimal place]) randomizations. A final word of caution about significance and the number of randomizations. A recent revision of the biometry text by Sokal and Rohlf (1995) reported a P value of
