Environmental Modelling & Software 22 (2007) 1796–1800 www.elsevier.com/locate/envsoft

Short communication

Resampling-based software for estimating optimal sample size

R. Confalonieri a,*, M. Acutis b, G. Bellocchi c, G. Genovese a

a European Commission Directorate General Joint Research Centre, Institute for the Protection and Security of the Citizen, Agriculture and Fisheries Unit, via E. Fermi 1-TP 268, I-21020 Ispra (VA), Italy
b University of Milan, Department of Crop Science, via Celoria 2, I-20133 Milan, Italy
c European Commission Directorate General Joint Research Centre, Institute for Health and Consumer Protection, Biotechnology and GMOs Unit, via E. Fermi 1-TP 331, I-21020 Ispra (VA), Italy

Received 28 June 2006; received in revised form 14 February 2007; accepted 28 February 2007
Available online 5 April 2007

Abstract

The SISSI program implements a novel approach for estimating the optimal sample size in experimental data collection. It provides a visual evaluation system for sample size determination, derived from a resampling-based procedure (namely, the jackknife). The approach makes intensive use of the sample data by systematically taking sub-samples of the original data set and calculating the mean and standard deviation of each sub-sample. This approach overcomes the typical limitation of conventional methods, which require the data to meet statistical assumptions. Visual, easy-to-interpret displays show the variation of means and standard deviations as the size of the generated samples increases. An automatic option identifies the optimal sample size as the size at which the rate of change of the means becomes negligible. Alternatively, a manual option can be applied. An ideal application of SISSI is in supporting the collection of plant and soil samples from field-grown crops, but it also holds potential for more general application. SISSI is developed in Visual Basic and runs under the Windows operating systems. The installation software package includes the executable files and a hypertext help file. SISSI is freely available for non-profit applications.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Optimal sample size; Resampling; SISSI; Visual Basic; Visual jackknife

Software availability

Name of Software: SISSI (Shortcut In Sample Size Identification).
Developer: Roberto Confalonieri.
Contact Address: JRC-IPSC, Ispra, Italy.
Tel.: +39 0332 789872. Fax: +39 0332 789029.
E-mail address: [email protected].
Availability: http://agrifish.jrc.it/marsstat/warm/archive.htm.

* Corresponding author. Tel.: +39 0332 789 872; fax: +39 0332 789 029. E-mail address: [email protected] (R. Confalonieri).
1364-8152/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.envsoft.2007.02.006

1. Introduction

Sample size determination is a key issue in obtaining reliable measurements that are representative of the system under study (Clayton and Slack, 1988; Delisle et al., 1988; Nath and Singh, 1989; Tsegaye and Hill, 1998; Windmeijer et al., 1998). Preliminary sampling aimed at determining the best sample size should be performed. Brockett (2006), for instance, in his study on vegetation composition assessment across plant communities, identified 150–200 point-observations per plot as the best possible sample size using 2000 point-observations in a preliminary sample. Preliminary sampling is, however, usually not carried out because of resource limitations (Ambrosio et al., 2004). In practice, the sample size is often chosen subjectively from a wide range of values available in the literature for similar experimental conditions (Frederick and Camberato, 1995; Jamieson et al., 1998; Serrano et al., 2000; Yuan et al., 2000; Olesen et al., 2002).

Software tools do exist for computer-aided support of sample size determination. Power and Precision (Borenstein et al., 2001) is an example of a stand-alone program to help find an appropriate balance among study design, assumptions


and statistical power. Similar features are provided by StudySize (Oloffson, 2003) and Sample Size Calculator (Survey System, 2003). They all rely largely on statistical assumptions about the data distribution, which are unlikely to hold when working with biological samples. In particular, conventional inferential methods based on the Student-t distribution are of no use without knowledge of population characteristics. A parametric approach to sample size determination is also implemented in the software package SeedCalc for use in seed testing plans for purity/impurity determination (Remund et al., 2001). The software KeSTE (Paoletti et al., 2003) was developed in the context of genetically modified organism detection in kernel lots, to investigate the sampling error associated with different sample sizes without the need for distributional assumptions. However, the software requires special parameters to reconstruct kernel populations, and its specific target does not allow its application to other study areas.

The variability inherent in samples and the uncertainty associated with factors affecting the measure of interest make it problematic to estimate the acceptable sampling error for each experimental situation. Moreover, the few observations commonly available from trials do not allow a reliable evaluation of the normality of the distribution. Examples exist in the literature (e.g. Yonezawa, 1985; Tirol Padre et al., 1988; Wolkowski et al., 1988) where the decrease in variability of samples as sample size increases is analysed instead of assuming absolute criteria for determining sample size. The approach known as the "method of maximum curvature" or its modified forms (Lessman and Atkins, 1963; Thomas, 1974) uses the relationship between the coefficient of variation and the number of experimental units as a basis for estimating the optimum number of units, and may require complex exploratory experiments (De Oliveira et al., 2005).
Suitable statistical procedures are available to exploit intensively the information contained in a single sample. Resampling methods (Efron, 1981; Efron and Tibshirani, 1991) offer this opportunity because they are based on the repeated use of the data from one sample. A particular resampling method, derived from the jackknife (Tukey, 1958; Hinkley, 1983) and known as the visual jackknife, was recently proposed by Confalonieri (2004) as an alternative method for sample size determination. This approach is assumption-free and can be used effectively when parametric statistics are difficult to apply. Initially proposed for sampling biomass from rice fields and soil water content from maize fields, the same approach proved useful when applied to a variety of experimental conditions (Confalonieri et al., 2006). The basic idea of jackknife sample generation is, first, to segregate all the available sample data into sets and, then, to create sub-samples, which are slightly reduced bodies of data obtained by leaving out one of the sets. Statistics are calculated on each of these virtual samples. Resampling methods rely heavily on the computational capabilities of computers, and dedicated software tools are required for their application. The objective of this paper is to describe the software implementation of the visual jackknife (namely, SISSI) for use in sample size determination.
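As a minimal illustration of the leave-one-out idea described above, the Python sketch below drops one observation at a time and computes the mean and standard deviation of each virtual sample. Treating every observation as its own set is an assumption made here for simplicity; the function name and the readings are hypothetical and are not taken from SISSI.

```python
from statistics import mean, stdev


def leave_one_out_samples(data):
    """Yield each virtual sample obtained by dropping one observation."""
    for i in range(len(data)):
        yield data[:i] + data[i + 1:]


# Hypothetical soil water content readings (%), for illustration only.
readings = [21.0, 24.5, 19.8, 23.1, 22.6]

# One mean and one standard deviation per virtual sample.
virtual_means = [mean(s) for s in leave_one_out_samples(readings)]
virtual_sds = [stdev(s) for s in leave_one_out_samples(readings)]

for m, sd in zip(virtual_means, virtual_sds):
    print(f"mean={m:.2f}  sd={sd:.2f}")
```

The spread of `virtual_means` indicates how sensitive the estimate is to a single observation; the visual jackknife extends this by leaving out progressively more values.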


2. Software features

A summary description of SISSI's statistical bases and procedures follows. The procedures implemented in the software, the scientific background, and some principles of usage are illustrated in a fully documented hypertext help file.

2.1. Automatic sample size determination

According to the jackknife approach, the original sample of $N$ elements is divided into groups of $k$ elements. $N!/[(N-k)!\,k!]$ virtual samples (combinations without repetitions) of $N-k$ elements are generated by eliminating at each iteration $k$ different values from the original sample. The generation of virtual samples is repeated $N-1$ times with $k$ assuming values from 1 to $N-2$, for a total of $\sum_{k=1}^{N-2} N!/[(N-k)!\,k!]$ virtual samples. As $N$ increases, the total number of virtual samples may become very large, so it is recommended that an upper threshold be set ($N = 200$–$1000$, as deduced from Efron and Tibshirani, 1993, and Manly, 1997) in order to limit the time required to generate samples. Means and standard deviations are computed for all the generated samples and plotted on two charts against the sample sizes $N-k$ (see the example in Section 3). This gives a visual representation of how the means and the standard deviations of the generated samples vary as sample size increases. Repeating the process using initial samples of different sizes would provide evidence of the possible sensitivity of the estimated solution to initial conditions. When the trends of the means are analysed for increasing values of $N-k$, the optimum sample size is taken to be the $N-k$ value beyond which the variability between the means no longer decreases significantly with increasing sample size. Specifically, the automatic procedure consists of selecting a candidate value $(N-k)_0$ among those $N-k$ values higher than 2 and lower than $N-2$.
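The generation step above can be sketched in Python as follows. This is a minimal illustration, not SISSI's implementation: the names are hypothetical, and the cap is assumed here to apply per sample size, with random draws replacing full enumeration once the number of possible subsets exceeds it.

```python
import random
from itertools import combinations
from math import comb
from statistics import mean, stdev


def total_virtual_samples(n):
    """Count delete-k subsets for k = 1 .. n-2 (equals 2**n - n - 2)."""
    return sum(comb(n, k) for k in range(1, n - 1))


def virtual_sample_stats(data, cap=200, seed=0):
    """Return {size: [(mean, sd), ...]} for virtual samples of size 2 .. n-1.

    Assumption: `cap` limits the number of virtual samples per size; sizes
    with more possible subsets than `cap` are sampled at random, so repeated
    subsets are possible.
    """
    rng = random.Random(seed)
    n = len(data)
    stats = {}
    for k in range(1, n - 1):            # k = number of values left out
        size = n - k                     # size of each virtual sample
        if comb(n, k) <= cap:
            subsets = combinations(data, size)
        else:
            subsets = (rng.sample(data, size) for _ in range(cap))
        stats[size] = [(mean(s), stdev(s)) for s in subsets]
    return stats


# Hypothetical measurements, for illustration only.
data = [float(v) for v in range(1, 11)]
stats = virtual_sample_stats(data)
for size in sorted(stats):
    means = [m for m, _ in stats[size]]
    print(size, len(means), round(max(means) - min(means), 2))
```

Because the full count grows as $2^N - N - 2$, exhaustive enumeration is impractical even for moderate $N$, which is why some form of threshold is needed.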
Four weighted linear regressions are performed over the generated means as follows: (i) the first runs over the highest values for $N-k \le (N-k)_0$; (ii) the second runs over the lowest values in the same range; (iii) the third runs over the highest values for $N-k > (N-k)_0$; and (iv) the fourth runs over the lowest values in that range. A global index ($\Sigma R^2$) is computed by summing the coefficients of determination of the four regressions. Repeating steps (i) to (iv) for all possible $(N-k)_0$ identifies the optimum sample size, that is, the $(N-k)_0$ for which $\Sigma R^2$ is highest. The process stops when the next sample size does not produce a $\Sigma R^2$ more than 5% larger than the previous value. A trimming process allows extreme samples to be excluded from the computation (for instance, the 5% most extreme samples).

2.2. Manual sample size determination

The user may manually adjust the resampling-estimated sample size. This is meant to reduce the sample size further if the resulting variability (expressed as % coefficient of variation) is still expected to fall within what the researcher considers acceptable. Fig. 1 shows an example of the increase in the range of variability after manual sample size reduction.
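Returning to the automatic procedure, the search over candidate breakpoints in steps (i) to (iv) can be sketched as below. Several simplifications are assumed and are not taken from the source: the "highest" and "lowest" values are read as the per-size maximum and minimum of the means, the regressions are unweighted, each segment must contain at least three sizes, a perfectly flat segment is given a coefficient of determination of 1, and the 5% stopping rule and trimming are omitted. All names and the synthetic envelopes are hypothetical.

```python
def r_squared(xs, ys):
    """R^2 of an ordinary (unweighted) least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if syy < 1e-12:                      # flat segment: a line fits it exactly
        return 1.0
    return sxy * sxy / (sxx * syy)


def best_breakpoint(envelope_hi, envelope_lo):
    """Return (size0, sum_r2) maximising the sum of the four R^2 values.

    envelope_hi / envelope_lo map sample size -> highest / lowest mean.
    """
    sizes = sorted(envelope_hi)
    best = None
    for i in range(2, len(sizes) - 3):   # keep at least 3 sizes on each side
        left, right = sizes[:i + 1], sizes[i + 1:]
        total = 0.0
        for env in (envelope_hi, envelope_lo):
            for segment in (left, right):
                total += r_squared(segment, [env[s] for s in segment])
        if best is None or total > best[1]:
            best = (sizes[i], total)
    return best


# Synthetic envelopes that narrow linearly up to size 20 and are flat afterwards.
sizes = range(2, 49)
hi = {s: 10.0 + 0.5 * max(0, 20 - s) for s in sizes}
lo = {s: 10.0 - 0.5 * max(0, 20 - s) for s in sizes}
size0, sum_r2 = best_breakpoint(hi, lo)
print(size0, round(sum_r2, 3))
```

With real virtual-sample means, the envelopes would be built by taking the maximum and minimum mean at each sample size before calling `best_breakpoint`.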


Fig. 1. Automatic and manual sample size: comparison between the coefficients of variation.
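The comparison in Fig. 1 rests on the percent coefficient of variation at each candidate sample size. A minimal Python sketch of that calculation follows, assuming the coefficient of variation is taken over the means of the virtual samples at each size; the data, the threshold values and the function names are hypothetical.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Percent coefficient of variation: 100 * standard deviation / mean."""
    return 100.0 * stdev(values) / mean(values)


def smallest_acceptable_size(cv_by_size, max_cv):
    """Smallest size such that it and every larger size have CV <= max_cv."""
    chosen = None
    for size in sorted(cv_by_size, reverse=True):
        if cv_by_size[size] > max_cv:
            break
        chosen = size
    return chosen


# Hypothetical means of virtual samples at three sample sizes.
means_by_size = {
    5: [20.1, 23.4, 21.9, 24.8],
    10: [21.5, 22.9, 22.1, 23.0],
    20: [22.2, 22.5, 22.3, 22.6],
}
cv_by_size = {size: cv_percent(m) for size, m in means_by_size.items()}

for size, cv in sorted(cv_by_size.items()):
    print(size, round(cv, 2))
print(smallest_acceptable_size(cv_by_size, max_cv=5.0))
```

A researcher willing to accept a larger coefficient of variation can lower the sample size below the automatic estimate, which is the trade-off the manual option exposes.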

2.3. Student-t based sample size determination

The user may compare the resampling-based automatic sample size determination with the sample size estimated from a conventional approach based on the two-tailed, one-sample Student-t distribution (an application of normal theory). In this case the user is asked to supply the maximum acceptable error (the difference between sample and population mean) and an estimate of the variance known to be associated with the process under study (Fig. 2). The Student-t determination is provided as a term of comparison for the automatic estimate, but its validity depends on the data being normally distributed.

2.4. Technical features

SISSI, the software for easy access to resampling-based runs in the context of sample size determination, is a Visual Basic program running on Microsoft Windows operating systems (2000/XP). The procedures available in SISSI and the inputs/outputs are summarized in the UML (Unified Modelling Language) activity diagram of Fig. 3. A user-friendly interface allows users to easily manipulate input files, customize the resampling settings, execute resampling runs, calculate sample size, and produce reports. The program reads input data from Excel spreadsheets. The required inputs are numeric values entered in a single column. Numeric and visual outputs are displayed by the interface. Results and settings of the computational procedures can be exported to text files.

Fig. 3. UML activity diagram of SISSI.

2.5. Availability and feedback

SISSI is available free of charge for non-commercial purposes. The installation package is supplied, on request, to interested users. The program is documented by the accompanying user manual, which gives a detailed description of the procedures implemented, the scientific background and principles of usage. Comments about SISSI can be emailed to the software developer.

3. Illustrative example

A numerical example was carried out to show how basic jackknife principles are linked to the visual capabilities of SISSI. The example refers to Time Domain Reflectometer (TDR) percent soil water content measurements made at 0.12 m depth in semi-natural grassland (Milan, Italy). A preliminary sample was generated by taking a single measurement from each of 49 quadrats (1 × 1 m each) established in a ~50 m² plot. The graphs in Fig. 4 are isolated from SISSI's interface to show the distribution of means and standard deviations for different sample sizes (5000 virtual samples, no trimming). In this example, both the means and the standard deviations of the generated samples tend to stabilize once the sample size approaches 20. The horizontal bar in Fig. 4 allows manual adjustment of the sample size.

Fig. 2. Student-t based sample size determination.
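For reference, a common textbook form of the one-sample, two-tailed Student-t calculation is given below. The source does not state the exact expression used by SISSI, so this is an assumed form: $E$ denotes the maximum acceptable error, $s^2$ the variance estimate supplied by the user, and $\alpha$ the chosen significance level.

```latex
n \;\ge\; \frac{t_{1-\alpha/2,\;n-1}^{2}\; s^{2}}{E^{2}}
```

Because the $t$ quantile itself depends on $n$ through its $n-1$ degrees of freedom, the inequality is usually solved iteratively, starting from a trial $n$ and updating it until the value no longer changes.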


Fig. 4. Visual representation of jackknife means and standard deviations calculated from 49 TDR % soil water content measurements.

4. Concluding remarks

The software program SISSI for determination of sample size in experimental research serves as a convenient means to support the implementation of sampling plans of diverse complexity. A pre-sampling of N data is required (with N being large enough), and size reduction is carried out via the jackknife technique until an optimization criterion is met. If the variability does not stabilize with size reduction, a larger N could be taken or, alternatively, the feasibility of the experiment should be reconsidered when the number of data to be taken becomes too large. The visual jackknife helps in this respect, with the automatic procedure being only one of the options allowed. A study of the coefficients of variation using the manual option would help scientists arrive at a realistic sample size, balancing the associated variability against experimental practicability. SISSI was first and foremost developed to support experiments in the agronomic domain. Its use by users with all kinds of statistical background has shown that this tool has the flexibility and the simplicity to meet the main needs of those who are required to conduct experimental research. Thanks to its features and capabilities, i.e. independence from statistical assumptions, customization of the resampling strategy, and display of relevant statistics, SISSI is attractive for sample size determination in areas other than the one for which it was developed and first applied. Its design, rather than being targeted at use in a specific experimental frame, intrinsically promotes more general applicability in a straightforward way. The developers are committed to enriching the offerings of SISSI to keep up with evolving statistical methodology (including alternative optimization criteria).

Acknowledgements

Several people involved in field and laboratory research have been and will continue to be involved in testing the software on diverse data, under a range of conditions. Special thanks are addressed to them.

References

Ambrosio, L., Iglesias, L., Marin, C., del Monte, J.P., 2004. Evaluation of sampling methods and assessment of the sample size to estimate the weed seedbank in soil taking into account spatial variability. Weed Research 44, 224–236.
Borenstein, M., Rothstein, H., Cohen, J., 2001. Power and Precision. Biostat Inc, Englewood NJ, USA.
Brockett, B.H., 2006. Pilot survey to assess sample size for herbaceous species composition assessments using a wheel-point apparatus on the Zululand coastal plain. African Journal of Range & Forage Science 23, 153–157.
Clayton, M.K., Slack, S.A., 1988. Sample size determination in zero tolerance circumstances and the implications of stepwise sampling: bacterial ring rot as a special case. American Potato Journal 65, 711–723.
Confalonieri, R., 2004. A jackknife-derived visual approach for sample size determination. Rivista Italiana di Agrometeorologia 1, 9–13.
Confalonieri, R., Stroppiana, D., Boschetti, M., Gusberti, D., Bocchi, S., Acutis, M., 2006. Analysis of rice sample size variability due to development stage, nitrogen fertilization, sowing technique and variety using the visual jackknife. Field Crops Research 97, 135–141.
Delisle, G.P., Woodard, P.K., Titus, S.J., Johnson, A.F., 1988. Sample size and variability of fuel weight estimates in natural stands of lodgepole pine. Canadian Journal of Forest Research 18, 649–652.
De Oliveira, S.J.R., Stork, L., Lopes, S.J., Lucio, A.D., Feijó, S., Damo, H.P., 2005. Plot size and experimental unit relationship in exploratory experiments. Scientia Agricola 62, 585–589.
Efron, B., 1981. Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika 68, 589–599.
Efron, B., Tibshirani, R., 1991. Statistical data analysis in the computer age. Science 253, 390–395.
Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap. Chapman and Hall/CRC, New York, USA.
Frederick, J.R., Camberato, J.J., 1995. Water and nitrogen effects on winter wheat in the Southeastern Coastal Plain: II physiological responses. Agronomy Journal 87, 527–533.
Hinkley, D.V., 1983. Jackknife methods. Encyclopedia of Statistical Science 4, 280–287.
Jamieson, P.D., Porter, J.R., Goudriaan, J., Ritchie, J.T., van Keulen, H., Stol, W., 1998. A comparison of the models AFRCWHEAT2, CERES-Wheat, Sirius, SUCROS2 and SWHEAT with measurements from wheat grown under drought. Field Crops Research 55, 23–44.
Lessman, K.J., Atkins, R.E., 1963. Optimum plot size and relative efficiency of lattice designs for grain sorghum yield test. Crop Science 3, 477–481.
Manly, B.F.J., 1997. Randomization, Bootstrap, and Monte Carlo Methods in Biology. Chapman and Hall/CRC, New York, USA.
Nath, N., Singh, S.P.N., 1989. Determination of sample size and sample number for armyworm populations studies. Oryza 26, 285–287.
Olesen, J.E., Petersen, B.M., Berntsen, J., Hansen, S., Jamieson, P.D., Thomsen, A.G., 2002. Comparison of methods for simulating effects of nitrogen on green area index and dry matter growth in winter wheat. Field Crops Research 74, 131–149.
Oloffson, B., 2003. StudySize. CreoStat HB, Frolunda, Sweden.


Paoletti, C., Donatelli, M., Kay, S., Van den Eede, G., 2003. Simulating kernel lot sampling: the effect of heterogeneity on the detection of GMO contaminations. Seed Science and Technology 31, 629–638.
Remund, K., Dixon, D., Wright, D., Holden, L., 2001. Statistical considerations in seed purity testing for transgenic traits. Seed Science Research 11, 101–119.
Serrano, L., Filella, I., Peñuelas, J., 2000. Remote sensing of biomass and yield of winter wheat under different nitrogen supplies. Crop Science 40, 723–731.
Survey System, 2003. Sample Size Calculator. Creative Research System, Petaluma CA, USA.
Thomas, H.L., 1974. Relationship between plot size and plot variance. Agricultural Research Journal of Kerala 12, 178–189.
Tirol Padre, A., Ladha, J.K., Punzalan, G.C., Watanabe, I., 1988. A plant sampling procedure for acetylene reduction assay to detect rice varietal differences in ability to stimulate N2 fixation. Soil Biology and Biochemistry 20, 175–183.
Tsegaye, T., Hill, R.L., 1998. Intensive tillage effects on spatial variability of soil physical properties. Soil Science 163, 143–154.
Tukey, J.W., 1958. Bias and confidence in not quite large samples. Annals of Mathematical Statistics 29, 614.
Windmeijer, P.N., Stomph, T.J., Adam, A., Coppus, R., de Ridder, N., Kandeh, M., Mahamanand, M., van Loon, M., 1998. Transect sampling strategies for semi-detailed characterization of inland valley systems. Netherlands Journal of Agricultural Science 46, 15–25.
Wolkowski, R.P., Reisdorf, T.A., Bundy, L.G., 1988. Field plot technique comparison for estimating corn grain and dry matter yield. Agronomy Journal 80, 278–280.
Yonezawa, K., 1985. A definition of the optimal allocation of effort in conservation of plants genetic resources with application to sample size determination for field collection. Euphytica 34, 345–354.
Yuan, L., Yanqun, Z., Haiyan, C., Jianjun, C., Jilong, Y., Zhide, H., 2000. Intraspecific responses in crop growth and yield of 20 wheat cultivars to enhanced ultraviolet-B radiation under field conditions. Field Crops Research 67, 25–33.