A resampling variance estimator for the k nearest neighbours technique

S. Magnussen, R.E. McRoberts, and E.O. Tomppo
Abstract: Current estimators of variance for the k nearest neighbours (kNN) technique are designed for estimates of population totals. Their efficiency in small-area estimation problems can be poor. In this study, we propose a modified balanced repeated replication estimator of variance (BRR) of a kNN total that performs well in small-area estimation problems and under both simple random and cluster sampling. The BRR estimate of variance is the sum of variances and covariances of unit-level kNN estimates in the area of interest. In Monte Carlo simulations of simple random and cluster sampling from seven artificial populations with real and simulated forest inventory data, the agreement between averages of BRR estimates of variance and Monte Carlo sampling variances was good both for population and for small-area totals. The modified BRR estimator is currently limited to sample sizes no larger than 1984. An accurate approximation to the proposed BRR estimator allows significant savings in computing time.
Introduction

Forest inventories are in a good position to take advantage of the nonparametric (distribution-free) k nearest neighbours (kNN) technique for the estimation of quantitative or categorical attribute values for all units in a finite population U (McRoberts et al. 2007; Tomppo et al. 2008). Typically, p-dimensional ancillary data (X) correlated with the attribute(s) of interest (Y) are known for every unit $u_i$, $i = 1, \ldots, N$, in U, while Y values are known only for a subset of n units in a probability sample from U, henceforth referred to as reference units. Population units are typically determined by a compromise between the spatial extent of inventory sampling units and the spatial resolution of the ancillary data. A unit-level kNN estimate of Y ($\tilde{y}$) is a weighted sum of the Y values associated with the k reference units whose X values are closest to the X values of the unit for which the estimate is made. Reference units are ranked by distance according to an adopted distance metric (Finley et al. 2006).

Important progress has been achieved in quantifying the uncertainty of a kNN estimate of a population total $T_y$ (Katila 2006; McRoberts et al. 2007; Magnussen et al. 2009), including a Bayesian approach (Finley et al. 2008). Katila (2006) quantified error in a study with approximately known true values of totals and means. Diagnostic tools help the analyst to assess the expected performance and to identify outliers (McRoberts 2009). Experience confirms that the kNN estimate of $T_y$ ($\tilde{T}_y$) is almost unbiased when N is relatively large and n is not too small (Mäkelä and Pekkarinen 2001; McRoberts et al. 2002; Baffetta et al. 2009).

When reference units are obtained from a probability sample, a design-based estimator of the sampling variance is preferable to model-based and model-assisted estimators (Särndal et al. 1992, p. 238). However, the nonparametric nature of kNN estimators precludes an analytical design-based kNN variance estimator. The requirement of model-based or model-assisted kNN estimators of uncertainty is that they provide a good approximation to the sampling variance of a kNN estimate in replicated sampling. A nearly unbiased model-assisted empirical kNN difference estimator (DIF) of $T_y(U)$ has been proposed for simple random sampling without replacement (SRSwor) by Baffetta et al. (2009)
Received 25 August 2009. Accepted 13 January 2010. Published on the NRC Research Press Web site at cjfr.nrc.ca on 1 April 2010.
S. Magnussen.¹ Natural Resources Canada, Canadian Forest Service, 506 West Burnside Road, Victoria, BC V8Z 1M5, Canada.
R.E. McRoberts. USDA Forest Service, Northern Research Station, St. Paul, MN 55108, USA.
E.O. Tomppo. The Finnish Forest Research Institute, Helsinki, FIN-00170, Finland.
¹Corresponding author (e-mail: [email protected]).
Can. J. For. Res. 40: 648–658 (2010)
doi:10.1139/X10-020
Published by NRC Research Press
who also suggested a Horvitz–Thompson type estimator of the sampling variance of $T_y$ (Särndal et al. 1992, p. 223). DIF also applies to single-stage cluster sampling with equal-sized clusters when the cluster is treated as a single unit and cluster totals are used as data. However, DIF cannot deal with the desired combination of cluster sampling and unit-level kNN estimation (McRoberts et al. 2007; Tomppo et al. 2008).

Classical resampling approaches to variance estimators for kNN estimates have generally been disappointing (Shao 1996; Chen and Shao 2001). Spatial subsampling for variance estimation has more promise (Sherman 1996; Lahiri 2003; Nordman and Lahiri 2004). Ekström and Sjöstedt-De Luna (2004) proposed a subsampling variance estimator for sample means that can handle differently distributed and locally dependent observations with smoothly varying expected values. They demonstrated their estimator with a kNN application to the estimation of the total of stem volume in two forest populations of approximately 3000 and 10 300 ha. The estimated variance came reasonably close to the mean squared error derived from a field validation study within the same area. However, the requirements that the estimator be unbiased at the unit level (Ekström and Sjöstedt-De Luna 2004, p. 82, bottom) and that expectations vary slowly across space would seem to preclude it from kNN applications, since it is known that the unit-level kNN estimator is biased (Baffetta et al. 2009). The expected kNN estimate for a unit is a linear combination of all units in the population with a nonzero probability of becoming a k nearest neighbour to the unit in question. It is therefore not unbiased for any k. As well, for typically small forestry plots, the attributes of interest can change rapidly over small spatial distances (Gilbert and Lowell 1997). Finally, the correlation among kNN estimates is foremost governed by the number of shared nearest neighbours. In environments with large spatial gradients, this may translate into a correlation that declines and even vanishes with increasing spatial separation.

In practical kNN applications, there is often interest in both unit-level kNN estimates and small-area estimates of totals $T_y(g)$ and their uncertainty in subpopulations $U_g$, $g = 1, \ldots, G$, of U ($U = \bigcup_{g=1}^{G} U_g$) (Katila 2006). Baffetta et al.'s (2009) variance estimator is sensitive to the number of sample units in the area of interest. There must be at least two sample units before an estimate of variance can be calculated. In cluster sampling, DIF cannot deal with estimation of totals in areas and populations that do not tessellate into an integer number of equal-sized clusters. Model-based approaches to estimation at the subpopulation level can produce inconsistent results when model-dependent relationships vary among subpopulations. Actual sample sizes rarely support modelling at the subpopulation level. Model-assisted estimators encounter similar issues (Opsomer et al. 2007).

In this study, we propose a modified version of a balanced repeated replications (BRR) (Kish and Frankel 1970; Shao 1996) estimator of the sampling variance (covariance) of unit-level kNN estimates. With variances and covariances estimated at the unit level, the estimator of the variance of any total is simply the sum of the estimated variances and covariances of the units included in the total. Thus, the distribution of sample units across the small area is no longer an issue in terms of variance estimation. Furthermore, the procedures behind BRR allow for the combination of cluster sampling and unit-level kNN estimation of totals and their sampling variance. Our BRR estimator is demonstrated in seven artificial populations with real and simulated data in settings with simple random sampling and single-stage cluster sampling with equal-sized clusters.
Methods

Populations and sampling objectives

A finite population (U) composed of N equal-sized, spatially nonoverlapping compact units (viz. pixels) has q attributes of interest ($Y = \{Y_1, \ldots, Y_q\}$) associated with each unit. The Y values are unknown. It is desired to estimate the total of Y in U and, secondarily, in any subpopulation (small area) within U, say $U_1, \ldots, U_G$. To this end, a without-replacement probability sample (s) of size n is taken from U, and a set of p ancillary variables ($X = \{X_1, \ldots, X_p\}$) with known (fixed) values for every population unit is selected on the basis of known or anticipated correlation with Y. Sampled units are collectively referred to as the reference set. Nonsampled units are called target units. It is decided to use the kNN technique to produce the desired estimates. An estimate of the sampling variance of estimated totals is required.

Notation

To simplify notation, the estimators of variance and covariance are limited to a univariate Y. They apply to a multivariate Y when applications are limited to a trait-by-trait estimation of means, totals, variances, and covariances. This will be the anticipated typical application. Extensions to multivariate applications require no new theory and the necessary modifications are straightforward to implement. The notation "var" stands for the estimator of a variance (mean of squared differences to the mean) and "cov" for the covariance estimator (average crossproduct of differences to the mean of two variables).

kNN estimators of totals

We adopt a general unit-level kNN estimator of Y for the ith population unit (Haara et al. 1997):

\[ \tilde{y}_i = \sum_{i \sim k_j} w_j y_j, \quad i = 1, \ldots, N \tag{1} \]

where $w_j$ is a vector of q weights and $i \sim k_j$ means that the summation is taken over the k reference units for which $x_j$ is a k nearest neighbour to $x_i$ with respect to some distance metric in X. For the sake of demonstration, all weights are fixed at $k^{-1}$, and Euclidean distances between unit-level vectors of X values are used to identify the k nearest neighbours. The kNN estimator of a population total becomes

\[ \tilde{T}_y = \sum_{i=1}^{N} \tilde{y}_i \tag{2} \]
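As an illustration of eqs. 1 and 2, the following minimal Python sketch computes unit-level kNN estimates with equal weights 1/k and Euclidean distances, then sums them to a total. The function names and the toy data are hypothetical and are not taken from the study; ties in distance are broken by reference-unit order, which is an assumption.

```python
import math

def knn_indices(x_target, x_ref, k):
    """Indices of the k reference units closest to x_target (Euclidean distance)."""
    order = sorted(range(len(x_ref)), key=lambda j: math.dist(x_target, x_ref[j]))
    return order[:k]

def knn_estimate(x_target, x_ref, y_ref, k):
    """Eq. 1 with all weights fixed at 1/k (the demonstration setting)."""
    nearest = knn_indices(x_target, x_ref, k)
    return sum(y_ref[j] for j in nearest) / k

def knn_total(x_units, x_ref, y_ref, k):
    """Eq. 2: sum of unit-level estimates over the units listed in x_units."""
    return sum(knn_estimate(x, x_ref, y_ref, k) for x in x_units)

# Hypothetical toy data: one ancillary variable, four reference units.
x_ref = [(0.0,), (1.0,), (2.0,), (3.0,)]
y_ref = [10.0, 12.0, 20.0, 22.0]
x_pop = [(0.1,), (0.9,), (2.1,), (2.9,), (1.4,)]

total = knn_total(x_pop, x_ref, y_ref, k=2)
print(total)
```

Passing only the units of a small area as `x_units` gives the corresponding small-area total.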
A unit average is estimated by dividing eq. 2 by N. An estimator of the total in any small area contained in U can be
obtained from eq. 2 with the summation limited to units in the small area.

Baffetta et al. (2009) proposed a nearly unbiased empirical difference kNN estimator (DIF) of a total under SRSwor. Their unit-level estimator $\tilde{y}_i^{dif}$ is identical to $\tilde{y}_i$ for nonsampled units; for a reference unit, their kNN estimator is the mean of its k nearest reference units, excluding the unit itself. The DIF estimator of a univariate population total is

\[ \tilde{T}_y^{dif} = \sum_{i \in U} \tilde{y}_i^{dif} - N n^{-1} \sum_{i \in s} \left( \tilde{y}_i^{dif} - y_i \right) \tag{3} \]

where the second sum is a bias adjustment (Särndal et al. 1992, p. 221). The DIF estimator also applies to small areas in U by limiting the sums to units and reference units within a small area. However, additivity of estimated small-area totals is not assured due to a residual noise problem (Baffetta et al. 2009). The problem is unlikely to be of any serious consequence.

Estimators in eqs. 2 and 3 both accommodate cluster sampling with equal-sized clusters when a cluster is treated as a unit and cluster totals are used as X and Y values. In practical applications, it may be preferable to work with unit-level kNN estimation and only account for the cluster design in the estimator of sampling variance. While the general kNN estimator of a total in eq. 2 applies to any design, the DIF estimator in eq. 3 is restricted to SRSwor (Baffetta et al. 2009).

Estimators of sampling variance of totals

We first introduce our modified version of the BRR estimator of variance (Wolter 1985, chap. 3) and then the model-assisted DIF variance estimator for SRSwor (Särndal et al. 1992, p. 223). Since both attempt an estimation of the sampling variance, a direct comparison is of interest. Notice that the sampling variance is the sample-to-sample variation in an estimate and that it is caused by the random selection of sample units. Spatial correlation of Y values in a population is therefore irrelevant for the estimation of a sampling variance, although it can strongly influence the efficiency of a sampling design (Gregoire 1998).

The classic BRR variance estimator of a total is based on the variance of the estimated total derived from L halfsample pseudoreplications of the sample data (Kish and Frankel 1970).
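The DIF estimator of a total in eq. 3 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of Baffetta et al. (2009): weights are fixed at 1/k, distances are Euclidean, ties are broken by reference-unit order, and the function names and toy data are hypothetical.

```python
import math

def knn_mean(x_target, x_ref, y_ref, k, exclude=None):
    """Mean Y of the k nearest reference units, optionally skipping one reference unit."""
    candidates = [j for j in range(len(x_ref)) if j != exclude]
    candidates.sort(key=lambda j: math.dist(x_target, x_ref[j]))
    return sum(y_ref[j] for j in candidates[:k]) / k

def dif_total(x_pop, sample_idx, y_sample, k):
    """Eq. 3: kNN total plus the bias adjustment N/n * sum of (estimate - observed) over s.

    x_pop      X vectors for all N population units
    sample_idx population indices of the n reference units
    y_sample   observed Y values of the reference units (same order as sample_idx)
    """
    N, n = len(x_pop), len(sample_idx)
    x_ref = [x_pop[i] for i in sample_idx]
    position = {i: j for j, i in enumerate(sample_idx)}
    # A reference unit is excluded from its own neighbourhood; target units are not affected.
    y_dif = [knn_mean(x_pop[i], x_ref, y_sample, k, exclude=position.get(i))
             for i in range(N)]
    residuals = [y_dif[i] - y_sample[position[i]] for i in sample_idx]
    return sum(y_dif) - N / n * sum(residuals), residuals

# Hypothetical toy population of N = 8 units with a single ancillary variable.
x_pop = [(float(v),) for v in range(8)]
sample_idx = [0, 2, 5, 7]
y_sample = [1.0, 3.0, 6.0, 12.0]

total, residuals = dif_total(x_pop, sample_idx, y_sample, k=2)
print(total, residuals)
```

The residuals returned here are the quantities that enter a variance estimator of the difference type.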
The BRR variance estimator is design unbiased when (i) the pseudoreplications form a balanced and orthogonal design, (ii) the sampling variance is inversely proportional to sample size, and (iii) the estimator applied to a halfsample is identical to the estimator applied to the original sample (Wolter 1985, chap. 3). Condition iii is not satisfied by a kNN estimator, since the composition of selected reference units depends on both the sample and the sample size. We tried the BRR estimator without any modification, but our results were disappointing. However, we discovered a significant improvement when we augmented each halfsample with 0.5n imputations derived from the halfsample itself and the X values associated with the complementary halfsample (i.e., the reference units not in a halfsample). With this modification, the composition of selected reference units (but not their values) remains constant across the pseudoreplicates, and it is identical to the composition in the original whole-sample estimator. Thus, the imputations ensure that condition iii can be met. The imputation scheme also reflects the fact that X is known for all units in U and thus fixed across all possible samples.

Details on how to construct a balanced and orthogonal design of pseudoreplications can be found in, for example, Wolter (1985, p. 320). We chose L = 96 pseudoreplications. The benefits of increasing this number to, say, 124 or even higher were negligible in our examples. Today, the maximum possible size of an orthogonal and balanced design matrix is 992 × 992 (Beth et al. 1987), which means that the upper limit for L is 992 and n ≤ 1984.

Augmentation of a halfsample to size n was done via 0.5n kNN imputations with k = 2 and otherwise as per eq. 1, with the halfsample now serving as reference units and the complementary halfsample acting as the target units (with known X values). At this point, we do not know whether a different k or a different imputation procedure might work as well, or perhaps even better. However, k = 2 makes the computation of the average (over halfsamples) covariance among imputed values straightforward (see next section).

Let $y'_{jl}$, $j \in s$, $l = 1, \ldots, L$, denote the pseudoreplicates of the sample data. From each pseudoreplicate, a kNN estimate $\tilde{y}'_{il}$ is obtained for every unit ($i = 1, \ldots, N$) in U as per eq. 1 with $y'_{jl}$ replacing $y_j$. For nearly unbiased imputations, and given the orthogonal and balanced design for the pseudoreplications, we have $L^{-1} \sum_l \tilde{y}'_{jl} \to \tilde{y}_j$ as $L \to \infty$, which suggests that we can interpret the variance of $\tilde{y}'_{jl}$ as an approximation to the sampling variance of $\tilde{y}_j$. Accordingly, and taking into account the imputation-induced correlation between Y values in a pseudoreplication ($y'_{jl}$, $j \in s$), we propose the following modified BRR estimator of sampling variance for a univariate $\tilde{y}_i$:

\[ \widehat{\mathrm{var}}_{BRR}(\tilde{y}_i) = \mathrm{var}(\tilde{y}'_{il}) \left(1 + 2 n^{-1}\right) \left(1 - n N^{-1}\right) \tag{4} \]

where the factor $1 + 2 n^{-1}$ accounts for the abovementioned imputation-induced correlation of Y values in a pseudoreplication and the variance is computed as the "sample variance" over the L pseudoreplications. The factor is derived analytically. Details are given in Appendix A.

Our modified BRR estimator of covariance between two unit-level kNN estimates is derived by noting that their correlation is proportional to the proportion (1/k, 2/k, ..., k/k) of shared reference units and the weights ($w_j$) in the kNN estimator (cf. eq. 1). With our (univariate) weights fixed at $k^{-1}$, the estimator of the sampling covariance between two distinct units in a population becomes

\[ \widehat{\mathrm{cov}}_{BRR}(\tilde{y}_i, \tilde{y}_j) = C_{ij} k^{-1} \sqrt{\widehat{\mathrm{var}}_{BRR}(\tilde{y}_i)\, \widehat{\mathrm{var}}_{BRR}(\tilde{y}_j)} \tag{5} \]

where $C_{ij}$ is the number of shared reference units between $\tilde{y}_i$ and $\tilde{y}_j$. A differential weighting scheme requires replacing $C_{ij} k^{-1}$ with a sum of crossproducts of weights associated with the shared reference units. After combining the unit-level estimators in eqs. 4 and 5, the modified BRR estimator of sampling variance of a kNN total for a univariate Y becomes

\[ \widehat{\mathrm{var}}_{BRR}[\tilde{T}_y] = \sum_{i, j \in U} \widehat{\mathrm{cov}}_{BRR}(\tilde{y}_i, \tilde{y}_j) \tag{6} \]
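To make the combination of eqs. 4 to 6 concrete, the following Python sketch computes a unit-level BRR variance from a list of pseudoreplicate estimates and then sums the pairwise covariances into the variance of a total. It assumes equal weights 1/k and that the "sample variance" uses the divisor L - 1; the construction of the balanced halfsamples and the imputation step are not shown, and the function names and toy inputs are hypothetical.

```python
import math

def brr_unit_variance(replicate_estimates, n, N):
    """Eq. 4: sample variance over L pseudoreplicates times (1 + 2/n)(1 - n/N)."""
    L = len(replicate_estimates)
    mean = sum(replicate_estimates) / L
    s2 = sum((v - mean) ** 2 for v in replicate_estimates) / (L - 1)  # divisor assumed
    return s2 * (1 + 2 / n) * (1 - n / N)

def brr_total_variance(unit_var, neighbours, k):
    """Eqs. 5 and 6: sum of C_ij / k * sqrt(var_i * var_j) over all pairs (i, j).

    unit_var   eq. 4 variances of the units included in the total
    neighbours for each unit, the identifiers of its k nearest reference units
    """
    total = 0.0
    for var_i, nn_i in zip(unit_var, neighbours):
        for var_j, nn_j in zip(unit_var, neighbours):
            shared = len(set(nn_i) & set(nn_j))  # C_ij; equals k when i == j
            total += shared / k * math.sqrt(var_i * var_j)
    return total

# Hypothetical inputs: three units, k = 2; units 0 and 1 share one reference unit.
unit_var = [4.0, 4.0, 9.0]
neighbours = [(0, 1), (1, 2), (3, 4)]
var_total = brr_total_variance(unit_var, neighbours, k=2)

var_unit = brr_unit_variance([1.0, 3.0], n=2, N=4)
print(var_total, var_unit)
```

The double loop visits every pair of units, so its cost grows with the square of the number of units in the total, which is presumably why a shortcut is attractive for large populations.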
The BRR estimator for an average is obtained by dividing the variance in eq. 6 by $N^2$. For large populations, we recommend an accurate time-saving computational shortcut (see Appendix A for details). The estimator in eq. 6 works for any small area in U irrespective of the distribution of sample units across small areas. The estimator also applies without modification to cluster sampling if clusters are treated as the unit of observation and cluster totals of X and Y are used as data. The modified BRR estimator of variance also adapts to cluster sampling with unit-level kNN estimation. Details are given in Appendix A.

The DIF variance estimator for a kNN DIF estimate of a univariate total under SRSwor is (Baffetta et al. 2009; eq. 6)

\[ \widehat{\mathrm{var}}_{DIF}\left(\tilde{T}_y^{dif}(U)\right) = \frac{N_U (N_U - n)}{n}\, \mathrm{var}(e_j)_{j \in s} \tag{7} \]

where $e_j = \tilde{y}_j^{dif} - y_j$. The DIF variance estimator extends naturally to small-area estimation problems and cluster sampling with equal-sized clusters but not to cluster sampling with unit-level kNN estimation. To produce a reliable estimate of variance for a small area, the sample size in the area of interest should not be too small.

Assessment of estimators

The performance of the proposed modified BRR estimators of a kNN population total is assessed in a Monte Carlo (MC) simulation of random sampling from seven populations (see next section) with a sample size of n. Sampling was repeated 10 000 times, and for each replicate, we obtained estimates of totals (eqs. 2 and 3) and their estimated sampling variances (eqs. 6 and 7). The MC average of an estimate is denoted by an "MC" subscript, whereas the averages of estimated values are identified by their abbreviations BRR and DIF, respectively. Calculated 95% confidence intervals (CI95) for a total are assessed by the proportion (P) of intervals that include the true total. We use the 2.5 and 97.5 percentiles of Student's t distribution with n – 1 degrees of freedom to compute a CI95.
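A DIF-type variance estimate can be sketched in Python as shown next. The sketch assumes the standard SRSwor form, N(N - n)/n times the sample variance of the residuals (divisor n - 1), which is our reading of eq. 7; the function name and the residual values are hypothetical.

```python
def dif_variance(residuals, N):
    """Variance of a DIF total under SRSwor, assuming N (N - n) / n * sample variance of residuals."""
    n = len(residuals)
    if n < 2:
        raise ValueError("at least two sample units are needed to estimate a variance")
    mean = sum(residuals) / n
    s2 = sum((e - mean) ** 2 for e in residuals) / (n - 1)
    return N * (N - n) / n * s2

# Hypothetical residuals (kNN estimate minus observed value) for n = 4 reference units.
residuals = [3.5, 0.5, 1.5, -7.5]
variance = dif_variance(residuals, N=8)
standard_error = variance ** 0.5
print(variance, standard_error)
```

The explicit check on n mirrors the requirement that at least two sample units fall in the area of interest.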
For strictly positive Y values, the CI95s were first estimated on a logarithmic scale and then back-transformed to the original scale to avoid negative lower confidence limits (Lloyd 1999, p. 101). To facilitate a comparison of results, we report on relative bias (MC average of the estimated totals minus the true value, divided by the true value) and relative standard errors (the square root of an estimated variance of a population quantity divided by its true value).

Applications

The performance of the proposed modified BRR variance estimator for a kNN estimate of a population (small-area) total (cf. eq. 6) is demonstrated with MC simulation of random sampling from seven artificial populations: four (FIA1, FIA2, MIN, and PEAT) with real data from national forest inventories and three (GAU576_0.2, GAU576_0.4, and GAU576_0.6) with simulated data. Results are based on 6000 MC replications, and k is fixed at the value that produces the lowest relative root mean squared error with the DIF estimator ($\mathrm{RRMSE} = \sqrt{\mathrm{variance} + \mathrm{bias}^2} / T_y$). However, in our test settings, the k value yielding a minimum RRMSE was the same for both DIF and BRR.

The sampling design for the MC simulations for FIA1, FIA2, MIN, and PEAT is SRSwor with sample size n = 24. Trials with a larger (32) and a smaller (16) sample size did not reveal any new trends not covered by n = 24. For the three GAU576 populations, we ran MC simulations with both SRSwor (n = 48) and cluster sampling without replacement (CLUwor) (n = 12). Again, no new trends emerged if sample sizes were increased or lowered by 12.

FIA1 and FIA2

Data for these populations came from the Forest Inventory and Analysis (FIA) program of the US Forest Service. FIA1 and FIA2 are located in forested areas in Minnesota. A population unit is a Landsat 7 ETM+ pixel approximately 25 m × 25 m in size with a GPS colocated FIA plot providing trivariate vectors of $y_i$, $i = 1, \ldots, N$ (Bechtold and Patterson 2005). We assume that the data observed on the plots adequately characterize the population units containing the plot centers whose coordinates were determined from GPS. Variables in y are Y1 = number of trees per hectare (TPH), Y2 = basal area (square metres per hectare) (BA), and Y3 = merchantable volume (cubic metres per hectare) (VOL). Our $x_i$, $i = 1, \ldots, N$, was available in the form of a trivariate single-index model transform (Härdle et al. 1993) of a 12-dimensional vector ($x_i$) of Landsat 7 ETM+ derived pixel data (McRoberts 2006; Magnussen et al. 2009). We call these transformations X1, X2, and X3, whereby Xq is the best linear predictor of Yq (q = 1, 2, 3). The choice of X was guided by the convenience of a low-dimensional X. It also facilitates a comparison with results of an earlier study (Magnussen et al. 2009). The relationships between {Y1, Y2, Y3} and the three transforms were not any stronger than the relationships between {Y1, Y2, Y3} and the 12 Landsat-derived variables.
We created FIA1 and FIA2 populations of sizes N = 600 and N = 900 by random draws from two larger pools of FIA data from Minnesota (McRoberts 2006). A data summary for N = 900 is given in Table 1. The two population sizes will reveal whether the ratio of population size to sample size plays an important role for the performance of the BRR variance estimator.

Table 1. Data summary for FIA1 and FIA2 (N = 900).

| Variable | FIA1 Mean | FIA1 Minimum | FIA1 Maximum | FIA1 SD | FIA2 Mean | FIA2 Minimum | FIA2 Maximum | FIA2 SD |
|---|---|---|---|---|---|---|---|---|
| Y = TPH (stems·ha⁻¹) | 1895 | 59 | 15 918 | 2119 | 411 | 60 | 1666 | 291 |
| Y = BA (m²·ha⁻¹) | 18.7 | 0.5 | 68.9 | 13.1 | 14.9 | 0.8 | 60.2 | 11.2 |
| Y = VOL (m³·ha⁻¹) | 81 | 0 | 436 | 76 | 83 | 2 | 574 | 74 |
| X1 | 0.8 | 0.4 | 0.9 | 0.1 | 1.1 | 0.6 | 1.1 | 0.1 |
| X2 | 0.7 | 0.4 | 1.0 | 0.1 | 0.8 | 0.3 | 1.3 | 0.1 |
| X3 | 0.8 | 0.5 | 1.0 | 0.1 | 1.1 | 0.6 | 1.1 | 0.1 |

| Correlations $\hat{r}(X, Y)$ | FIA1 TPH | FIA1 BA | FIA1 VOL | FIA2 TPH | FIA2 BA | FIA2 VOL |
|---|---|---|---|---|---|---|
| X1 | 0.06 | 0.31 | 0.27 | 0.20 | 0.21 | 0.20 |
| X2 | 0.18 | 0.15 | 0.08 | 0.27 | 0.20 | 0.17 |
| X3 | 0.02 | 0.29 | 0.29 | 0.18 | 0.21 | 0.21 |

MIN and PEAT

Data for these two populations came from forest inventory plots in the 9th Finnish National Forest Inventory located in the eastern part of central Finland (North Karelia and South Savo). A population unit is a quarter of a Landsat 7 ETM+ image pixel, approximately 12.5 m × 12.5 m in size. Each unit has a colocated forest inventory plot providing $y_i$, $i = 1, \ldots, N$. Here, the Y variables are Y1 = basal-area-weighted diameter of tree stems 1.3 m above ground (DBH), Y2 = basal area (square metres per hectare) of tree stems 1.3 m above ground (BA), and Y3 = total stem volume (cubic metres per hectare) (VOL). As above (and for the same reasons), our $x_i$ variables were trivariate single-index model transforms of nine Landsat 7 ETM+ bands (scene path 186, rows 16 and 17, acquisition date 10 June 2000). Details are given in Tomppo and Halme (2004). MIN and PEAT populations of sizes N = 600 and N = 900 were generated by random sampling from a larger pool of available units with known X and Y values and located on, respectively, mineral and peat soils. A data summary for N = 900 is given in Table 2.

Table 2. Data summary for MIN and PEAT (N = 900).

| Variable | MIN Mean | MIN Minimum | MIN Maximum | MIN SD | PEAT Mean | PEAT Minimum | PEAT Maximum | PEAT SD |
|---|---|---|---|---|---|---|---|---|
| Y = DBH (cm) | 19.8 | 8.0 | 43.0 | 7.0 | 16.4 | 7.0 | 40.0 | 5.3 |
| Y = BA (m²·ha⁻¹) | 21.3 | 1.0 | 43.0 | 8.1 | 17.2 | 1.0 | 45.0 | 7.0 |
| Y = VOL (m³·ha⁻¹) | 154 | 0 | 485 | 93 | 105 | 0 | 445 | 71 |
| X1 | 0.9 | 0.7 | 1.2 | 0.1 | 0.9 | 0.7 | 1.3 | 0.1 |
| X2 | 1.0 | 0.8 | 1.3 | 0.1 | 0.9 | 0.7 | 1.4 | 0.1 |
| X3 | 1.0 | 0.8 | 1.3 | 0.1 | 0.9 | 0.7 | 1.4 | 0.1 |

| Correlations $\hat{r}(X, Y)$ | MIN DBH | MIN BA | MIN VOL | PEAT DBH | PEAT BA | PEAT VOL |
|---|---|---|---|---|---|---|
| X1 | 0.29 | 0.48 | 0.48 | 0.29 | 0.56 | 0.49 |
| X2 | 0.15 | 0.55 | 0.48 | 0.26 | 0.59 | 0.50 |
| X3 | 0.20 | 0.54 | 0.49 | 0.27 | 0.59 | 0.51 |

GAU576

Three populations with N = 576 units were generated with univariate X and Y values drawn from a bivariate Gaussian distribution with equal means (30) and variances (40) of X and Y. The three populations are distinguished by the correlation coefficient $r_{XY}$ between X and Y. Values of $r_{XY}$ were 0.2, 0.4, and 0.6. They are believed to cover the range of correlations in forestry applications of kNN. Accordingly, the populations are called GAU576_0.2, GAU576_0.4, and GAU576_0.6. Population units were then grouped into 144 square clusters of four units according to their X and Y values. Clusters were formed by the k-means clustering algorithm (Everitt et al. 2001, chap. 4.2) applied to the list of simulated (X, Y) values. The clusters were then transferred to a regular 12 × 12 array in a row-wise fashion, starting at the upper left corner of the array and finishing at the lower right corner. This procedure generated populations with spatially correlated X and Y values. The intracluster correlation coefficients (Cochran 1977, p. 209) were typically 0.7 for both X and Y.

To demonstrate small-area applications, each GAU576 population was split into three equal-sized, spatially compact areas (A1, A2, and A3) composed of 192 units in 48 clusters. Differences in totals among A1, A2, and A3 are purely of a random nature. Estimates of totals $T_y$ (eqs. 2 and 3) and sampling variances (eqs. 6 and 7) were derived for each of these three small areas. Because the DIF variance estimator in eq. 7 requires a minimum sample size of 2, we only retained MC samples with at least two observations in each of A1, A2, and A3. According to the multinomial distribution, there is a chance of approximately 0.16 that the sample size in at least one of the small areas will drop below 2. Note that all small-area estimations are done for the k value that minimized the RRMSE for the population total. It is generally not possible to optimize the choice of k separately in small-area estimation problems due to low sample sizes in these areas.
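The multinomial probability quoted above can be checked with a short Python calculation. The text does not say which sample size the figure of 0.16 refers to; the sketch assumes n = 12 independent draws (the CLUwor sample size) that fall in each of the three areas with probability 1/3, and it ignores the without-replacement nature of the design. The function name is hypothetical.

```python
from math import factorial

def prob_small_area_shortfall(n, minimum=2):
    """P(at least one of three equally likely areas receives fewer than `minimum` of n draws)."""
    p = 0.0
    for a in range(n + 1):
        for b in range(n - a + 1):
            c = n - a - b
            if min(a, b, c) < minimum:
                # Multinomial probability of the allocation (a, b, c) with cell probabilities 1/3.
                ways = factorial(n) // (factorial(a) * factorial(b) * factorial(c))
                p += ways / 3 ** n
    return p

p12 = prob_small_area_shortfall(12)
print(round(p12, 3))
```

Under these assumptions, the result rounds to 0.16, which agrees with the value stated in the text; with n = 48 draws the same calculation gives a probability close to zero.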
Results

FIA1 and FIA2

Results are limited to N = 900, BA, and VOL. Results from N = 600 only confirmed the relative performance of the DIF and BRR estimators. We dropped TPH for reasons of parsimony. Results for TPH were, as expected, the least accurate. A k of around 12 achieved the lowest RRMSE on BA and VOL, but results for 11, 12, and 13 were nearly indistinguishable. $\tilde{T}_y^{dif}$ is a nearly unbiased estimator of the population totals (Table 3). Estimates of (absolute) bias were less than 0.5%. $\tilde{T}_y$ performed in FIA1 nearly as well, but in FIA2, its bias approached 1%. Relative standard errors of $\tilde{T}_y^{dif}$ varied between 14% and 18% and those of $\tilde{T}_y$ between 15% and 20%. BRR estimates of relative standard error (row 6) matched, on average, their MC estimates (row 4). DIF estimates of error (row 7) were approximately 2.5% above their MC estimates (row 5). Results based on the shortcut approximation (Appendix A, eq. A3) were almost identical (no difference larger than 1.3%). Coverage of calculated CI95s based on the BRR estimates of variance varied from 0.94 (VOL, FIA1) to 0.98 (BA, FIA2), while the DIF-based coverage was 0.98.

Table 3. FIA1 and FIA2 summaries of BRR and DIF estimates.

| Row | Estimate | FIA1 BA (m²·ha⁻¹) | FIA1 VOL (m³·ha⁻¹) | FIA2 BA (m²·ha⁻¹) | FIA2 VOL (m³·ha⁻¹) |
|---|---|---|---|---|---|
| 1 | $T_y$ | 17.0 | 89 | 14.9 | 83.8 |
| 2 | $(\tilde{T}_y - T_y)\, T_y^{-1} \times 10^2$ | 0.2 | 0.7 | 0.1 | 0.1 |
| 3 | $(\tilde{T}_y^{dif} - T_y)\, T_y^{-1} \times 10^2$ | −0.3 | −0.5 | −0.1 | −0.2 |
| 4 | $\mathrm{var}_{MC}(\tilde{T}_y)^{0.5}\, T_y^{-1} \times 10^2$ | 14.7 | 18.8 | 16.9 | 19.9 |
| 5 | $\mathrm{var}_{MC}(\tilde{T}_y^{dif})^{0.5}\, T_y^{-1} \times 10^2$ | 14.1 | 18.0 | 15.0 | 17.8 |
| 6 | $\widehat{\mathrm{var}}_{BRR}(\tilde{T}_y)^{0.5}\, T_y^{-1} \times 10^2$ | 14.0 | 18.7 | 17.4 | 20.5 |
| 7 | $\widehat{\mathrm{var}}_{DIF}(\tilde{T}_y^{dif})^{0.5}\, T_y^{-1} \times 10^2$ | 14.3 | 18.2 | 15.6 | 18.4 |
| 8 | $\hat{P}[T_y \in \widehat{CI95}_{BRR}]$ | 0.96 | 0.94 | 0.98 | 0.97 |
| 9 | $\hat{P}[T_y \in \widehat{CI95}_{DIF}]$ | 0.98 | 0.98 | 0.98 | 0.98 |

Note: Relative bias of estimated totals is in rows 2 and 3, relative standard errors are in rows 4–7, and coverage of CI95s is in rows 8 and 9. Population size N = 900.

MIN and PEAT

Results are again limited to N = 900, BA, and VOL for reasons stated above. A k of around 8 in MIN and of around 10 in PEAT achieved the lowest RRMSE. A k value within ±1 of these values produced very similar results. $\tilde{T}_y^{dif}$ is again a nearly unbiased estimator of the population total (Table 4) with estimated bias less than 0.2%. Bias of $\tilde{T}_y$ was within ±1%. Relative standard errors of $\tilde{T}_y$ (row 4) were around 7% for BA and 12% for VOL. For $\tilde{T}_y^{dif}$, they were slightly lower (row 5). BRR estimates of relative standard error (row 6) were approximately within 2% of their MC estimates. Using the shortcut variance approximation (Appendix A, eq. A3) barely changed the results. DIF estimates of standard error (row 7) were about 3% above their MC estimates (row 5). Coverages of the BRR- and DIF-based CI95s were similar and slightly above their nominal value (rows 8 and 9).

Table 4. MIN and PEAT summaries of BRR and DIF estimates.

| Row | Estimate | MIN BA (m²·ha⁻¹) | MIN VOL (m³·ha⁻¹) | PEAT BA (m²·ha⁻¹) | PEAT VOL (m³·ha⁻¹) |
|---|---|---|---|---|---|
| 1 | $T_y$ | 21.8 | 158 | 17.5 | 108.0 |
| 2 | $(\tilde{T}_y - T_y)\, T_y^{-1} \times 10^2$ | 0.6 | 0.2 | 0.6 | 0.3 |
| 3 | $(\tilde{T}_y^{dif} - T_y)\, T_y^{-1} \times 10^2$ | −0.2 | −0.3 | −0.4 | −0.2 |
| 4 | $\mathrm{var}_{MC}(\tilde{T}_y)^{0.5}\, T_y^{-1} \times 10^2$ | 6.4 | 11.2 | 7.3 | 12.0 |
| 5 | $\mathrm{var}_{MC}(\tilde{T}_y^{dif})^{0.5}\, T_y^{-1} \times 10^2$ | 6.3 | 10.7 | 6.9 | 11.8 |
| 6 | $\widehat{\mathrm{var}}_{BRR}(\tilde{T}_y)^{0.5}\, T_y^{-1} \times 10^2$ | 6.5 | 11.4 | 7.4 | 11.9 |
| 7 | $\widehat{\mathrm{var}}_{DIF}(\tilde{T}_y^{dif})^{0.5}\, T_y^{-1} \times 10^2$ | 6.6 | 11.1 | 6.9 | 12.2 |
| 8 | $\hat{P}[T_y \in \widehat{CI95}_{BRR}]$ | 0.97 | 0.96 | 0.96 | 0.95 |
| 9 | $\hat{P}[T_y \in \widehat{CI95}_{DIF}]$ | 0.97 | 0.96 | 0.98 | 0.96 |

Note: Relative bias of the estimated totals is in rows 2 and 3, relative standard errors are in rows 4–7, and coverage of CI95s is in rows 8 and 9. Population size N = 900.
GAU576

Results for SRSwor are summarized in Table 5. All averages of kNN estimates of a population total were within 0.12% of the actual total. As expected, the DIF estimator is nearly unbiased (absolute bias ≤ 0.04%). The bias of $\tilde{T}_y$ and $\tilde{T}_y^{\mathrm{dif}}$ showed no clear trend with $r_{XY}$. The k value producing the lowest RRMSE of a total was 10 for $r_{XY}$ = 0.2, 7 for $r_{XY}$ = 0.4, and 5 for $r_{XY}$ = 0.6. Again, k values 1 larger or 1 smaller would introduce only a very minor change in the RRMSE. MC relative standard errors of $\tilde{T}_y$ dropped from 3.1% in the population with $r_{XY}$ = 0.2 to 2.6% in the population with $r_{XY}$ = 0.6. Corresponding BRR estimates dropped from 3.5% to 3.1%. Results based on the BRR variance approximation (Appendix A, eq. A3) were very similar (within 1.8%). MC relative standard errors of $\tilde{T}_y^{\mathrm{dif}}$ were on a par with those for $\tilde{T}_y$ and were matched by the DIF-based variance estimator (rows 5 and 7). Coverage of BRR-based CI95s for $T_y$ varied from 0.97 to 0.98. DIF-based coverages matched their nominal value to within 0.01.

Table 5. GAU576 summaries of BRR and DIF estimates under SRSwor.

| Row | Estimate | $r_{XY}$ = 0.2, $\hat{k}_{opt}$ = 10 | $r_{XY}$ = 0.4, $\hat{k}_{opt}$ = 7 | $r_{XY}$ = 0.6, $\hat{k}_{opt}$ = 5 |
|---|---|---|---|---|
| 1 | $T_y$ | 17 070 | 17 070 | 17 070 |
| 2 | $(\tilde{T}_y - T_y)\,T_y^{-1} \times 10^2$ | 0.12 | 0.05 | 0.07 |
| 3 | $(\tilde{T}_y^{\mathrm{dif}} - T_y)\,T_y^{-1} \times 10^2$ | –0.03 | –0.04 | –0.01 |
| 4 | $\mathrm{var}_{MC}(\tilde{T}_y)^{0.5}\,T_y^{-1} \times 10^2$ | 3.11 | 2.92 | 2.56 |
| 5 | $\mathrm{var}_{MC}(\tilde{T}_y^{\mathrm{dif}})^{0.5}\,T_y^{-1} \times 10^2$ | 3.03 | 2.98 | 2.59 |
| 6 | $\widehat{\mathrm{var}}_{BRR}(\tilde{T}_y)^{0.5}\,T_y^{-1} \times 10^2$ | 3.49 | 3.38 | 3.07 |
| 7 | $\widehat{\mathrm{var}}_{DIF}(\tilde{T}_y^{\mathrm{dif}})^{0.5}\,T_y^{-1} \times 10^2$ | 3.17 | 3.04 | 2.74 |
| 8 | $\hat{P}[T_y \in \widehat{\mathrm{CI95}}_{BRR}]$ | 0.97 | 0.97 | 0.98 |
| 9 | $\hat{P}[T_y \in \widehat{\mathrm{CI95}}_{DIF}]$ | 0.95 | 0.96 | 0.95 |

Note: Relative bias of the estimated totals is in rows 2 and 3, relative standard errors are in rows 4–7, and coverage of CI95s is in rows 8 and 9.

Small-area estimates for A1, A2, and A3 confirmed the bias reduction potential of the DIF estimator (maximum estimate of bias = 0.3%). Results for A1 are given in Table 6 (results for A2 and A3 are identical to within MC errors). With the general kNN estimator in eq. 2, the maximum estimate of bias of a small-area estimate was 0.8%. All estimates of relative standard error matched (on average) their MC counterparts (within 0.5%) but DIF estimates were almost twice as large, a factor attributed to the smaller sample sizes and to a multinomial variation in small-area sample sizes. Coverage of calculated confidence intervals was conservative for BRR (0.98) and good for DIF (0.96).

Table 6. GAU576 summaries of BRR and DIF estimates for small area A1 under SRSwor.

| Row | Estimate | $r_{XY}$ = 0.2 | $r_{XY}$ = 0.4 | $r_{XY}$ = 0.6 |
|---|---|---|---|---|
| 1 | $T_y$ (A1) | 5706 | 5706 | 5706 |
| 2 | $(\tilde{T}_y - T_y)\,T_y^{-1} \times 10^2$ | –0.2 | –0.3 | –0.2 |
| 3 | $(\tilde{T}_y^{\mathrm{dif}} - T_y)\,T_y^{-1} \times 10^2$ | 0.1 | 0.1 | 0.1 |
| 4 | $\mathrm{var}_{MC}(\tilde{T}_y)^{0.5}\,T_y^{-1} \times 10^2$ | 3.1 | 2.9 | 2.6 |
| 5 | $\mathrm{var}_{MC}(\tilde{T}_y^{\mathrm{dif}})^{0.5}\,T_y^{-1} \times 10^2$ | 3.5 | 3.3 | 3.0 |
| 6 | $\widehat{\mathrm{var}}_{BRR}(\tilde{T}_y)^{0.5}\,T_y^{-1} \times 10^2$ | 5.7 | 5.4 | 4.9 |
| 7 | $\widehat{\mathrm{var}}_{DIF}(\tilde{T}_y^{\mathrm{dif}})^{0.5}\,T_y^{-1} \times 10^2$ | 5.7 | 5.5 | 4.9 |
| 8 | $\hat{P}[T_y \in \widehat{\mathrm{CI95}}_{BRR}]$ | 0.98 | 0.98 | 0.98 |
| 9 | $\hat{P}[T_y \in \widehat{\mathrm{CI95}}_{DIF}]$ | 0.95 | 0.95 | 0.96 |

Note: Relative bias of estimated totals is in rows 2 and 3, relative standard errors are in rows 4–7, and coverage of CI95s is in rows 8 and 9. The k value is 4.

A summary of the results for the CLUwor design with n = 12 (clusters) is given in Table 7. Clusters are regarded as a single unit with cluster totals of X and Y as data. Owing to a strong intracluster correlation of Y values, the efficiency of this design relative to an SRSwor design with the same total number of sample units (48) is only about 0.3 (Cochran 1977, p. 21). On a relative scale, however, the performance differences between BRR and DIF estimators remained as under SRSwor. Small-area results indicated a serious bias for totals estimated via eq. 2 (maximum bias = 3% with $r_{XY}$ = 0.2, 5% with $r_{XY}$ = 0.4, and 6% with $r_{XY}$ = 0.6). In contrast, DIF-based totals had a much smaller bias (maximum 1%). DIF and BRR estimates of relative errors matched their MC values (to within 10%) but DIF-based estimates were, as for SRSwor and for the same reasons, again almost twice as large as the error of a population total. Given a strong correlation among small-area kNN totals (>0.9), the BRR relative error of a small-area total was just slightly larger than the error of the population total. Coverage of DIF-based intervals was good (0.94–0.96) whereas BRR-based intervals, due to the bias problem, rarely achieved the nominal level (range = 0.82–0.96, mean = 0.90).

Table 7. GAU576 summaries of BRR and DIF estimates under CLUwor.

| Row | Estimate | $r_{XY}$ = 0.2 | $r_{XY}$ = 0.4 | $r_{XY}$ = 0.6 |
|---|---|---|---|---|
| 1 | $T_y$ | 17 070 | 17 070 | 17 070 |
| 2 | $(\tilde{T}_y - T_y)\,T_y^{-1} \times 10^2$ | 0.1 | 0.0 | 0.1 |
| 3 | $(\tilde{T}_y^{\mathrm{dif}} - T_y)\,T_y^{-1} \times 10^2$ | –0.0 | –0.0 | –0.0 |
| 4 | $\mathrm{var}_{MC}(\tilde{T}_y)^{0.5}\,T_y^{-1} \times 10^2$ | 6.0 | 5.6 | 4.7 |
| 5 | $\mathrm{var}_{MC}(\tilde{T}_y^{\mathrm{dif}})^{0.5}\,T_y^{-1} \times 10^2$ | 6.0 | 5.5 | 4.9 |
| 6 | $\widehat{\mathrm{var}}_{BRR}(\tilde{T}_y)^{0.5}\,T_y^{-1} \times 10^2$ | 6.2 | 5.3 | 5.0 |
| 7 | $\widehat{\mathrm{var}}_{DIF}(\tilde{T}_y^{\mathrm{dif}})^{0.5}\,T_y^{-1} \times 10^2$ | 6.4 | 6.1 | 5.2 |
| 8 | $\hat{P}[T_y \in \widehat{\mathrm{CI95}}_{BRR}]$ | 0.96 | 0.98 | 0.98 |
| 9 | $\hat{P}[T_y \in \widehat{\mathrm{CI95}}_{DIF}]$ | 0.96 | 0.95 | 0.96 |

Note: Relative bias of estimated totals is in rows 2 and 3, relative standard errors are in rows 4–7, and coverage of CI95s is in rows 8 and 9. The k value is 4. Clusters are treated as a single unit and cluster totals of X and Y are used as data.

Results for the combination of CLUwor with unit-level kNN estimation are given in Table 8. The MC-based estimates of errors of a population total (row 4) were, apart from MC errors (Koehler et al. 2009), identical to those for CLUwor with cluster-level kNN estimation. BRR estimates of relative errors matched their MC counterparts within 7% and coverage of calculated confidence intervals was good (0.94–0.95). Small-area estimation results were not materially different from those with cluster-level kNN estimation, but we attribute this to the fact that each small area is composed of an equal number of equal-sized clusters. Rotation of units within a cluster in the BRR imputation procedure improved the coverage rates by about 0.05 but otherwise produced no net benefit.

Table 8. GAU576 summaries of BRR estimates under CLUwor and unit-level kNN estimation.

| Row | Estimate | $r_{XY}$ = 0.2, $\hat{k}_{opt}$ = 10 | $r_{XY}$ = 0.4, $\hat{k}_{opt}$ = 7 | $r_{XY}$ = 0.6, $\hat{k}_{opt}$ = 5 |
|---|---|---|---|---|
| 1 | $T_y$ | 17 070 | 17 070 | 17 070 |
| 2 | $(\tilde{T}_y - T_y)\,T_y^{-1} \times 10^2$ | 0.4 | 0.3 | 0.3 |
| 4 | $\mathrm{var}_{MC}(\tilde{T}_y)^{0.5}\,T_y^{-1} \times 10^2$ | 6.0 | 5.4 | 4.8 |
| 6 | $\widehat{\mathrm{var}}_{BRR}(\tilde{T}_y)^{0.5}\,T_y^{-1} \times 10^2$ | 6.4 | 5.6 | 4.8 |
| 8 | $\hat{P}[T_y \in \widehat{\mathrm{CI95}}_{BRR}]$ | 0.94 | 0.95 | 0.94 |
Discussion

The kNN technique of selecting k reference units based on sample-dependent distance ranks in the space of one or more ancillary variables makes it resistant to conventional resampling approaches for estimation of sampling variance (Shao 1996; Chen and Shao 2001). Even spatial subsampling (Ekström and Sjöstedt-De Luna 2004) does not seem to apply in a straightforward manner, as the estimated variance is that of the kNN averages in the blocks of subsampling, not necessarily the sampling variance. The unique feature of orthogonal and balanced halfsample pseudoreplications, paired with the flexibility to mirror the kNN estimator by adding imputations to each halfsample, is what makes BRR suitable for the kNN variance estimation problem. The performance of the BRR estimator hinges on the assumption that the unit-level average of the kNN estimates across the halfsample pseudoreplications is equal to the original kNN estimate. In our examples, the difference was always less
than 1%, which we take as empirical support for the assumption. Thus, whenever it is reasonable to assume that a kNN estimate of a total is nearly unbiased (Mäkelä and Pekkarinen 2001; McRoberts et al. 2002; Tomppo and Halme 2004; McRoberts et al. 2007; Meng et al. 2007; LeMay et al. 2008; Baffetta et al. 2009), we expect our BRR estimator of variance to perform well.

In SRSwor applications of the kNN technique, the good performance and ease of use of the DIF estimator of Baffetta et al. (2009) make it an obvious choice for the estimation of totals and their sampling variance. The residual noise problem ($\tilde{T}_y^{\mathrm{dif}} \neq \tilde{T}_y$) is likely to be trivial for reasonably large populations. The effective bias reduction in DIF also makes it attractive in small-area estimation problems. However, unreliable estimates of variance may result if sample sizes are small. In small-area problems, the advantage of BRR is a more robust estimator of variance, but the potential for a serious bias in an estimated total is a clear drawback.

In forestry applications of the kNN technique, the reference data are typically collected under a probability design with equal-sized clusters (McRoberts et al. 2007; LeMay et al. 2008; Tomppo et al. 2008) but interest centers on unit-level kNN estimation. In this setting, only model-based estimators of mean squared error of totals have been forthcoming (Magnussen et al. 2009). The ability of the "rotated" variant of the BRR estimator, in this setting, to produce very reasonable estimates of variance and confidence intervals with good coverage was promising although limited to a case with univariate Gaussian values for X and Y. Further testing and application to real data are needed to complete a more thorough assessment.

For practical reasons (computation time), our assessment of the BRR variance estimator has been limited to fairly small populations and relatively small sample sizes. In preliminary studies, we ran MC simulations with slightly larger populations and sample sizes without seeing any sign that the performance of the BRR estimator is sensitive to sample size. Nor does the number k of reference units used in a kNN estimator seem to affect the performance; even when k exceeded 0.5n, we saw no decline.

Despite its apparent complexity, the BRR estimator is straightforward to implement. Design matrices for balanced and orthogonal pseudoreplications are readily available on the internet or in books (e.g., Beth et al. 1987, p. 50). Today, the maximum dimensions of the BRR design matrix are 992 × 992, which means that the sample size cannot exceed 2 × 992 = 1984. This may limit application to a regional scale (Katila 2006; Maselli and Chiesi 2006; LeMay et al. 2008). The time to compute a BRR covariance matrix for a large number of units (>2000) can be taxing even in an environment with high-performance desktop computers. The proposed accurate shortcut, which uses an approximation to the average correlation and to the average crossproduct of standard errors, effectively eliminates the issue of computing time.

Conclusions

In MC simulations of two common sampling designs and seven populations, the proposed BRR resampling-based estimator of sampling variance of a kNN total performed well. Performance was also good when reference units were obtained under cluster sampling and kNN estimation occurred at the unit level. Our BRR variance of a total is the sum of variances and covariances of the units entering the total. This feature makes the BRR estimator suitable for small-area estimation problems.
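The statement that the variance of a total is the sum of the unit-level variances and covariances, and that this carries over to any small area, can be illustrated with a short sketch. The covariance matrix below is a made-up toy example and the function name `variance_of_total` is hypothetical; in practice the matrix would come from the BRR pseudoreplications.

```python
import numpy as np

def variance_of_total(cov, units=None):
    """Variance of a total as the sum of unit-level variances and covariances.

    cov   : N x N covariance matrix of unit-level kNN estimates
    units : optional indices of the units in a small area; all units if None
    """
    cov = np.asarray(cov, dtype=float)
    if units is None:
        return float(cov.sum())
    idx = np.asarray(units, dtype=int)
    # Restrict to the rows and columns of the small-area units, then sum.
    return float(cov[np.ix_(idx, idx)].sum())

# Toy covariance matrix for three units (illustrative values only).
toy_cov = [[4.0, 1.0, 0.0],
           [1.0, 9.0, 2.0],
           [0.0, 2.0, 1.0]]

print(variance_of_total(toy_cov))          # all three units
print(variance_of_total(toy_cov, [0, 1]))  # small area made of units 0 and 1
```

The same function serves the population and any small area because a small-area variance only needs the block of the covariance matrix that belongs to the units in that area.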
Acknowledgements

We are indebted to Dr. Lorenzo Fattorini at the Università degli Studi di Siena, Italy, for his helpful comments, suggestions, and challenges to earlier versions of this manuscript and his kind permission to use the data detailed in Baffetta et al. (2009) for a separate testing of our proposed variance estimator. Our thanks also extend to three anonymous journal referees and the journal editors who provided numerous comments and suggestions for improvement to an earlier version of our manuscript.
References

Baffetta, F., Fattorini, L., Franceschi, S., and Corona, P. 2009. Design-based approach to k-nearest neighbours technique for coupling field and remotely sensed data in forest surveys. Remote Sens. Environ. 113(3): 463–475. doi:10.1016/j.rse.2008.06.014.
Bechtold, W.A., and Patterson, P.L. 2005. The enhanced forest inventory and analysis program – national sampling design and estimation procedures. U.S. For. Serv. Gen. Tech. Rep. SRS-80.
Beth, T., Jungnickel, D., and Lenz, H. 1987. Design theory. 2nd ed. Cambridge University Press, New York.
Chen, J.H., and Shao, J. 2001. Jackknife variance estimation for nearest-neighbor imputation. J. Am. Stat. Assoc. 96(453): 260–269. doi:10.1198/016214501750332839.
Cochran, W.G. 1977. Sampling techniques. Wiley, New York.
Ekström, M., and Sjöstedt-De Luna, S. 2004. Subsampling methods to estimate the variance of sample means based on non-stationary spatial data with varying expected values. J. Am. Stat. Assoc. 99(465): 82–95. doi:10.1198/016214504000000106.
Everitt, B.S., Landau, S., and Leese, M. 2001. Cluster analysis. 4th ed. Arnold, London, U.K.
Finley, A.O., McRoberts, R.E., and Ek, A.R. 2006. Applying an efficient k-nearest neighbor search to forest attribute imputation. For. Sci. 52(2): 130–135.
Finley, A.O., Banerjee, S., and McRoberts, R.E. 2008. A Bayesian approach to multi-source forest area estimation. Environ. Ecol. Stat. 15(2): 241–258. doi:10.1007/s10651-007-0049-5.
Gilbert, B., and Lowell, K. 1997. Forest attributes and spatial autocorrelation and interpolation: effects of alternative sampling schemata in the boreal forest. Landsc. Urban Plan. 37(3–4): 235–244. doi:10.1016/S0169-2046(97)80007-2.
Gregoire, T.G. 1998. Design-based and model-based inference in survey sampling: appreciating the difference. Can. J. For. Res. 28(10): 1429–1447. doi:10.1139/cjfr-28-10-1429.
Haara, A., Maltamo, M., and Tokola, T. 1997. The k-nearest-neighbour method for estimating basal area diameter distribution. Scand. J. For. Res. 12(2): 200–208. doi:10.1080/02827589709355401.
Härdle, W., Hall, P., and Ichimura, H. 1993. Optimal smoothing in single-index models. Ann. Stat. 21(1): 157–178. doi:10.1214/aos/1176349020.
Katila, M. 2006. Empirical errors of small area estimates from the multisource national forest inventory in eastern Finland. Silva Fenn. 40(4): 729–742.
Kish, L., and Frankel, M.R. 1970. Balanced repeated replications for standard errors. J. Am. Stat. Assoc. 65(331): 1071–1094. doi:10.2307/2284276.
Koehler, E., Brown, E., and Haneuse, J.-P.A. 2009. On the assessment of Monte Carlo error in simulation-based statistical analyses. Am. Stat. 63(2): 155–162. doi:10.1198/tast.2009.0030.
Lahiri, S.N. 2003. Resampling methods for dependent data. Springer, New York.
LeMay, V., Maedel, J., and Coops, N.C. 2008. Estimating stand structural details using nearest neighbor analyses to link ground data, forest cover maps, and Landsat imagery. Remote Sens. Environ. 112(5): 2578–2591. doi:10.1016/j.rse.2007.12.007.
Lloyd, C.J. 1999. Analysis of categorical variables. John Wiley, New York.
Magnussen, S., McRoberts, R.E., and Tomppo, E. 2009. Model-based mean square error estimators for k-nearest neighbour predictions and applications using remotely sensed data for forest inventories. Remote Sens. Environ. 113(3): 476–488. doi:10.1016/j.rse.2008.04.018.
Mäkelä, H., and Pekkarinen, A. 2001. Estimation of timber volume at the sample plot level by means of image segmentation and Landsat TM imagery. Remote Sens. Environ. 77(1): 66–75. doi:10.1016/S0034-4257(01)00194-8.
Maselli, F., and Chiesi, M. 2006. Evaluation of statistical methods to estimate forest volume in a Mediterranean region. IEEE Trans. Geosci. Remote Sens. 44(8): 2239–2250. doi:10.1109/TGRS.2006.872074.
McRoberts, R.E. 2006. A model-based approach to estimating forest area. Remote Sens. Environ. 103(1): 56–66. doi:10.1016/j.rse.2006.03.005.
McRoberts, R.E. 2009. Diagnostic tools for nearest neighbors techniques when used with satellite imagery. Remote Sens. Environ. 113(3): 489–499. doi:10.1016/j.rse.2008.06.015.
McRoberts, R.E., Nelson, M.D., and Wendt, D.G. 2002. Stratified estimation of forest area using satellite imagery, inventory data, and the k-Nearest Neighbors technique. Remote Sens. Environ. 82(2–3): 457–468. doi:10.1016/S0034-4257(02)00064-0.
McRoberts, R.E., Tomppo, E.O., Finley, A.O., and Heikkinen, J. 2007. Estimating areal means and variances of forest attributes using the k-Nearest Neighbors technique and satellite imagery. Remote Sens. Environ. 111(4): 466–480. doi:10.1016/j.rse.2007.04.002.
Meng, Q.M., Cieszewski, C.J., Madden, M., and Borders, B.E. 2007. K nearest neighbor method for forest inventory using remote sensing data. GISci. Remote Sens. 44(2): 149–165. doi:10.2747/1548-1603.44.2.149.
Nordman, D.J., and Lahiri, S.N. 2004. On optimal spatial subsample size for variance estimation. Ann. Stat. 32(5): 1981–2027. doi:10.1214/009053604000000779.
Opsomer, J.D., Breidt, F.J., Moisen, G.G., and Kauermann, G. 2007. Model-assisted estimation of forest resources with generalized additive models. J. Am. Stat. Assoc. 102(478): 400–409. doi:10.1198/016214506000001491.
Särndal, C.E., Swensson, B., and Wretman, J. 1992. Model assisted survey sampling. Springer, New York.
Shao, J. 1996. Resampling methods in sample surveys. Statistics, 27(3): 203–237. doi:10.1080/02331889708802523.
Sherman, M. 1996. Variance estimation for statistics computed from spatial lattice data. J. R. Stat. Soc. B, 58(3): 509–523.
Tomppo, E., and Halme, M. 2004. Using coarse scale forest variables as ancillary information and weighting of variables in kNN estimation: a genetic algorithm approach. Remote Sens. Environ. 92(1): 1–20. doi:10.1016/j.rse.2004.04.003.
Tomppo, E., Olsson, H., Stahl, G., Nilsson, M., Hagner, O., and Katila, M. 2008. Combining national forest inventory field plots and remote sensing data for forest databases. Remote Sens. Environ. 112(5): 1982–1999. doi:10.1016/j.rse.2007.03.032.
Wolter, K.M. 1985. Introduction to variance estimation. Springer, New York.
Appendix A

Correlation of Y values in a pseudoreplication of the sample

In each of the L pseudoreplications of the sample, half of the n Y values are kNN imputations with k = 2 obtained from the halfsample selected according to the balanced orthogonal design. The imputed values generate a unit-level correlation between Y values in a pseudoreplication. In the original sample, the units are independent by virtue of their random selection from the population. The among-unit within-pseudoreplication correlation lowers the variance among pseudoreplications of $\tilde{y}'_{jl}$ (Cochran 1977, p. 240) relative to what it would have been with no imputations. We can derive the expected average (across pairs of units and pseudoreplications) unit-level correlation under the assumption that each sample unit appears an equal number of times in the balance of $0.5n \times L$ imputations. The initial stratification of the sample on X and the balanced and orthogonal BRR design make the assumption tenable for all but the sample units with extreme X values.

With our choice of k = 2 for the imputations, any two of the n units in a pseudoreplication can have zero, one, or two units in common. There is a 1 in 4 chance that neither of two randomly selected units is an imputation. In this event, the two drawn units have no unit in common and their correlation is zero according to the randomization of the sample selection. There is a 1 in 2 chance that one of two drawn units is an imputation and the other is not, in which case they can have zero or one unit in common. The expected number of common units is $4n^{-1}$ as per a hypergeometric probability distribution with two draws, one success (the nonimputed unit), and 0.5n halfsample units to choose from. The average correlation according to this scenario is $4n^{-1}k^{-1}$. Finally, there is a 1 in 4 chance that two drawn units are both kNN imputations. In this case, they can have zero, one, or two units in common. The average expected number of units in common is $8n^{-1}$ as per a hypergeometric distribution with two draws, two successes, and 0.5n halfsample units to choose from. For this scenario the average correlation is $8n^{-1}k^{-2}$. Combining the three event scenarios, the expected correlation becomes $0.25 \times 0 + 0.5 \times 4n^{-1}k^{-1} + 0.25 \times 8n^{-1}k^{-2}$. Inserting k = 2, we get $2n^{-1}$. The among-pseudoreplication variance due to this correlation is consequently reduced by a factor $(1 + 2n^{-1})$. To recover the anticipated sampling variance, we must multiply the among-pseudoreplication variance by this factor.

A computational shortcut for the BRR variance estimator

Computation of the BRR covariances in eq. 5 becomes time consuming for large populations. A fast shortcut is possible if one replaces the $0.5N(N - 1)$ distinct covariances with an approximation to their mean value. To keep notation simple, we demonstrate this shortcut for a univariate y, but the shortcut extends with only obvious modifications to the multivariate case.
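The two hypergeometric expectations used in the correlation argument ($4n^{-1}$ and $8n^{-1}$ common units) can be checked by exact enumeration. The sketch below assumes, as the argument does, that the k = 2 donors of an imputed unit are equally likely to be any pair of the 0.5n halfsample units; the function name and the labelling of units are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def expected_common_units(n):
    """Exact expected numbers of shared units for k = 2 donors per imputation.

    Donor pairs are assumed equally likely among all pairs of the n/2
    halfsample units (labelled 0 .. n/2 - 1).
    Returns (one imputed + one nonimputed unit, two imputed units).
    """
    half = n // 2
    donor_pairs = list(combinations(range(half), 2))
    # Case 1: the nonimputed unit is halfsample unit 0; it is "in common"
    # when it is one of the two donors of the imputed unit.
    one_imputed = Fraction(sum(0 in pair for pair in donor_pairs), len(donor_pairs))
    # Case 2: both units are imputations; fix the first unit's donors as {0, 1}
    # and count the overlap with every possible donor pair of the second unit.
    fixed = {0, 1}
    two_imputed = Fraction(sum(len(fixed & set(pair)) for pair in donor_pairs),
                           len(donor_pairs))
    return one_imputed, two_imputed

one, two = expected_common_units(20)
print(one, two)  # compare with 4/n and 8/n for n = 20
```

By symmetry, fixing the nonimputed unit (or the first donor pair) at particular labels does not change the expectations, which is why a single enumeration suffices.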
The shortcut begins with an approximation to the average correlation between $\tilde{y}'_j$ and $\tilde{y}'_k$, $j \neq k$; $j, k = 1, \ldots, N$, caused by sharing common reference units (note: the above among-unit within-pseudoreplication correlation due to imputations is accounted for in eq. 4). Then, we obtain an approximation to the average over all possible pairs of units in U of the term $\sqrt{\widehat{\mathrm{var}}_{BRR}(\tilde{y}_i)\,\widehat{\mathrm{var}}_{BRR}(\tilde{y}_j)}$ in eq. 5. We then multiply these two approximations to get the approximation to the average covariance.

An approximation to the average correlation ($r_{\tilde{y}}(i,j)$) is calculated by tracking the number of times each reference unit appears as one of the k nearest neighbours in $\tilde{y}_j$, $j = 1, \ldots, N$. Since X is fixed, these numbers would be the same for $\tilde{y}'_j$, $j = 1, \ldots, N$. Let $f_j$ be this number for reference unit j = 1, ..., n. After some simple algebra, we get

[A1] $$\hat{r}_{\tilde{y}}(i,j) \approx \frac{\sum_{j=1}^{n} f_j^2 - N}{(N - 1)\sum_{j=1}^{n} f_j} \quad \text{with} \quad \sum_{j=1}^{n} f_j = kN$$
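The bookkeeping behind this step, tallying how often each reference unit serves as a neighbour and turning the tallies into an average pairwise overlap, can be sketched as follows. This is an illustration under equal neighbour weights, not a transcription of eq. A1: it computes the exact average fraction of shared reference units over all ordered pairs of target units via the counting identity $\sum_j f_j (f_j - 1) / [kN(N - 1)]$, which is an assumption of the sketch. The function names and the toy neighbour lists are hypothetical.

```python
import numpy as np

def neighbour_counts(neighbours, n_ref):
    """f_j: number of target units that use reference unit j as a neighbour."""
    return np.bincount(np.asarray(neighbours).ravel(), minlength=n_ref)

def mean_shared_fraction(neighbours, n_ref):
    """Average share of common reference units over all pairs of target units.

    neighbours : N x k integer array; row i lists the k distinct reference
                 units used for target unit i (equal weights assumed)
    """
    nb = np.asarray(neighbours)
    n_units, k = nb.shape
    f = neighbour_counts(nb, n_ref)
    # A reference unit used by f_j targets is shared by f_j * (f_j - 1)
    # ordered pairs of distinct targets.
    shared_pairs = int(np.sum(f * (f - 1)))
    return shared_pairs / (k * n_units * (n_units - 1))

# Four target units, k = 2, four reference units (toy data).
toy_neighbours = [[0, 1], [1, 2], [0, 1], [2, 3]]
print(neighbour_counts(toy_neighbours, 4))
print(mean_shared_fraction(toy_neighbours, 4))
```

Working from the tallies avoids looping over all N(N – 1) pairs, which is the point of a count-based shortcut when N is large.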
A differential weighting of reference units (cf. eq. 1) will, everything else being equal, lower the average correlation between two unit-level kNN estimates. Conversely, it increases their sampling variance. Assuming a uniform distribution of weights on the open interval (0,1), a second-order approximation to the correlation between two unit-level kNN estimates sharing $k_s$ reference units is $k_s k^{-1} \times 0.78$. Similar approximations can be obtained analytically for nonuniform distributions of weights.
To get the expected value of $\sqrt{\widehat{\mathrm{var}}_{BRR}(\tilde{y}_i)\,\widehat{\mathrm{var}}_{BRR}(\tilde{y}_j)}$ in eq. 5 over all distinct pairs of units, we used a first-order Taylor series approximation, i.e.,

[A2] $$E_{ij}\left[\sqrt{\widehat{\mathrm{var}}_{BRR}(\tilde{y}_i)\,\widehat{\mathrm{var}}_{BRR}(\tilde{y}_j)}\right] \approx \mathrm{Mean}[\widehat{\mathrm{var}}(\tilde{y}_i)]\left(1 - \frac{\mathrm{var}[\widehat{\mathrm{var}}(\tilde{y}_i)]}{8\,\mathrm{Mean}[\widehat{\mathrm{var}}(\tilde{y}_i)]^2}\right)$$
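Inserting the approximations in eqs. A1 and A2 into the modified BRR estimator of a variance of a kNN total (eq. 6), we arrive at

[A3] $$\widehat{\mathrm{var}}_{BRR}[\tilde{T}_y] \approx \sum_{i \in U} \widehat{\mathrm{var}}(\tilde{y}_i) + \hat{r}_{\tilde{y}}(i,j)\,\mathrm{Mean}[\widehat{\mathrm{var}}(\tilde{y}_i)]\left(1 - \frac{\mathrm{var}[\widehat{\mathrm{var}}(\tilde{y}_i)]}{8\,\mathrm{Mean}[\widehat{\mathrm{var}}(\tilde{y}_i)]^2}\right)$$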
In our applications, a variance estimate obtained via eq. A3 was always within 2% of the estimate calculated directly via eq. 6. The shortcut BRR estimator also applies to small-area estimation problems by constraining the above calculations to the units in a small area.

A BRR variance estimator for cluster sampling with unit-level kNN estimation

Treating clusters as computational units and cluster totals of X and Y as data may not be a desired option for the analyst, since it invariably raises complex questions about how to do the kNN estimation for units that cannot be naturally assembled into clusters as defined by the design. Besides, unit-level estimates are usually needed for mapping purposes and small-area estimation. We now outline how to adapt the estimator in eq. 4 to accommodate unit-level kNN estimation under a cluster sampling design.

Let m denote the number of units in a cluster. In the modified BRR estimator of variance, half of the sample is used as donors of Y values to a complementary halfsample. Regard the complementary halfsample as 0.5n imputation clusters with m fixed X values and m missing Y values. The m missing Y values in an imputation cluster are now replaced by the average within-cluster Y values of two donor clusters in the halfsample that are nearest neighbours (in X space) to the imputation cluster. The distance between X values in two clusters is the sum of the m unit-level distances. After completing the imputation step, each pseudoreplicate is composed of $m \times n$ paired values of X and Y that are then used for unit-level kNN estimation (eq. 1) and summing to totals (eq. 2).

As for SRS, the use of imputed Y values lowers the among-replication variance due to the correlation induced by the imputations. An analytical derivation of how the imputations affect the among-replication variance would be complex and difficult. Instead, we propose to estimate it directly from the L pseudoreplications. We get

[A4] $$\widehat{\mathrm{var}}_{BRR.CLU}(\tilde{y}_i) = \mathrm{cov}(\tilde{y}'_{il})\,(1 + \hat{r}_{CLU})\,(1 - nN^{-1})$$

where $\hat{r}_{CLU}$ is the $q \times q$ matrix of a direct estimate of the average unit-level Pearson's product moment correlation coefficient of among-unit Y values across L orthogonal and balanced pseudoreplications. No further modifications to the estimator in eq. 7 are needed.

It can be argued that the above imputation scheme of unit-level Y values can be improved by searching for the two nearest donor clusters with a rotated configuration of units that minimizes their distance to the X values in the imputation cluster. With m = 4, there are 24 configurations for each donor cluster to be considered. We recommend the rotated version of the imputation procedure, as it performs better than unrotated imputations.

Reference

Cochran, W.G. 1977. Sampling techniques. Wiley, New York.
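The cluster distance and its rotated variant described in this appendix can be sketched as follows. The sketch assumes a univariate X and an absolute-difference unit-level distance, and it treats a "configuration" as any ordering of the m donor units (m! = 24 for m = 4); the function names and the example values are hypothetical.

```python
from itertools import permutations
import numpy as np

def cluster_distance(x_target, x_donor):
    """Sum of the m unit-level distances between two clusters (univariate X)."""
    a = np.asarray(x_target, dtype=float)
    b = np.asarray(x_donor, dtype=float)
    return float(np.abs(a - b).sum())

def rotated_cluster_distance(x_target, x_donor):
    """Smallest cluster distance over all orderings of the donor units.

    Returns (minimum distance, best ordering of donor units,
             number of configurations examined).
    """
    donor = np.asarray(x_donor, dtype=float)
    orderings = list(permutations(range(len(donor))))
    distances = [cluster_distance(x_target, donor[list(p)]) for p in orderings]
    best = int(np.argmin(distances))
    return distances[best], orderings[best], len(orderings)

x_imputation = [1.0, 2.0, 3.0, 4.0]   # X values of the imputation cluster
x_donor = [4.0, 3.0, 2.0, 1.0]        # X values of a candidate donor cluster

print(cluster_distance(x_imputation, x_donor))          # unrotated distance
print(rotated_cluster_distance(x_imputation, x_donor))  # best rotated configuration
```

Enumerating all orderings is affordable for small m such as 4; for larger clusters an assignment solver would be a more practical way to find the minimizing configuration.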