Rev Mat Complut DOI 10.1007/s13163-013-0142-2
Nonparametric estimation with mixed data types in survey sampling I. Sánchez-Borrego · J. D. Opsomer · M. Rueda · A. Arcos
Received: 25 February 2013 / Accepted: 21 November 2013 © Universidad Complutense de Madrid 2013
Abstract We consider the problem of finite population mean estimation with mixed data types. A model-assisted estimator based on nonparametric regression is proposed, which can handle discrete and continuous data and incorporates the sampling design in a natural manner. The proposed method shares the design-based properties of the kernel-based model-assisted estimator in the presence of continuous covariates and performs well under different scenarios in simulation experiments. Keywords Model-assisted estimation · Kernel regression · Survey sampling · Nonparametric regression Mathematics Subject Classification (2000)
62D05 · 62G08
I. Sánchez-Borrego · M. Rueda (corresponding author) · A. Arcos: Department of Statistics and Operational Research, University of Granada, Granada, Spain
J. D. Opsomer: Department of Statistics, Colorado State University, Fort Collins, USA

1 Introduction

In many surveys, limited auxiliary information is available for every element of the population of interest. It is common practice in many surveys to attempt to use this
information for improving the estimation of population parameters of interest. Relevant methods for incorporating this information include generalized regression estimators [4], calibration estimators [6], empirical likelihood methods [5] and, more recently, nonparametric regression methods [2,3,25].

Qualitative and discrete variables are encountered in a wide range of fields, including the social, health and medical sciences. Auxiliary information available at the population level is often primarily or even exclusively categorical in nature (e.g. sex, race, potential voters, education levels, occupational categories, etc.), so that nonparametric approaches that can incorporate both continuous and categorical variables are much more appealing in the survey context. The goal of the current article is to explore nonparametric model-assisted estimation when the covariates include both categorical and continuous variables.

Nonparametric regression methods have been used extensively for estimating the regression function in a wide range of fields and areas of research. Nevertheless, standard nonparametric regression smoothers do not handle categorical variables satisfactorily. The traditional approach to handling categorical covariates uses a frequency estimator, which involves splitting the sample into a number of subsets or cells. This approach is known to be unsatisfactory if the number of discrete cells is large relative to the sample size. In a seminal paper, Aitchison and Aitken [1] proposed a method of smoothing discrete variables for estimating probabilities in a multidimensional binary space. Other kernel methods for discrete (categorical) variables are discussed in [9,10,28], and more comprehensively in the monograph [27]. Ouyang et al. [21] considered the problem of estimating the regression function with only discrete (categorical) variables, deriving properties of the data-driven cross-validated smoothing parameter.
Li and Racine [17] also considered a semiparametric varying-coefficient estimator admitting both qualitative and quantitative data. Racine and Li [22] and Li and Racine [16] addressed the estimation of the regression function and the distribution function with mixed data types. The mixed data case under the assumption of dependence is addressed by Li et al. [14].

In a finite population sampling setting, nonparametric methods have been introduced only recently. In the special case of kernel-based methods, Kuo [13] proposed a model-based estimator of the distribution function, using the local constant regression smoother. Breidt and Opsomer [3] and Opsomer and Miller [20] introduced a model-assisted estimator and a bandwidth selector, respectively, by incorporating the sampling design into the local linear kernel smoother. However, to our knowledge, no method allowing for both quantitative and qualitative data has been considered so far in this context. The current article is intended to fill this gap by extending the estimator of [3] so that both types of data can be considered.

The remainder of the article is organized as follows. Section 2 reviews the problem of parametric and nonparametric regression estimation in survey sampling. Section 3 describes the proposed method for estimating the population mean. The properties of the proposed method are obtained in Sect. 4. A simulation study is presented in Sect. 5. The paper includes an "Appendix" containing the proofs of the theorems.
2 Inference in finite population sampling: framework

Let $U = \{1, \ldots, N\}$ be the population of $N$ units from which a random sample $s$ of fixed size $n$ is drawn according to a specified sampling design $d = (S_d, p_d)$, with first-order inclusion probabilities $\pi_j = \sum_{s \ni j} p_d(s)$ and second-order inclusion probabilities $\pi_{jk} = \sum_{s \ni j,k} p_d(s)$. The first- and second-order inclusion probabilities are the probabilities of obtaining unit $j$, and units $j$ and $k$ jointly, respectively, when sampling from the population according to the design $d$. These probabilities are very important because most properties of the estimators are based on them (see [11]).

Let $y_j$ be the value of the response variable $y$ for the $j$th population element. These values are not random (they are fixed but unknown). The objective is to estimate the value of a function of interest $\theta = f(y_1, \ldots, y_N)$. The most frequent functions are the total $Y = \sum_{j \in U} y_j$ and the population mean $\bar Y = N^{-1} \sum_{j \in U} y_j$. The customary design-unbiased estimator of $\bar Y$ is the Horvitz–Thompson type estimator (see [12]), which is given by

$$\bar y_{HT} = N^{-1} \sum_{j \in s} \pi_j^{-1} y_j.$$
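As a minimal numerical sketch of the Horvitz–Thompson mean estimator above (the function name `horvitz_thompson_mean` and the toy population are illustrative assumptions, not part of the original study), the following Python snippet computes $N^{-1}\sum_{j\in s} y_j/\pi_j$ and checks that, under simple random sampling without replacement with $\pi_j = n/N$, it reduces to the sample mean.

```python
import numpy as np

def horvitz_thompson_mean(y_s, pi_s, N):
    """Horvitz-Thompson estimator of the population mean:
    (1/N) * sum over sampled units of y_j / pi_j."""
    y_s = np.asarray(y_s, dtype=float)
    pi_s = np.asarray(pi_s, dtype=float)
    return np.sum(y_s / pi_s) / N

# Toy population (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
N, n = 1000, 75
y = rng.normal(10.0, 2.0, size=N)

# Simple random sampling without replacement: pi_j = n / N for every unit.
s = rng.choice(N, size=n, replace=False)
ht = horvitz_thompson_mean(y[s], np.full(n, n / N), N)

# With equal inclusion probabilities the estimator equals the sample mean.
sample_mean = y[s].mean()
```

With unequal inclusion probabilities the same function applies unchanged; only `pi_s` differs.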
The Horvitz–Thompson estimator has the properties of a good estimator (it is admissible within the class of all unbiased estimators, see [11]), but $\bar y_{HT}$ makes no use of auxiliary information. However, it is common to assume that there exist auxiliary variables (obtained from census results, administrative files, etc.) which can be used at the estimation stage to improve the estimation of the parameter of interest. For this reason, we assume a vector $x$ of auxiliary variables whose values $x_j = (x_{1j}, \ldots, x_{kj})$, $j \in U$, are known for each individual from a previous census. Using the auxiliary information, a variety of approaches are available to construct more efficient estimators of population means and totals, including design-based and model-based methods (see e.g. [26,29]).

Survey sampling is perhaps unique in being the only major area of current statistical activity where inferences are primarily based on the randomization distribution rather than on statistical models for the survey outcomes. It is an area where the debate between randomization-based and model-based inference is most sharply drawn [18].

The first estimator using auxiliary information was the ratio estimator, used by Laplace as early as the mid-1780s. However, the regression estimator

$$\bar y_{REG} = \bar y_{HT} + (\bar X - \bar x_{HT})^{\top} \hat B, \tag{1}$$

where $\hat B = \left(\sum_{j \in s} \pi_j^{-1} x_j x_j^{\top}\right)^{-1} \sum_{j \in s} \pi_j^{-1} x_j y_j$, is more widely used in practice (a thorough review of regression estimation is given in [8]).

The model-based approach relies on superpopulation models, which assume that the population under study $y = (y_1, \ldots, y_N)$ is a realization of random variables having a superpopulation model $\xi$. To incorporate auxiliary information $x_j$ available for all
$j \in U$, one may assume a superpopulation model for $y$ built on some mean function of $x$:

$$y_j = m(x_j) + e_j, \quad j = 1, \ldots, N. \tag{2}$$
The random vector $e = (e_1, \ldots, e_N)$ is assumed to have zero mean and a positive definite, diagonal covariance matrix (the $y_j$ are mutually independent). Once the model is fitted to the sample data, there are two ways to incorporate its predictions into estimation of the mean: the first is model-based and the second is model-assisted.

Prediction theory for sampling surveys (or model-based theory, see [24]) can be considered a general framework for statistical inference on the characteristics of finite populations. Well-known estimators of population totals encountered in the classical theory, such as the expansion, ratio and regression estimators, can be obtained as predictors in a general prediction theory under some special model. When the assumed model is correct, model-based estimators tend to be far better than other estimators; when the model is incorrect, they have an inevitable bias and can do worse than even the naïve estimator.

The second approach is the model-assisted approach. Here the superpopulation model is not the basis for inference (properties are considered under the design), but the model can be useful to motivate the choice of estimator. Under the model-assisted approach, three important methods have been proposed in the literature: generalized regression estimators [4], calibration estimators [6] and empirical likelihood methods [5]. All of these methods have primarily been discussed in the context of parametric models, under a linear regression working model. Usually a parametric method is used to represent the relationship $m(\cdot)$ between the auxiliary data and the study variable. However, specifying such a relationship may be inappropriate or unverifiable.
A natural alternative, first suggested by [13] for the distribution function, is to adopt a nonparametric model-based approach, which does not place any restrictions on the relationship between the auxiliary data and the study variable. Breidt and Opsomer [3] use the traditional local polynomial regression estimator for the unknown function $m(\cdot)$. Let $K_h(u) = h^{-1} K(u/h)$, where $K$ denotes a continuous kernel function and $h$ is the bandwidth. Then a consistent prediction for the unknown $m(x_j)$ is given by

$$\hat m_j = e_1^{\top} (X_{sj}^{\top} W_{sj} X_{sj})^{-1} X_{sj}^{\top} W_{sj} Y_s = w_{sj}^{\top} Y_s, \tag{3}$$

where $e_1 = (1, 0, \ldots, 0)^{\top}$ is a column vector of length $p + 1$, $Y_s = \{y_i\}_{i \in s}$, $W_{sj} = \mathrm{diag}\{K_h(x_i - x_j)\}_{i \in s}$ and $X_{sj} = \{1, (x_i - x_j), \ldots, (x_i - x_j)^p\}_{i \in s}$. The authors then define a model-assisted estimator of the population mean by
$$\bar y_{BO} = \frac{1}{N} \left( \sum_{j \in s} \pi_j^{-1} (y_j - \hat m_j) + \sum_{j \in U} \hat m_j \right). \tag{4}$$
This estimator is well known for its good theoretical and practical properties. A model-based estimator is defined by [25] as

$$\bar y_{MB} = \frac{1}{N} \sum_{j \in s} y_j + \left(1 - \frac{n}{N}\right) \frac{1}{N - n} \sum_{j \in U - s} \hat m_j, \tag{5}$$
which is asymptotically model unbiased and does not use the design probabilities $\pi_j$. All these nonparametric estimators consider only quantitative auxiliary variables. In the next section we consider a more general situation which includes both quantitative and qualitative auxiliary information.

3 Proposed method

As before, let $U = \{1, \ldots, N\}$ be the population of $N$ units from which a random sample $s$ of fixed size $n$ is drawn according to a specified sampling design with first-order inclusion probabilities $\pi_i$ and second-order inclusion probabilities $\pi_{ij}$. Let $y_j$ be the value of the response variable $y$ for the $j$th population element and $x_j = (x_{1j}, \ldots, x_{kj})$ the corresponding values of a vector $x$ containing $q$ discrete and $p$ continuous auxiliary variables, with $q + p = k$. Let $x^d$ represent the sub-vector of $q$ discrete variables and $x^c$ the remaining sub-vector of $p$ continuous ones. We use $x_t^c$ for the $t$th component of $x^c$ and $x_t^d$ for the $t$th component of $x^d$, and assume that $x_t^d$ takes $c_t \ge 2$ different values in $D_t = \{0, 1, \ldots, c_t - 1\}$, $t = 1, \ldots, q$. We consider a general nonparametric regression model that includes both discrete (categorical) and continuous variables:

$$y_j = m(x_j) + e_j, \quad j = 1, \ldots, N, \tag{6}$$
where $m(\cdot)$ is the unknown regression function and the $e_j$, $j = 1, \ldots, N$, are independent and identically distributed with $E(e_j) = 0$ and $Var(e_j) = \sigma^2$, and independent of the $x_j$'s. We denote by $E_M$, $V_M$ the expectation and variance operators over the superpopulation model and by $E_d$, $V_d$ the expectation and variance operators for the sampling design $d$. For the discrete regressors $x_t^d$, $t = 1, \ldots, q$, we use a variation of Aitchison and Aitken's [1] kernel function, defined as

$$l_\lambda(x_{ti}^d - x_{tj}^d) = \begin{cases} 1 & \text{if } x_{ti}^d = x_{tj}^d, \\ \lambda & \text{if } x_{ti}^d \ne x_{tj}^d, \end{cases} \tag{7}$$
where $\lambda$ is the smoothing parameter. The parameter $\lambda$ satisfies $0 \le \lambda \le 1$, with $\lambda = 0$ corresponding to an indicator function and $\lambda = 1$ giving equal weights to all values of its argument. The product kernel for the discrete variables is defined as

$$K_{ij}^d = \prod_{t=1}^{q} l_\lambda(x_{ti}^d - x_{tj}^d). \tag{8}$$
For simplicity of notation, we consider a single $\lambda$ for all variables $x_t^d$, but this can clearly be extended to a separate parameter $\lambda_t$ for each. For the continuous variables, we use $K$ to denote a symmetric, univariate density function and define the product kernel as

$$K_{ij}^c = \frac{1}{h^p} \prod_{t=1}^{p} K\left(\frac{x_{ti}^c - x_{tj}^c}{h}\right), \tag{9}$$
where $h$ is the smoothing parameter and $0 < h < \infty$. The product kernel for the mixed data case is simply the product of $K_{ij}^d$ and $K_{ij}^c$, that is, the result of combining (8) and (9),

$$K_{ij} = \prod_{t=1}^{q} l_\lambda(x_{ti}^d - x_{tj}^d) \, \frac{1}{h^p} \prod_{t=1}^{p} K\left(\frac{x_{ti}^c - x_{tj}^c}{h}\right). \tag{10}$$
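The mixed product kernel in (10) can be sketched numerically as follows. This is only an illustration under stated assumptions: the function names `epanechnikov` and `mixed_product_kernel` are hypothetical, a single $\lambda$ and a single $h$ are used, and the Epanechnikov density is taken as the continuous kernel $K$.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov density on [-1, 1] (one possible choice for K)."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def mixed_product_kernel(xc, xd, h, lam, K=epanechnikov):
    """Matrix of weights K_ij from eq. (10).

    xc : (n, p) array of continuous covariates
    xd : (n, q) array of discrete covariates
    h  : bandwidth for the continuous part
    lam: smoothing parameter in [0, 1] for the discrete part
    """
    xc = np.asarray(xc, dtype=float)
    xd = np.asarray(xd)
    p = xc.shape[1]
    # Continuous part, eq. (9): product over the p components, scaled by h^p.
    u = (xc[:, None, :] - xc[None, :, :]) / h
    Kc = np.prod(K(u), axis=2) / h**p
    # Discrete part, eq. (8): weight 1 on a match, lam on a mismatch.
    match = xd[:, None, :] == xd[None, :, :]
    Kd = np.prod(np.where(match, 1.0, lam), axis=2)
    return Kd * Kc

# Three units with one continuous and two binary covariates (toy values).
xc = np.array([[0.10], [0.15], [0.90]])
xd = np.array([[0, 1], [0, 0], [0, 1]])
Kmat = mixed_product_kernel(xc, xd, h=0.2, lam=0.25)
```

Units 1 and 2 are close in the continuous covariate and differ in one of the two binary covariates, so their weight is the continuous kernel value multiplied by $\lambda$ once; unit 3 lies outside the bandwidth and receives weight zero.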
This product kernel is a weighting function that depends on the "distance" between the vectors $x_i$ and $x_j$, with the weight increasing as the distance decreases. For the categorical portion of $x$, the distance is defined by whether the components $x_{ti}^d$ and $x_{tj}^d$ match or not, and the weight depends on $\lambda$. For the continuous portion, the distance is Euclidean and the weight depends on both the specific choice of the kernel function $K$ and the bandwidth $h$.

Because we are considering the survey context, we propose to estimate $m_j = m(x_j)$, the regression function $m$ at the location $x_j$ (which is based on the entire finite population), by a design-weighted version of a Nadaraya–Watson [19] type kernel estimator. Specifically, we will use as a sample estimator for $m_j = m(x_j)$

$$\hat m_j = \frac{\sum_{i \in s} K_{ij} y_i / \pi_i}{\sum_{i \in s} K_{ij} / \pi_i}. \tag{11}$$
The estimator $\hat m_j$ is a weighted average of the $y_i$, $i \in s$, with the weight for observation $i$ determined by the distance between $x_j$ and $x_i$ as well as by the inclusion probability. Returning now to the definition of the categorical kernel in (7), we see that when $\lambda = 0$, observations for which any of the $x_{ti}^d \ne x_{tj}^d$ do not contribute to $\hat m_j$, since $K_{ij} = 0$, so that we are effectively splitting the overall sample $s$ into subsamples based on the categorical variables. If there are many categorical variables, this will result in small estimation cells and $\hat m_j$ will be highly variable. In contrast, when $\lambda = 1$, $K_{ij}$ does not depend at all on the values of the $x_{ti}^d$, and hence the effect of the categorical variables is completely ignored. This will result in a more stable estimator $\hat m_j$, but if the categorical variables are predictive in model (6), the estimator will be less effective in improving the precision of the model-assisted estimator. Values of $\lambda$ between these two extremes result in "partial pooling" of the sample across the categorical variables, which can reduce the variability of $\hat m_j$ while still capturing a portion of the predictive efficiency of these variables. We will show these effects more clearly in the simulation experiments later on.
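The effect of $\lambda$ described above can be seen in a small numerical sketch of (11) for $p = q = 1$. The function name `nw_design_weighted`, the Epanechnikov kernel and the four toy observations are assumptions made purely for illustration.

```python
import numpy as np

def nw_design_weighted(xc_s, xd_s, y_s, pi_s, xc0, xd0, h, lam):
    """Design-weighted Nadaraya-Watson fit of eq. (11) at one target
    point (xc0, xd0), for one continuous and one discrete covariate."""
    u = (xc_s - xc0) / h
    kc = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0) / h  # Epanechnikov
    kd = np.where(xd_s == xd0, 1.0, lam)                            # eq. (7)
    w = kc * kd / pi_s
    return np.sum(w * y_s) / np.sum(w)

# Four sampled units: two in category 0 (low y), two in category 1 (high y).
xc_s = np.array([0.50, 0.52, 0.48, 0.51])
xd_s = np.array([0, 0, 1, 1])
y_s = np.array([1.0, 1.2, 3.0, 3.2])
pi_s = np.full(4, 0.1)

# Fit at a target in category 0 for three values of lambda.
fits = {lam: nw_design_weighted(xc_s, xd_s, y_s, pi_s, 0.50, 0, h=0.2, lam=lam)
        for lam in (0.0, 0.5, 1.0)}
```

With $\lambda = 0$ only the two category-0 observations contribute, with $\lambda = 1$ all four are pooled regardless of category, and $\lambda = 0.5$ gives a fit between the two.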
Note that if we drop the $\pi_i$'s in (11), then the estimator is the same as in [15, section 4.4, page 137]. Moreover, if we only consider discrete regressors and $\lambda_t = 0$ for all $t = 1, \ldots, k$, the discrete kernel function becomes an indicator function and $\hat m_j$ reduces to the frequency estimator, which involves splitting the sample into a number of subsets or "cells". To ensure that $\hat m_j$ is well defined for every sample $s$, we consider the following adjustment for $\hat m_j$:

$$\hat m_j = \frac{\sum_{i \in s} K_{ij} y_i / \pi_i}{\sum_{i \in s} K_{ij} / \pi_i + \delta / N^2}, \tag{12}$$
for a suitably small $\delta > 0$. This adjustment was also used in [3] in the survey context and previously in [7]. We propose the following nonparametric model-assisted estimator for $\bar y_N = \sum_{j \in U} y_j / N$, the population mean:

$$\bar y_{np} = \frac{1}{N} \sum_{j \in U} \hat m_j + \frac{1}{N} \sum_{j \in s} \frac{y_j - \hat m_j}{\pi_j}. \tag{13}$$
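A compact sketch of the estimator (13), using the adjusted fit (12), is given below for $p = q = 1$. The function names `m_hat` and `model_assisted_mean`, the default `delta`, the Epanechnikov kernel and the toy population are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def m_hat(xc_s, xd_s, y_s, pi_s, xc_t, xd_t, h, lam, N, delta=1e-6):
    """Adjusted design-weighted fit of eq. (12) at every target point.
    Rows of the weight matrix index targets, columns index sampled units."""
    u = (xc_s[None, :] - xc_t[:, None]) / h
    kc = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0) / h
    kd = np.where(xd_s[None, :] == xd_t[:, None], 1.0, lam)
    w = kc * kd / pi_s[None, :]
    return (w @ y_s) / (w.sum(axis=1) + delta / N**2)

def model_assisted_mean(xc, xd, y_s, s, pi, h, lam):
    """Nonparametric model-assisted estimator of eq. (13).

    xc, xd : covariates for all N population units
    y_s    : responses for the sampled units (ordered as in s)
    s      : integer indices of the sampled units
    pi     : first-order inclusion probabilities for all N units
    """
    N = len(xc)
    m_all = m_hat(xc[s], xd[s], y_s, pi[s], xc, xd, h, lam, N)
    return m_all.mean() + np.sum((y_s - m_all[s]) / pi[s]) / N

# Toy population and a simple random sample without replacement.
rng = np.random.default_rng(1)
N, n = 1000, 75
xc = rng.uniform(0.0, 1.0, N)
xd = rng.binomial(1, 0.25, N)
y = xc + 0.5 * xd + rng.normal(0.0, 0.5, N)
s = rng.choice(N, size=n, replace=False)
pi = np.full(N, n / N)

est = model_assisted_mean(xc, xd, y[s], s, pi, h=0.2, lam=0.5)
```

The first term averages the fitted values over the whole population, which is why the covariates must be known for every unit; the second term is a design-weighted correction based on the sample residuals.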
This is the same estimator as in [3], but the nonparametric estimator now allows for both categorical and continuous auxiliary variables. The estimator (13) depends on the values of the tuning parameters $h$ and $\lambda$. In order to investigate a version of the estimator with data-driven tuning parameter selection, we will also consider selecting their values by minimizing the survey cross-validation criterion proposed in [20], which is defined as

$$CV(h, \lambda) = \frac{1}{n} \sum_{i, j \in s} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \, \frac{y_i - \hat m_i^{-i}}{\pi_i} \, \frac{y_j - \hat m_j^{-j}}{\pi_j},$$
where, for each $j \in s$, $\hat m_j^{-j}$ is the estimator $\hat m_j$ computed with the observation $(x_j, y_j)$ left out. Although a single $\lambda$ is introduced for all $x_t^d$ and a single $h$ for all $x_t^c$ for simplicity of notation, this design-based CV criterion can also select separate tuning parameters $\lambda_t$ and $h_t$ for each explanatory variable, at the expense of increased computational burden.

4 Properties of the estimator

For simplicity of exposition of the theoretical results, we consider the case $p = q = 1$ for now, so that $K_{ij} = \frac{1}{h} K\left(\frac{x_i^c - x_j^c}{h}\right) l_\lambda(x_i^d - x_j^d)$. Extending the results to the case $p > 1$, $q > 1$ follows with little modification. Define

$$t_{jg} = \frac{1}{h} \sum_{i \in U} K\left(\frac{x_i^c - x_j^c}{h}\right) l_\lambda(x_i^d - x_j^d) \, y_i^g, \tag{14}$$
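for $g = 0, 1$. Using a Taylor linearization, we write

$$\hat m_j = m_j + \sum_{i \in s} \frac{1}{\pi_i} z_{ji} + R_j, \tag{15}$$

where

$$m_j = \frac{\sum_{i \in U} K_{ij} y_i}{\sum_{i \in U} K_{ij}} = \frac{t_{j1}}{t_{j0}}, \tag{16}$$

$R_j$ is the remainder term and

$$z_{ji} = K_{ij} \frac{1}{t_{j0}} (y_i - m_j). \tag{17}$$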
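The linearization step can be sketched as follows. As an assumption made for this sketch, write $\hat t_{jg} = \sum_{i \in s} K_{ij} y_i^g / \pi_i$ for the sample analogue of (14) and ignore the small $\delta/N^2$ adjustment in (12), so that $\hat m_j = \hat t_{j1} / \hat t_{j0}$. A first-order Taylor expansion of the ratio around $(t_{j1}, t_{j0})$ gives

```latex
\begin{align*}
\hat m_j = \frac{\hat t_{j1}}{\hat t_{j0}}
  &\approx \frac{t_{j1}}{t_{j0}}
     + \frac{1}{t_{j0}} \left( \hat t_{j1} - t_{j1} \right)
     - \frac{t_{j1}}{t_{j0}^{2}} \left( \hat t_{j0} - t_{j0} \right) \\
  &= m_j + \frac{1}{t_{j0}}
     \left[ \left( \hat t_{j1} - m_j \hat t_{j0} \right)
          - \left( t_{j1} - m_j t_{j0} \right) \right] \\
  &= m_j + \sum_{i \in s} \frac{1}{\pi_i} \,
     \frac{K_{ij} \left( y_i - m_j \right)}{t_{j0}},
\end{align*}
```

where the last equality uses $t_{j1} - m_j t_{j0} = 0$, which follows from the definition $m_j = t_{j1}/t_{j0}$. The summand is the first-order term of the expansion, and everything of higher order is collected in the remainder $R_j$.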
We will work under the same asymptotic framework as in [3]. To prove our theoretical results, we make the following assumptions, in which $I_i$ denotes the sample membership indicator of unit $i$ ($I_i = 1$ if $i \in s$ and $I_i = 0$ otherwise).

1. There exists $M_y < \infty$ such that $|y_j| \le M_y$.
2. For each $N$, the $x_j^c$ are considered fixed with respect to the superpopulation model. They are independent and identically distributed with $F(x) = \int_{-\infty}^{x} f(t)\,dt$, where $f(\cdot)$ is a density with compact support $[a_x, b_x]$ and $f(x) > 0$ for all $x \in [a_x, b_x]$.
3. The kernel $K(\cdot)$ has compact support $[-1, 1]$, is symmetric and continuous, and satisfies $\int_{-1}^{1} K(u)\,du = 1$.
4. As $N \to \infty$, $n N^{-1} \to \pi \in (0, 1)$, $h \to 0$, $N h^2 / (\log \log N) \to \infty$ and $\lambda \to 0$, and $\lambda$ satisfies $0 \le \lambda \le 1$.
5. For all $N$, $\min_{j \in U} \pi_j \ge M_1 > 0$, $\min_{i, j \in U} \pi_{ij} \ge M_2 > 0$ and

$$\limsup_{N \to \infty} \, n \max_{i, j \in U : i \ne j} |\pi_{ij} - \pi_i \pi_j| < \infty. \tag{18}$$

6. Additional assumptions involving higher-order inclusion probabilities:

$$\lim_{N \to \infty} n^2 \max_{(i_1, i_2, i_3, i_4) \in D_{4,N}} \left| E_d\left[(I_{i_1} - \pi_{i_1})(I_{i_2} - \pi_{i_2})(I_{i_3} - \pi_{i_3})(I_{i_4} - \pi_{i_4})\right] \right| < \infty, \tag{19}$$

where $D_{t,N}$ denotes the set of all distinct $t$-tuples $(i_1, i_2, \ldots, i_t)$ from $U$,

$$\lim_{N \to \infty} \max_{(i_1, i_2, i_3, i_4) \in D_{4,N}} \left| E_d\left[(I_{i_1} I_{i_2} - \pi_{i_1 i_2})(I_{i_3} I_{i_4} - \pi_{i_3 i_4})\right] \right| = 0, \tag{20}$$

and

$$\limsup_{N \to \infty} \, n \max_{(i_1, i_2, i_3) \in D_{3,N}} \left| E_d\left[(I_{i_1} - \pi_{i_1})^2 (I_{i_2} - \pi_{i_2})(I_{i_3} - \pi_{i_3})\right] \right| < \infty. \tag{21}$$
These assumptions follow assumptions (A1)–(A7) in [3], except for the following two modifications. First, assumptions (A1) and (A3) are replaced by the assumption that the $y_j$ are uniformly bounded, i.e. there exists $M_y < \infty$ such that $|y_j| \le M_y$. This assumption is weaker than (A1) and (A3) but is sufficient here, since we only address the design-based properties of the estimator in the current article. Second, as previously noted, we assume that $\lambda$ satisfies $0 \le \lambda \le 1$.

In the second assumption, the $x_j^c$ are considered fixed with respect to the superpopulation model, so that the results resemble traditional finite population results. This assumption nevertheless ensures that the $x_j^c$ behave as a random sample from a continuous distribution. The fourth assumption imposes conditions such as $h \to 0$ and $N h^2 / (\log \log N) \to \infty$ as $N \to \infty$, which are well known in the general nonparametric regression context (using only continuous variables) as conditions for obtaining a consistent estimator. For the discrete variable case we need $\lambda \to 0$, while for the mixed continuous and discrete variable case we need both sets of conditions. As is commonly done in survey estimation asymptotics, $n N^{-1} \to \pi \in (0, 1)$ is also required to prove our results. The fifth assumption provides bounds on the $\pi_j$ and the $\pi_{ij}$, which we need to establish some theoretical properties of the proposed estimator.

The estimator (13) is a linear combination of the sample values $y_j$, $j \in s$. If we ignore the small-order adjustment $\delta / N^2$ in the denominator, the weights of this linear combination are calibrated to the population size $N$, but not to any other quantities. This is in contrast to the local polynomial estimator in [3], and is due to the use of the product kernel. In practice, calibration to the population size is the most important form of calibration. The following result shows that the proposed estimator is asymptotically design unbiased and design consistent.
Theorem 1 Under assumptions 1–6, $\bar y_{np}$ is asymptotically design unbiased in the sense that

$$\lim_{N \to \infty} E_d(\bar y_{np} - \bar Y) = 0, \tag{22}$$

with probability 1, and is design consistent in the sense that

$$\lim_{N \to \infty} E_d\left(I_{\{|\bar y_{np} - \bar Y| > \varepsilon\}}\right) = 0, \tag{23}$$

with probability 1 for all $\varepsilon > 0$.

The following result gives an expression for the asymptotic mean squared error of the proposed estimator (13).

Theorem 2 Assume 1–6. Then,

$$n E_d(\bar y_{np} - \bar Y)^2 = \frac{n}{N^2} \sum_{i, j \in U} (y_i - m_i)(y_j - m_j) \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j} + o(1). \tag{24}$$
This theorem implies that the estimator $\bar y_{np}$ is asymptotically equivalent to the difference estimator

$$\bar y_{diff} = \frac{1}{N} \sum_{j \in U} m_j + \frac{1}{N} \sum_{j \in s} \frac{y_j - m_j}{\pi_j}. \tag{25}$$

Theorem 3 Assume 1–6. Then the variance estimator

$$\hat V(\bar y_{np}) = \frac{1}{N^2} \sum_{i, j \in s} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \, \frac{y_i - \hat m_i}{\pi_i} \, \frac{y_j - \hat m_j}{\pi_j} \tag{26}$$

is design consistent for the asymptotic design mean squared error of $\bar y_{np}$.

The following result shows that the limiting distributional properties of the generalized difference estimator carry over to the estimator $\bar y_{np}$: for any design for which $\bar y_{diff}$ is asymptotically normally distributed, the estimator $\bar y_{np}$ is likewise asymptotically normally distributed.

Theorem 4 Assume 1–6. For any design for which $\bar y_{diff}$ is asymptotically normally distributed,

$$\frac{\bar y_{np} - \bar Y}{\hat V^{1/2}(\bar y_{np})} \to_{L} N(0, 1), \tag{27}$$
as $N \to \infty$. Proofs of these results can be found in the Appendix.

5 Simulation results

In this section, simulation experiments are carried out to illustrate the performance of the proposed method under different scenarios. The main factor of interest is the effect of the inclusion of categorical variables in both the underlying population structure and the model-assisted estimator. More comprehensive results that compare other population structures with only continuous covariates and additional estimators are in [3]. We consider three simulated populations obtained by assuming model (6) with mean functions

$$\begin{aligned} \text{Linear1:} \quad & m_1(x) = x_1^c + \beta x_1^d + \beta x_2^d \\ \text{Linear2:} \quad & m_2(x) = x_1^c + \beta x_1^d + \beta x_2^d + \beta x_1^c x_1^d + \beta x_1^c x_2^d + \beta x_1^d x_2^d \\ \text{Bump:} \quad & m_3(x) = 1 + x_1^c + 2(x_1^c - 0.5) + \exp(-200 (x_1^c - 0.5)^2) + \beta x_1^d + \beta x_2^d, \end{aligned} \tag{28}$$
where $x_1^c$ is a uniform variable on the interval $[0, 1]$, $x_1^d$ and $x_2^d$ are independent binary covariates with $P[x_t^d = 1] = 0.25$, $t = 1, 2$, and the same parameter $\beta$ is used for all the categorical variables for simplicity. The errors are generated as independent normal random variables with mean 0 and $\sigma = 0.5$. In order to investigate the effect of the inclusion of categorical covariates on the model-assisted estimator, we evaluate three possible values of the coefficient $\beta$: $\beta = 0.25$, $\beta = 0.5$ and $\beta = 1$. Populations of size 1,000 are generated according to these specifications, and samples of size $n = 75$ are selected from these populations by simple random sampling without replacement. The simulations were also performed for different sample sizes, but the results were qualitatively similar and hence are not reported here.

The proposed method is evaluated in the estimation of the population mean, for a range of fixed values of the tuning parameters $h$ and $\lambda$. The proposed estimator is computed using the Epanechnikov kernel function for the continuous variable and the Aitchison and Aitken [1] kernel in (7) for the discrete variables. We considered three fixed continuous bandwidths, $h = 0.1, 0.15, 0.2$, and, for the categorical variables, the values $\lambda = 0, 0.25, 0.5, 0.75, 1$. In addition to the estimator obtained for those fixed values of the bandwidth parameters, we also applied the survey cross-validation (CV) criterion of Opsomer and Miller [20] to choose among the same set of 15 possible pairs of $h$ and $\lambda$. Finally, we compare the nonparametric estimators empirically to the difference estimator (25), except that we set $m_j$ equal to the true regression function. This estimator is infeasible, because it requires knowledge of the unknown mean function, but it provides a useful target for comparison when the function $m$ is estimated from the sample data.
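The data-generating step described above can be sketched as follows. The function name `make_population`, the seeds and the use of NumPy are assumptions for illustration (the original calculations were carried out in R); the Bump mean function is coded exactly as printed in (28).

```python
import numpy as np

def make_population(N, beta, mean="Linear1", sigma=0.5, seed=0):
    """Generate one simulated population under model (6) with the
    mean functions listed in (28)."""
    rng = np.random.default_rng(seed)
    xc = rng.uniform(0.0, 1.0, N)      # continuous covariate on [0, 1]
    xd1 = rng.binomial(1, 0.25, N)     # binary covariates, P[x = 1] = 0.25
    xd2 = rng.binomial(1, 0.25, N)
    if mean == "Linear1":
        m = xc + beta * xd1 + beta * xd2
    elif mean == "Linear2":
        m = (xc + beta * xd1 + beta * xd2
             + beta * xc * xd1 + beta * xc * xd2 + beta * xd1 * xd2)
    elif mean == "Bump":
        m = (1.0 + xc + 2.0 * (xc - 0.5)
             + np.exp(-200.0 * (xc - 0.5) ** 2) + beta * xd1 + beta * xd2)
    else:
        raise ValueError(f"unknown mean function: {mean}")
    y = m + rng.normal(0.0, sigma, N)  # independent N(0, sigma^2) errors
    return xc, xd1, xd2, y

# One population of size 1,000 and one SRSWOR sample of size 75.
N, n = 1000, 75
xc, xd1, xd2, y = make_population(N, beta=0.5, mean="Bump", seed=42)
sample_rng = np.random.default_rng(7)
s = sample_rng.choice(N, size=n, replace=False)
pi = np.full(N, n / N)                 # first-order inclusion probabilities
```

Repeating the sampling step for a fixed population gives the replicate samples over which the performance measures are averaged.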
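We investigated the relative bias (RB) and the relative efficiency compared to the Horvitz–Thompson estimator (RE), where

$$RB(\bar y_{np}) = \frac{1}{B} \sum_{b=1}^{B} \frac{\bar y_{np}(s_b) - \bar y_N}{\bar y_N} \times 100,$$

$$RE(\bar y_{np}) = \frac{MSE(\bar y_{np})}{MSE(\bar y_{HT})} \times 100 = \frac{\sum_{b=1}^{B} \left(\bar y_{np}(s_b) - \bar y_N\right)^2}{\sum_{b=1}^{B} \left(\bar y_{HT}(s_b) - \bar y_N\right)^2} \times 100,$$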
with $B$ the number of simulation replicates and MSE the mean squared error. Simulation results are based on $B = 1{,}000$ samples drawn from a population of size $N = 1{,}000$ under simple random sampling without replacement ($\pi_j = n/N$ and $\pi_{jk} = n(n-1)/(N(N-1))$). The calculations were carried out in R. Programming details are available from the first author.

The estimator is approximately unbiased: relative biases are less than 1 % for all scenarios and bandwidth choices considered, so the exact results are not shown here for brevity. Tables 1, 2 and 3 show the RE results for the three mean functions. As expected, the nonparametric estimator is always less efficient than the (infeasible) difference estimator, showing the "price" paid for estimating the mean function. Nevertheless, all the RE values are less than 100 %, indicating that the nonparametric estimators are all more efficient than an estimator that ignores the auxiliary information altogether.

Among the fixed-bandwidth estimators, those with $\lambda = 1$ correspond to the estimator considered in [3], which does not capture the categorical covariates in the estimation of $m$. As the results in the tables show, that estimator is always dominated by at least some of the corresponding estimators with $\lambda < 1$ across all the scenarios considered, showing that the inclusion of the discrete covariates improves the efficiency of the regression estimator. At the same time, estimating the mean function separately for each value of the discrete covariates, i.e. setting $\lambda = 0$, is also inefficient, except when the effect of the discrete covariates is very large ($\beta = 1$ in our scenarios). For most scenarios considered, a value of $\lambda$ between the two extremes, corresponding to partial pooling of the categorical covariates, results in the smallest RE values, demonstrating the usefulness of the proposed approach.

When the nonparametric estimator is computed with $h$ and $\lambda$ selected by cross-validation, the tables show that in all scenarios considered its RE is very close to the smallest value among all the fixed-bandwidth estimators. This shows that cross-validation is successful in finding values of $h$ and $\lambda$ that are close to optimal, even for relatively small sample sizes like $n = 75$.

Table 1 Relative efficiency in percent (RE) of the estimators y_np and y_diff relative to the Horvitz–Thompson estimator (y_HT) for the Linear1 simulated populations

β      y_diff   h      y_np: λ=0   λ=0.25   λ=0.5    λ=0.75   λ=1      CV
0.25   71.46    0.10   86.63       84.05    82.94    83.02    84.09    80.71
                0.15   84.05       81.53    80.87    81.49    82.80
                0.20   82.05       80.09    79.75*   80.63    82.06
0.50   61.35    0.10   87.05       84.65    84.62    86.12    89.15    78.28
                0.15   81.31       79.22    79.94    82.30    85.66
                0.20   78.19       76.55*   77.56    80.33    83.92
1.00   36.04    0.10   68.49       73.25    78.84    85.95    96.38    63.40
                0.15   64.53       66.41    73.14    82.21    93.46
                0.20   63.01*      64.00    71.03    81.13    92.87

Sample size is n = 75. The most efficient nonparametric estimator among the fixed-bandwidth alternatives in each scenario is marked with an asterisk

Table 2 Relative efficiency in percent (RE) of the estimators y_np and y_diff relative to the Horvitz–Thompson estimator (y_HT) for the Linear2 simulated populations

β      y_diff   h      y_np: λ=0   λ=0.25   λ=0.5    λ=0.75   λ=1      CV
0.25   55.99    0.10   79.62       78.19    78.72    80.56    83.70    73.50
                0.15   75.91       74.30    75.08    77.28    80.40
                0.20   74.35       73.03*   73.75    75.98    79.02
0.50   38.98    0.10   69.53       70.42    73.35    78.52    86.81    64.38
                0.15   66.84       66.25    69.64    75.57    83.64
                0.20   64.74       64.16*   67.76    74.04    82.04
1.00   14.09    0.10   62.64       67.33    74.74    83.50    95.59    57.83
                0.15   59.45       61.95    70.70    81.39    93.91
                0.20   57.64*      59.70    68.86    80.26    92.87

Sample size is n = 75. The most efficient nonparametric estimator among the fixed-bandwidth alternatives in each scenario is marked with an asterisk

Table 3 Relative efficiency in percent (RE) of the estimators y_np and y_diff relative to the Horvitz–Thompson estimator (y_HT) for the Bump simulated populations

β      y_diff   h      y_np: λ=0   λ=0.25   λ=0.5    λ=0.75   λ=1      CV
0.25   24.40    0.10   31.97       28.87    28.25    27.98*   28.04    28.50
                0.15   30.59       28.66    28.13    28.00    28.09
                0.20   30.71       29.19    28.68    28.60    28.74
0.50   21.75    0.10   29.15       27.44*   27.75    28.51    29.72    27.91
                0.15   28.57       27.59    28.06    28.98    30.21
                0.20   29.16       28.64    29.10    30.07    31.32
1.00   15.90    0.10   39.06       40.64    40.64    48.71    53.82    36.80
                0.15   36.94       37.62    41.79    46.62    52.08
                0.20   36.69*      37.36    41.43    46.50    37.47

Sample size is n = 75. The most efficient nonparametric estimator among the fixed-bandwidth alternatives in each scenario is marked with an asterisk

In conclusion, the simulations indicate that the nonparametric estimator is effective in increasing the efficiency of model-assisted estimators by incorporating both continuous and discrete auxiliary variables. In addition, cross-validation appears to provide a simple and convenient way to select values of the tuning parameters for the nonparametric estimator.

Acknowledgments This work is partially supported by Ministerio de Educación y Ciencia (contract Nos. MTM2009-10055 and MTM2012-35650).
Appendix Proof of Theorem 1 We write y np − Y =
Ij Ij 1 1 . (29) (y j − m j ) −1 + ( m j − m j) 1 − N πj N πj j∈U
j∈U
Then
y − m I j j j E d y np − Y ≤ E d − 1 N πj j∈U
123
I. Sánchez-Borrego et al.
⎧ ⎨
⎡ ⎤ ⎤⎫1/2 2 ⎬ ( (1 − π −1 I ) j m j − m j )2 j ⎦ Ed ⎣ ⎦ + Ed ⎣ . ⎩ ⎭ N N ⎡
j∈U
(30)
j∈U
Under assumptions 1–6, there exists n 0 , such that n ≥ n 0 implies lemma 2(ii) in [3],
I{|x c −xic |≤h} ≥ 1,
(31)
i∈U
which holds by using the fact that λ ≤ lλ (xid − x dj ) ≤ 1. We write
\[
\limsup_{N\to\infty} \hat{t}_{jg}/N = \limsup_{N\to\infty} \frac{1}{Nh}\sum_{i\in U} K\left(\frac{x_i^c - x_j^c}{h}\right) l_\lambda(x_i^d - x_j^d)\, y_i^{p_1}\, \frac{I_i}{\pi_i} \le \limsup_{N\to\infty} \frac{C}{NhM}\sum_{i\in U} I_{\{x_j^c - h \le x_i^c \le x_j^c + h\}}, \tag{32}
\]

with $p_1 = 0, 1$ and using $\lambda \le l_\lambda(x_i^d - x_j^d) \le 1$. As $M < 1$, the same bound is valid for $t_{jg}/N$. Therefore the $N^{-1}t_{jg}$ are uniformly bounded in $j$ and the $N^{-1}\hat{t}_{jg}$ are uniformly bounded in $j$ and $s$. The $m_j$ are continuous functions of the $t_{jg}$ with denominators uniformly bounded away from 0 by (31). Analogously, the $\hat{m}_j$ are continuous functions of the uniformly bounded $\hat{t}_{jg}$, well defined by the $\delta/N^2$ adjustment in (12). Hence, the $m_j$ are uniformly bounded in $j$ and the $\hat{m}_j$ are uniformly bounded in $j$ and $s$.

Under assumptions 1–6, we also have

\[
\limsup_{N\to\infty} \frac{1}{N}\sum_{j\in U}(y_j - m_j)^2 < \infty, \tag{33}
\]
and the first term in (30) converges to 0 as $N \to \infty$, following the argument of Theorem 1 in [23]. Under assumption 5,

\[
E_d\left(\frac{\sum_{j\in U}(1 - \pi_j^{-1}I_j)^2}{N}\right) = \sum_{j\in U}\frac{\pi_j(1 - \pi_j)}{N\pi_j^2} \le \frac{1}{M}. \tag{34}
\]
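The identity in (34) depends only on the marginal inclusion probabilities, because each $I_j$ equals 1 with probability $\pi_j$ and 0 otherwise. The following minimal sketch checks it with exact rational arithmetic. The inclusion probabilities in `pis` and the helper name `second_moment` are hypothetical, chosen only for illustration, and `M` is taken to be the smallest of them.

```python
from fractions import Fraction


def second_moment(pi):
    """Exact E_d[(1 - I/pi)^2] when I is 1 with probability pi and 0 otherwise."""
    return pi * (1 - 1 / pi) ** 2 + (1 - pi) * 1


# Hypothetical inclusion probabilities (illustration only), all at least M.
pis = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(9, 10)]
M = min(pis)
N = len(pis)

# Left-hand side of (34): average of the exact second moments.
lhs = sum(second_moment(p) for p in pis) / N
# Middle expression of (34): sum of pi_j (1 - pi_j) / (N pi_j^2).
rhs = sum(p * (1 - p) / (N * p ** 2) for p in pis)

print(lhs, rhs, 1 / M)
```

Using `Fraction` rather than floating point makes the equality between the two expressions exact, so the comparison with the bound $1/M$ is not affected by rounding.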
Using $\lambda \le l_\lambda(x_i^d - x_j^d) \le 1$, Lemmas 3, 4 and 5 in [3] continue to hold. Then, using Lemma 4,

\[
\lim_{N\to\infty} \frac{1}{N}\sum_{j\in U} E_d(\hat{m}_j - m_j)^2 = 0, \tag{35}
\]
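and by (34) the second term in (30) converges to 0 as $N \to \infty$. By the Markov inequality the theorem follows.

Proof of Theorem 2 Let

\[
a_N = n^{1/2}\sum_{j\in U}\frac{y_j - m_j}{N}\left(\frac{I_j}{\pi_j} - 1\right), \qquad b_N = n^{1/2}\sum_{j\in U}\frac{m_j - \hat{m}_j}{N}\left(\frac{I_j}{\pi_j} - 1\right), \tag{36}
\]

so that

\[
n E_d(\hat{\bar{y}}_{np} - \bar{Y})^2 = E_d(a_N^2) + E_d(b_N^2) + 2E_d(a_N b_N). \tag{37}
\]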
It is immediate that $E_d(b_N^2) = o(1)$ by Lemma 5 in [3], which continues to hold because $\lambda \le l_\lambda(x_i^d - x_j^d) \le 1$. Moreover, $E_d(a_N^2)$ is bounded because $E_d(a_N) = 0$ and $\mathrm{Var}_d(a_N) = O(1)$, as $\mathrm{Var}_d(n^{-1/2}a_N) = O(1/n)$ for the Horvitz–Thompson mean estimator. Then,

\[
E_d(a_N b_N) \le \left(E_d(a_N^2)\,E_d(b_N^2)\right)^{1/2} = o(1). \tag{38}
\]

Hence,

\[
n E_d(\hat{\bar{y}}_{np} - \bar{Y})^2 = E_d(a_N^2) + o(1), \tag{39}
\]

and the theorem follows.
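The decomposition (37) and the Cauchy–Schwarz step (38) hold for any design and any pair of fits, so they can be checked by brute force on a toy population. The sketch below enumerates every simple random sample without replacement and computes the design expectations exactly. Everything in it is an assumption made for illustration: the population values, the sample size, the ratio-type fits standing in for $m_j$ and $\hat{m}_j$ (they are not the kernel smoother of the paper), and the generalized difference form assumed for the estimator.

```python
from itertools import combinations
from math import isclose, sqrt

# Toy population (made-up numbers): covariate x and study variable y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9]
N, n = len(y), 3
pi = n / N  # simple random sampling without replacement: pi_j = n / N
Y_bar = sum(y) / N

# Stand-in population fit m_j (a ratio fit, NOT the kernel smoother).
c_pop = sum(y) / sum(x)
m = [c_pop * xj for xj in x]

samples = list(combinations(range(N), n))
w = 1.0 / len(samples)  # every sample is equally likely under this design
E_a2 = E_b2 = E_ab = E_err2 = 0.0

for s in samples:
    I = [1.0 if j in s else 0.0 for j in range(N)]
    # Stand-in sample-based fit \hat m_j, recomputed for each sample.
    c_hat = sum(y[j] for j in s) / sum(x[j] for j in s)
    m_hat = [c_hat * xj for xj in x]
    # a_N and b_N as defined in (36).
    a = sqrt(n) * sum((y[j] - m[j]) / N * (I[j] / pi - 1) for j in range(N))
    b = sqrt(n) * sum((m[j] - m_hat[j]) / N * (I[j] / pi - 1) for j in range(N))
    # Assumed generalized difference form of the mean estimator.
    y_np = (sum(m_hat) + sum((y[j] - m_hat[j]) / pi for j in s)) / N
    E_a2 += w * a * a
    E_b2 += w * b * b
    E_ab += w * a * b
    E_err2 += w * n * (y_np - Y_bar) ** 2

decomposition_holds = isclose(E_err2, E_a2 + E_b2 + 2 * E_ab,
                              rel_tol=1e-9, abs_tol=1e-12)
cauchy_schwarz_holds = abs(E_ab) <= sqrt(E_a2 * E_b2) + 1e-12
print(decomposition_holds, cauchy_schwarz_holds)
```

With a population this small, $E_d(b_N^2)$ is not negligible. The $o(1)$ statements in the proof are asymptotic, and the sketch only confirms the algebraic identity and the inequality, not the rates.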
Proof of Theorems 3 and 4 Under conditions 1–6 and using again that $\lambda \le l_\lambda(x_i^d - x_j^d) \le 1$, the asymptotic normality of the generalized difference estimator follows and a consistent estimator of the asymptotic mean square error can be derived, analogously to Theorems 3 and 4 in [3].

References

1. Aitchison, J., Aitken, C.G.G.: Multivariate binary discrimination by the kernel method. Biometrika 63, 413–420 (1976)
2. Breidt, F.J., Claeskens, G., Opsomer, J.D.: Model-assisted estimation for complex surveys using penalised splines. Biometrika 92(4), 831–846 (2005)
3. Breidt, F.J., Opsomer, J.D.: Local polynomial regression estimators in survey sampling. Ann. Stat. 28, 1026–1053 (2000)
4. Cassel, C.M., Särndal, C.E., Wretman, J.H.: Foundations of Inference in Survey Sampling. Wiley, New York (1977)
5. Chen, J., Qin, J.: Empirical likelihood estimation for finite populations and the effective usage of auxiliary information. Biometrika 80, 107–116 (1993)
6. Deville, J.C., Särndal, C.E.: Calibration estimators in survey sampling. J. Am. Stat. Assoc. 87, 376–382 (1992)
7. Fan, J.: Local linear regression smoothers and their minimax efficiencies. Ann. Stat. 21, 196–216 (1993)
8. Fuller, W.A.: Regression estimation for survey samples. Surv. Methodol. 28, 5–23 (2002)
9. Hall, P.: On nonparametric multivariate binary discrimination. Biometrika 68, 287–294 (1981)
10. Hall, P., Wand, M.P.: On nonparametric discrimination using density differences. Biometrika 75, 541–547 (1988)
11. Hedayat, A., Sinha, B.: Design and Inference in Finite Population Sampling. Wiley, New York (1991)
12. Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47, 663–685 (1952)
13. Kuo, L.: Classical and prediction approaches to estimating distribution functions from survey data. In: ASA Proceedings of the Section on Survey Research Methods, pp. 280–285. American Statistical Association, Alexandria, VA (1988)
14. Li, Q., Ouyang, D., Racine, J.S.: Nonparametric with weakly dependent data: the discrete and continuous regressor case. J. Nonparametr. Stat. 21(6), 697–711 (2009)
15. Li, Q., Racine, J.S.: Nonparametric Econometrics: Theory and Practice. Princeton University Press, New Jersey (2007)
16. Li, Q., Racine, J.S.: Nonparametric estimation of conditional cdf and quantile functions with mixed categorical and continuous data. J. Bus. Econ. Stat. 26(4), 423–434 (2008)
17. Li, Q., Racine, J.S.: Smooth varying-coefficient estimation and inference for qualitative and quantitative data. Econ. Theory 26, 1607–1637 (2010)
18. Little, R.J.A.: To model or not to model? Competing modes of inference for finite population sampling. J. Am. Stat. Assoc. 99(466), 546–556 (2004)
19. Nadaraya, E.A.: On estimating regression. Theory Probab. Appl. 9, 141–142 (1964)
20. Opsomer, J.D., Miller, C.P.: Selecting the amount of smoothing in nonparametric regression estimation for complex surveys. J. Nonparametr. Stat. 17, 593–611 (2005)
21. Ouyang, D., Li, Q., Racine, J.S.: Nonparametric estimation of regression functions with discrete regressors. Econ. Theory 25(1), 1–42 (2009)
22. Racine, J., Li, Q.: Nonparametric estimation of regression functions with both categorical and continuous data. J. Econ. 119, 99–130 (2004)
23. Robinson, P.M., Särndal, C.E.: Asymptotic properties of the generalized regression estimator in probability sampling. Sankhya, Series B, vol. 45, pp. 240–248 (1983)
24. Royall, R.M.: The prediction approach to sampling theory. In: Krishnaiah, P.R., Rao, C.R. (eds.) Handbook of Statistics, vol. 6, pp. 399–413. Elsevier, Amsterdam (1988)
25. Rueda, M., Sánchez-Borrego, I.: A predictive estimator of finite population mean using nonparametric regression. Comput. Stat. 24, 1–14 (2009)
26. Särndal, C.E., Swensson, B., Wretman, J.: The weighted residual technique for estimating the variance of the general regression estimator of the finite population total. Biometrika 76, 527–537 (1989)
27. Simonoff, J.S.: Smoothing Methods in Statistics. Springer, New York (1996)
28. Titterington, D.M.: A comparative study of kernel-based density estimates for categorical data. Technometrics 22, 259–268 (1980)
29. Valliant, R., Dorfman, A.H.: Finite Population Sampling and Inference: A Prediction Approach. Wiley, New York (1999)