On calibration methods for design-based finite population inferences
Giorgio E. Montanari & M. Giovanna Ranalli
Department of Statistical Sciences, University of Perugia, Via A. Pascoli, 06100 Perugia, Italy
[email protected]

1. Calibration and Model Calibration

Auxiliary information for estimating parameters of a survey variable has become fairly common: census data, administrative registers and previous surveys provide a wide and growing range of variables that can be used to increase the precision of estimators of a finite population mean or total. A common way to take advantage of known finite population means (or totals) of auxiliary variables is calibration estimation (see e.g. Wu and Sitter, 2001).

Consider a finite population $U = \{1, \ldots, N\}$. A sample $s$ of elements is drawn from $U$ according to a probabilistic sampling plan with inclusion probabilities $\pi_i$ and $\pi_{ij}$. The survey variable $y$ and $Q$ auxiliary variables $x$ are observed for each unit in the sample, hence $y_i$ and the row vector $x_i = (x_{1i}, \ldots, x_{Qi})$ are known $\forall i \in s$. The goal is to estimate the population mean of the survey variable, that is $\bar{Y} = N^{-1} \sum_U y_i$. A calibration estimator is a linear combination of the observations, $\hat{\bar{Y}}_c = N^{-1} \sum_s w_i y_i$, with weights $w_i$ chosen to minimize the distance measure $\Phi_s = \sum_s (w_i - d_i)^2 / (d_i q_i)$ from the basic design weights $d_i = 1/\pi_i$ under the constraint $N^{-1} \sum_s w_i x_i = \bar{x}$, where $\bar{x}$ is the known population mean of $x$ and the $q_i$'s are known positive constants. In effect, $\hat{\bar{Y}}_c$ implicitly assumes a linear relationship between the auxiliary variables and the survey variable.

A new approach, model calibration, has recently been proposed by Wu and Sitter (2001): it generalizes the calibration method so that generalized linear models and nonlinear parametric models can be used to obtain model-assisted estimators. Assume that the value of the vector $x$ is available for each unit in the population; hence $x_i$ is known $\forall i \in U$. Let $y_1, \ldots, y_N$ be a random sample from a superpopulation $\xi$ such that $E_\xi(y_i) = \mu(x_i, \theta)$, where $\theta$ is a vector of unknown model parameters, $\mu(\cdot)$ is a known function and $E_\xi$ denotes expectation with respect to $\xi$. The proposed model calibration estimator for $\bar{Y}$ is defined as $\hat{\bar{Y}}_{mc} = N^{-1} \sum_s w_i y_i$, with weights again sought to minimize the distance measure $\Phi_s$, but now under the constraints $\sum_s w_i = N$ and $\sum_s w_i \hat{\mu}_i = \sum_U \hat{\mu}_i$, where $\hat{\mu}_i = \mu(x_i, \hat{\theta})$ and $\hat{\theta}$ is a design-based estimator of the model parameter $\theta$. Wu (2002) shows that the resulting estimator is optimal within the class of calibration estimators, in that it has minimum expected asymptotic design variance under the superpopulation model $\xi$ and any regular sampling design with fixed sample size.

Although the class of superpopulations considered for $\hat{\bar{Y}}_{mc}$ is far richer than a linear regression model, it has been further enlarged to allow for more general models. Montanari and Ranalli (2003) introduce a nonparametric model calibration estimator for $\bar{Y}$ based on neural network learning. In particular, they consider the model

$$E_\xi(y_i) = f(x_i), \quad V_\xi(y_i) = \sigma^2 v(x_i)^2, \quad i = 1, \ldots, N; \qquad C_\xi(y_i, y_j) = 0, \quad i \neq j, \tag{1}$$

where $f(x_i)$ takes the form of a feedforward neural network with skip-layer connections and $V_\xi$ and $C_\xi$ denote variance and covariance with respect to $\xi$. Fitted values $\hat{f}_{yi}$ are obtained by means of a neural network approximator that has been modified in order to account for the sampling design. Then, a neural network model calibration estimator is defined as $\hat{\bar{Y}}_{nn} = N^{-1} \sum_s w_i y_i$, where the calibrated weights $w_i$ are sought to minimize the distance measure $\Phi_s$ under the constraints $\sum_s w_i = N$ and $\sum_s w_i \hat{f}_{yi} = \sum_U \hat{f}_{yi}$. Design consistency has been proved for $\hat{\bar{Y}}_{nn}$, and simulation studies have shown good gains in efficiency of this estimator with respect to other parametric and nonparametric estimators.
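As an illustration of how weights of this kind can be computed, the following Python sketch uses the closed-form solution of the chi-square distance minimization, $w_i = d_i (1 + q_i u_i \lambda)$, with calibration variables $u_i = [1 \ \hat{f}_{yi}]$. It is only a sketch under assumptions that are not taken from the paper: simple random sampling without replacement, $q_i = 1$, simulated data, and a design-weighted cubic polynomial fit used as a stand-in for the neural network fitted values. The function name `calibrate_weights` is hypothetical.

```python
import numpy as np

def calibrate_weights(d, u_s, totals, q=None):
    """Minimise sum_s (w_i - d_i)^2 / (d_i q_i) subject to u_s.T @ w = totals.

    d      : design weights 1/pi_i for the sampled units, shape (n,)
    u_s    : calibration variables for the sampled units, shape (n, p)
    totals : known population totals of the calibration variables, shape (p,)
    """
    d = np.asarray(d, dtype=float)
    q = np.ones_like(d) if q is None else np.asarray(q, dtype=float)
    # Lagrange multiplier: (sum_s d q u'u)^{-1} (totals - sum_s d u)
    m = u_s.T @ ((d * q)[:, None] * u_s)
    lam = np.linalg.solve(m, totals - u_s.T @ d)
    return d * (1.0 + q * (u_s @ lam))

# Simulated population (assumption: illustrative data only).
rng = np.random.default_rng(0)
N, n = 1000, 100
x = rng.uniform(0.0, 1.0, N)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, N)

# Simple random sample without replacement, so d_i = N / n.
s = rng.choice(N, size=n, replace=False)
d = np.full(n, N / n)

# Stand-in for the neural network: design-weighted cubic polynomial fit.
basis = np.vander(x, 4)
sqrt_d = np.sqrt(d)
coef = np.linalg.lstsq(basis[s] * sqrt_d[:, None], y[s] * sqrt_d, rcond=None)[0]
f_hat = basis @ coef  # fitted values for every unit in the population

# Calibrate on u_i = [1, f_hat_i]: sum_s w_i = N and sum_s w_i f_hat_i = sum_U f_hat_i.
u_pop = np.column_stack([np.ones(N), f_hat])
w = calibrate_weights(d, u_pop[s], u_pop.sum(axis=0))
y_bar_mc = (w @ y[s]) / N  # model calibration estimate of the population mean
```

By construction the resulting weights satisfy both constraints exactly, but nothing forces them to reproduce the known mean of $x$.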
However, the set of weights obtained for $\hat{\bar{Y}}_{nn}$, and generally for $\hat{\bar{Y}}_{mc}$, is not calibrated with respect to $\bar{x}$; that is, applying this set of weights to the auxiliary variables will not reproduce their known mean (Problem 1). Moreover, if more than one survey variable is of concern, a different set of weights arises for each of them (Problem 2). Overcoming these problems is usually required for internal consistency and for aligning estimates coming from different sources. In what follows a solution to these problems is proposed that maintains the asymptotic optimality of model calibration estimators shown in Wu (2002).

2. A larger class of optimal nonparametric model calibration estimators

Assume the superpopulation model in equation (1). A larger class of neural network model calibration estimators is given by $\hat{\bar{Y}}^*_{nn} = N^{-1} \sum_s w_i y_i$, where weights are sought to minimize the distance measure $\Phi_s$ under the usual constraints $\sum_s w_i = N$ and $\sum_s w_i \hat{f}_{yi} = \sum_U \hat{f}_{yi}$, and under the additional one $\sum_s w_i c_i = \sum_U c_i$, where $c_i$ is the value in unit $i$ of a vector of suitably chosen variables. Now, Problem 1 can be overcome by the choice $c_i = x_i$, while Problem 2 can be managed by letting $c_i = \hat{f}_i$, where $\hat{f}_i$ contains the fitted values of other survey variables for which model calibration is required. Both problems can be handled by letting $c_i = [x_i \ \hat{f}_i]$. By gathering all the variables employed in the minimization procedure in the row vector $u_i = [1 \ c_i \ \hat{f}_{yi}]$, the resulting estimator can be written as

$$\hat{\bar{Y}}^*_{nn} = N^{-1} \sum_s d_i y_i + N^{-1} \Big( \sum_U u_i - \sum_s d_i u_i \Big) \hat{\beta}_u, \tag{2}$$

where $\hat{\beta}_u = \big( \sum_s d_i q_i u_i' u_i \big)^{-1} \sum_s d_i q_i u_i' y_i$. Under regularity conditions as in Montanari and Ranalli (2003) on the function $f(\cdot)$ and assuming an asymptotic framework as in Wu (2002), it can be proved that $\hat{\bar{Y}}^*_{nn}$ is design consistent and approximately design unbiased. Moreover, it can be proved that $\hat{\bar{Y}}^*_{nn}$ achieves the same lower bound as $\hat{\bar{Y}}_{nn}$ for the expected design variance under model (1), that is $E_\xi(AV(\hat{\bar{Y}}^*_{nn})) = E_\xi(AV(\hat{\bar{Y}}_{nn}))$, where $AV(\cdot)$ denotes the asymptotic design variance of an estimator.

In conclusion, the set of weights $w_i$ defined for $\hat{\bar{Y}}^*_{nn}$ can be calibrated with respect to the auxiliary variables $x$ and can be the same for all survey variables without losing the property of model calibration, at least for the most important survey variables. Although asymptotic results show that no efficiency is lost by enlarging the set of variables upon which calibration is conducted, empirical evidence is needed to investigate the effects of this procedure in finite samples. We will compare the finite sample performance of different estimators by means of a simulation study.
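The regression form in equation (2) and the weighted form $N^{-1} \sum_s w_i y_i$ can be checked against each other numerically. The Python sketch below does so for the choice $c_i = x_i$ with a single auxiliary variable. It rests on assumptions that do not come from the paper: simulated data, simple random sampling without replacement, $q_i = 1$, and a design-weighted quadratic fit used in place of the neural network fitted values. All variable names are illustrative.

```python
import numpy as np

# Simulated population (assumption: illustrative data only).
rng = np.random.default_rng(1)
N, n = 2000, 200
x = rng.uniform(0.0, 1.0, N)
y = np.exp(x) + rng.normal(0.0, 0.2, N)

# Simple random sample without replacement: d_i = N / n, and q_i = 1.
s = rng.choice(N, size=n, replace=False)
d = np.full(n, N / n)
q = np.ones(n)

# Stand-in for the neural network: design-weighted quadratic fit.
# np.polyfit multiplies residuals by w, so w = sqrt(d) gives d-weighted least squares.
coef = np.polyfit(x[s], y[s], deg=2, w=np.sqrt(d))
f_hat = np.polyval(coef, x)  # fitted values for every population unit

# u_i = [1, c_i, f_hat_yi] with c_i = x_i, which addresses Problem 1.
u_pop = np.column_stack([np.ones(N), x, f_hat])
u_s = u_pop[s]
t_u = u_pop.sum(axis=0)  # population totals of u

# Regression form, equation (2).
m = u_s.T @ ((d * q)[:, None] * u_s)
beta_u = np.linalg.solve(m, u_s.T @ (d * q * y[s]))
y_bar_reg = (d @ y[s]) / N + (t_u - d @ u_s) @ beta_u / N

# Weighted form: w_i = d_i (1 + q_i u_i lambda).
lam = np.linalg.solve(m, t_u - u_s.T @ d)
w = d * (1.0 + q * (u_s @ lam))
y_bar_w = (w @ y[s]) / N

# The same weights reproduce N, the known mean of x and the total of f_hat.
x_bar_w = (w @ x[s]) / N
```

In this sketch the two forms agree up to floating-point error, and the single set of weights `w` also returns the known mean of `x`. Adding fitted values of further survey variables as extra columns of `u_pop` would extend the same weights to those variables.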
REFERENCES

Montanari, G.E. and Ranalli, M.G. (2003). Nonparametric model calibration estimation in survey sampling, Working paper 2003–01, Department of Statistical Sciences, University of Perugia, Italy.

Wu, C. (2002). Optimal calibration estimators in survey sampling, Working paper 2002–01, Department of Statistics and Actuarial Science, University of Waterloo, Canada.

Wu, C. and Sitter, R.R. (2001). A model-calibration approach to using complete auxiliary information from survey data, Journal of the American Statistical Association, 96, 185–193.

SUMMARY

This paper proposes an extended class of estimators for finite population parameters, based on neural networks and calibrated on the auxiliary variables of interest. Their asymptotic properties are studied with respect to the sampling design and the superpopulation model.