Spatial Bayesian Models for Small Area Proportions. Fernando Moura and Helio
Migon. Federal University of Rio de Janeiro, Department of Statistics.
PO Box 68530, Postcode 21945-970, Rio de Janeiro, Brazil
[email protected],
[email protected]

1. Introduction

It is well known from many studies in small area estimation that estimators based on hierarchical models perform significantly better than those that do not take advantage of between-area variation; see, for instance, Holt and Moura (1993). Recent research, mainly in epidemiology and geography, has stressed the importance of explicitly modelling the spatial structure of the data; see, for example, Bernardinelli and Montomoli (1992). Small area data can also exhibit spatial structure, and therefore an attempt to use a spatial model should be made. This paper presents a logistic hierarchical spatial model for the estimation of small area proportions. An evaluation study with real data is also presented, and comparisons with non-spatial logistic hierarchical models are made.
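2. The Logistic Spatial Hierarchical Model

The model can be concisely written as:

$$ y_{ij} \sim \mathrm{Bern}(\pi_{ij}) \tag{1} $$

$$ \mathrm{logit}(\pi_{ij}) = \beta_i^{\top} X_{ij} \tag{2} $$

$$ \beta_i \mid (\beta_k, \Sigma,\ k \sim i) \sim N\left( \bar{\beta}_i,\ n_i^{-1} \Sigma \right) \tag{3} $$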
where: $j$ and $i$ respectively index the $j$th unit within the $i$th area; $y_{ij}$ is the binary variable associated with the attribute of interest, known only for the sampled units; $X_{ij}$ is the vector of supplementary variables (including the intercept term), known for the whole population; the symbol $i \sim k$ means that the $i$th and $k$th small areas are neighbours; $\bar{\beta}_i = n_i^{-1} \sum_{k \sim i} \beta_k$; $n_i$ is the number of neighbours of the $i$th small area; and $\Sigma^{-1}$ is a diagonal matrix. The diagonal elements of $\Sigma^{-1}$ are assumed to be gamma distributed with both parameters equal to one.

This general model is called the logistic full spatial hierarchical model (HSP). A particular case of the above model is also considered, in which only the intercept term varies according to equation (3) while the other coefficients are kept constant across areas. This model is called the logistic spatial random intercept model (RSP).

The objective is to make inference about the small area proportion:
$$ \theta_i = N_i^{-1} \left( \sum_{j \notin s_i} y_{ij} + \sum_{j \in s_i} y_{ij} \right) \tag{4} $$

where $N_i$ is the population size in the $i$th small area and $s_i$ is the set of sampled units in that area. In the Bayesian framework, inference about $\theta_i$ is carried out by finding the posterior distribution $p(\theta_i \mid y(s))$, where $y(s) = \{ y_{ij} : i = 1, \dots, m;\ j \in s_i \}$ is the observed sample, under the model described above.
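As a hedged illustration of how Eqs. (1) to (4) fit together, the following Python sketch computes the neighbour average that centres the prior in Eq. (3) and turns one draw of $\beta_i$ into one draw of $\theta_i$ by simulating the unobserved responses. It is a minimal sketch, not the authors' implementation: the function names, the toy adjacency structure, the numerical values, and the assumption that posterior draws of $\beta_i$ are already available from some MCMC sampler are all illustrative.

```python
import math
import random


def inv_logit(eta):
    """Inverse of the logit link used in Eq. (2)."""
    return 1.0 / (1.0 + math.exp(-eta))


def neighbour_mean(betas, neighbours, i):
    """Average coefficient vector over the neighbours of area i (prior mean in Eq. (3))."""
    nbrs = neighbours[i]
    n_coef = len(betas[i])
    return [sum(betas[k][c] for k in nbrs) / len(nbrs) for c in range(n_coef)]


def theta_draw(beta_i, y_sampled, x_unsampled, rng):
    """One draw of theta_i in Eq. (4), given one draw of beta_i.

    y_sampled   : observed 0/1 responses of the sampled units in area i
    x_unsampled : covariate vectors (intercept included) of the unsampled units
    """
    total = sum(y_sampled)
    for x in x_unsampled:
        eta = sum(b * v for b, v in zip(beta_i, x))
        # Simulate the unobserved response from Bern(pi_ij), Eq. (1)
        total += 1 if rng.random() < inv_logit(eta) else 0
    n_pop = len(y_sampled) + len(x_unsampled)  # N_i
    return total / n_pop


# Toy illustration with made-up numbers (not the paper's data)
rng = random.Random(0)
neighbours = {0: [1, 2], 1: [0], 2: [0]}      # hypothetical adjacency
betas = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]  # hypothetical coefficient values
prior_mean_0 = neighbour_mean(betas, neighbours, 0)

# A real analysis would use a different beta_i draw per iteration;
# a single fixed vector is reused here only to keep the sketch short.
draws = [theta_draw([-1.0, 0.5], [1, 0, 0], [[1.0, 0.0], [1.0, 1.0]], rng)
         for _ in range(1000)]
posterior_mean = sum(draws) / len(draws)
```

Averaging such draws over many iterations of a sampler gives a Monte Carlo estimate of the posterior mean of $\theta_i$; the sample variance of the draws estimates its posterior variance in the same way.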
Since these posterior distributions cannot be obtained in closed form, Gibbs sampling is used to draw a large sample from them, so that posterior means and posterior variances can be estimated precisely.

3. The data set

The data set was extracted from the Brazilian Basic Education Evaluation Census carried out in Rio de Janeiro in 1995. The whole data set contains the proficiency scores in the mathematics exams taken by 15288 children. The covariates used in all the models were: the educational attainment of the child's parents (0 - no formal education; 1 - high education), sex (1 - male; 0 - female), ethnicity (1 - white; 2 - others) and age (1 - less than 15; 2 - 16 and 17; 3 - greater than 17). Thirty-four regions were taken to be the small areas. In each small area, a 10% simple random sample of children was drawn. The finite population quantities of interest $\theta_i$ are the $m = 34$ small area population proportions of children with proficiency below a certain fixed value.
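The sampling design and the target quantities described above can be sketched as follows. This is only an illustration under stated assumptions: the area labels, area sizes, scores, cutoff, seed, and the rule for rounding the sample size are hypothetical and are not taken from the census data.

```python
import random


def draw_area_samples(area_sizes, fraction=0.10, seed=0):
    """Simple random sample without replacement inside each area.

    area_sizes maps an area label to its population size N_i; the returned
    dict maps the same label to the sorted indices of the sampled units.
    """
    rng = random.Random(seed)
    samples = {}
    for area, size in area_sizes.items():
        # Assumption: round the sample size and keep at least one unit
        n = max(1, round(fraction * size))
        samples[area] = sorted(rng.sample(range(size), n))
    return samples


def proportion_below(scores, cutoff):
    """Finite population proportion of scores strictly below the cutoff."""
    return sum(1 for s in scores if s < cutoff) / len(scores)


# Hypothetical area sizes and scores, for illustration only
samples = draw_area_samples({"area_1": 200, "area_2": 50})
theta_true = proportion_below([100, 200, 300, 400], cutoff=250)
```

Because the full census is available, a quantity such as `theta_true` can be computed for every area and used as the benchmark against which model-based estimates from the 10% sample are judged.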
4. Some Results

The performance of the spatial hierarchical estimators was compared with that of their non-spatial hierarchical counterparts: the logistic random intercept estimator (RI) and the logistic full hierarchical estimator (H). The logistic regression estimator (REG) was also included in this study. Since the true proportion is known for each of the 34 regions, it was possible to calculate the relative difference, Eq. (5), between the posterior mean and the true proportion for the five estimators considered. Figure 1 below shows boxplots of these differences.

$$ RD_i = 100\, \theta_{i,\mathrm{true}}^{-1} \left( E(\theta_i \mid y(s)) - \theta_{i,\mathrm{true}} \right) \tag{5} $$
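A small Python sketch of the relative difference in Eq. (5) is given here, together with the kind of summary that a boxplot displays. The function names and the numerical values are made up for illustration and are not results from this study.

```python
import statistics


def relative_difference(post_mean, theta_true):
    """Relative difference RD_i of Eq. (5), in percent."""
    return 100.0 * (post_mean - theta_true) / theta_true


def summarise(post_means, theta_true):
    """Median and interquartile range of RD_i across areas for one estimator."""
    rds = [relative_difference(p, t) for p, t in zip(post_means, theta_true)]
    q1, q2, q3 = statistics.quantiles(rds, n=4)
    return {"median": q2, "iqr": q3 - q1}


# Made-up proportions for four areas (not results from the paper)
theta_true = [0.20, 0.25, 0.30, 0.40]
post_means = [0.22, 0.24, 0.33, 0.38]
summary = summarise(post_means, theta_true)
```

A median near zero with a small interquartile range indicates an estimator whose posterior means sit close to the true proportions in most areas.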
Figure 1. Relative differences between the posterior means and the true proportions.

Map legend (classes of posterior mean proportion): 0.14 to 0.20; 0.20 to 0.26; 0.26 to 0.32; 0.32 to 0.38; above 0.38.
Figure 2. Posterior means of the small area proportions for the RI and RSP models.

Regarding the measure of estimator performance given in (5), Figure 1 shows that the RSP estimator performs best of all, while the REG estimator performs worst. Thus the logistic regression model is not recommended. One may also notice from Figure 1 that the estimates provided by RI are more unstable than those of its spatial counterpart (RSP), which corresponds to the wider spatial variation seen in the left map.
REFERENCES

Holt, D. and Moura, F. (1993). Small area estimation using multi-level models. In Proceedings of the Section on Survey Research Methods, American Statistical Association, 1, 21-30.

Bernardinelli, L. and Montomoli, C. (1992). Empirical Bayes versus fully Bayesian analysis of geographical variation in disease risk. Statistics in Medicine, 11, 983-1007.
SUMMARY

This paper presents a hierarchical spatial logistic model for the estimation of small area proportions. An evaluation study based on real data is also presented, and some comparisons with hierarchical logistic models are made.