Survey commands in STATA
Carlo Azzarri DECRG
Sample survey: Albania 2005 LSMS
4 strata (Central, Coastal, Mountain, Tirana)
455 Primary Sampling Units (PSU)
8 households (HHs) per PSU × 455 PSUs = 3,640 HHs
svy command: general syntax
. svyset PSU [pw=popw], str(stratum)

      pweight: “popw”
          VCE: linearized
  Single unit: missing
     Strata 1: “stratum”
         SU 1: “PSU”
        FPC 1:
svy supports many estimation commands:
mean, proportion, ratio, total
cnreg, cnsreg, glm, intreg, nl, regress, tobit, treatreg, truncreg
stcox, streg
probit, logit, biprobit, cloglog, …
clogit, mlogit, mprobit, ologit, oprobit, slogit
nbreg, poisson, zip, zinb
ivregress, ivprobit, ivtobit
heckman, heckprob
Examples
average
proportion
model estimation
average
. mean TOTPCCONS [pw=popw], over(urban)

Mean estimation                    Number of obs    =    3638

        Urban: urban = Urban
        Rural: urban = Rural

--------------------------------------------------------------
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
TOTPCCONS    |
       Urban |   12094.48   203.4169      11695.66     12493.3
       Rural |   8160.521   125.9879      7913.507    8407.535
--------------------------------------------------------------
svy command: average
. svy: mean TOTPCCONS, over(urban)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       4         Number of obs    =    3638
Number of PSUs   =     455         Population size  = 3068195
                                   Design df        =     451

        Urban: urban = Urban
        Rural: urban = Rural

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
TOTPCCONS    |
       Urban |   12094.48   296.8412      11511.12    12677.84
       Rural |   8160.521   209.3881      7749.024    8572.018
--------------------------------------------------------------
svy command: average (DEFF)
. estat effects, deff srssubpop
(in Stata 10…)

        Urban: urban = Urban
        Rural: urban = Rural

------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.       DEFF
-------------+----------------------------------
TOTPCCONS    |
       Urban |   12094.48   296.8412     2.82798
       Rural |   8160.521   209.3881      3.5515
------------------------------------------------
This value means that the sampling variance of the urban mean is about 2.8 times as large as it would be if the same number of households had been selected by simple random sampling
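For reference, the design effect can be written as a variance ratio. This is the standard textbook definition, given here as background rather than taken from the slides; \hat{\theta} stands for the estimated mean, and the SRS variance is computed within each subpopulation because of the srssubpop option.

```latex
\mathrm{DEFF} = \frac{\widehat{V}_{\text{design}}(\hat{\theta})}{\widehat{V}_{\text{SRS}}(\hat{\theta})},
\qquad
\mathrm{DEFT} = \sqrt{\mathrm{DEFF}}
```

With DEFF = 2.82798 (urban) and 3.5515 (rural), DEFT is roughly 1.68 and 1.88, so the design-based standard errors are about 1.7 to 1.9 times the corresponding SRS standard errors.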
differences?
mean is the same
std. error is higher
C.I. widens
    urban C.I.
    → 11,696-12,493 (w/out sampling design)
    → 11,511-12,678 (w/ sampling design)
statistical difference between groups less likely because of overlap (not in this case)
design effect
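The two urban intervals can be reproduced by hand from the reported means and standard errors. The sketch below assumes the usual t-based interval, with design df = PSUs - strata = 451 for svy and n - 1 = 3637 when the design is ignored; the scalar names are illustrative only.

```stata
* Urban mean and standard errors copied from the two outputs above
scalar m_urban  = 12094.48
scalar se_naive = 203.4169
scalar se_svy   = 296.8412

* Weighted but design-ignoring interval: df = n - 1 = 3638 - 1
scalar t_naive  = invttail(3638 - 1, 0.025)
scalar lo_naive = m_urban - t_naive*se_naive
scalar hi_naive = m_urban + t_naive*se_naive

* Design-based interval: df = number of PSUs - number of strata = 455 - 4
scalar t_svy  = invttail(455 - 4, 0.025)
scalar lo_svy = m_urban - t_svy*se_svy
scalar hi_svy = m_urban + t_svy*se_svy

display "w/out design: " %9.2f lo_naive "  " %9.2f hi_naive
display "w/ design:    " %9.2f lo_svy   "  " %9.2f hi_svy
```

The interval widens for two reasons: the standard error is larger, and the critical value is based on 451 rather than 3637 degrees of freedom (a much smaller contribution).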
average (test)
. ttest TOTPCCONS [aw=popw], by(urban)
. reg TOTPCCONS urban [aw=popw]
(sum of wgt is 3.0682e+06)

      Source |       SS       df       MS              Number of obs =    3638
-------------+------------------------------           F(  1,  3636) =  358.24
       Model |  1.3869e+10     1  1.3869e+10           Prob > F      =  0.0000
    Residual |  1.4076e+11  3636    38713907           R-squared     =  0.0897
-------------+------------------------------           Adj R-squared =  0.0894
       Total |  1.5463e+11  3637  42516582.5           Root MSE      =  6222.1

------------------------------------------------------------------------------
   TOTPCCONS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       urban |  -3933.959   207.8452   -18.93   0.000    -4341.463   -3526.454
       _cons |   16028.44   340.3616    47.09   0.000     15361.12    16695.76
------------------------------------------------------------------------------
svy command: average (test)
. svy: mean TOTPCCONS, over(urban)
. lincom [TOTPCCONS]Urban-[TOTPCCONS]Rural

 ( 1)  [TOTPCCONS]Urban - [TOTPCCONS]Rural = 0

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   3933.959   365.0498    10.78   0.000     3216.549    4651.368
------------------------------------------------------------------------------
3,934 = 12,094 (urban) - 8,160 (rural)
model estimation (w/ dummy)
. svy: reg TOTPCCONS urban

Survey: Linear regression

Number of strata   =         4          Number of obs      =      3638
Number of PSUs     =       455          Population size    = 3068194.7
                                        Design df          =       451
                                        F(   1,    451)    =    116.13
                                        Prob > F           =    0.0000
                                        R-squared          =    0.0897

------------------------------------------------------------------------------
             |             Linearized
   TOTPCCONS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       urban |  -3933.959   365.0498   -10.78   0.000    -4651.368   -3216.549
       _cons |   16028.44   631.5923    25.38   0.000     14787.21    17269.67
------------------------------------------------------------------------------
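The regression and the lincom test give the same difference and the same standard error, with opposite signs. A small arithmetic check, using only numbers from the outputs above, shows why. It also suggests that urban is coded 1 = Urban, 2 = Rural; that coding is an inference from the coefficients and is not stated in the slides. Scalar names are illustrative.

```stata
* Coefficients copied from the svy: reg output
scalar b_urban  = -3933.959
scalar se_urban = 365.0498
scalar b_cons   = 16028.44

* With one regressor, F(1, 451) is the squared t statistic
scalar tstat = b_urban/se_urban
display "t = " %7.2f tstat "   t^2 = " %7.2f tstat^2

* Fitted means, assuming urban is coded 1 = Urban, 2 = Rural
scalar m_urban = b_cons + 1*b_urban
scalar m_rural = b_cons + 2*b_urban
display "Urban: " %9.2f m_urban "   Rural: " %9.2f m_rural
```

Under that coding, a one-unit increase in urban moves from Urban to Rural, which is why the slope is the negative of the Urban minus Rural difference reported by lincom.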
proportion
. proportion poor [pw=popw], over(urban)

Proportion estimation              Number of obs    =    3638

           no: poor = no
          yes: poor = yes

        Urban: urban = Urban
        Rural: urban = Rural

--------------------------------------------------------------
        Over | Proportion   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
no           |
       Urban |   .8881416    .009669      .8691843    .9070988
       Rural |   .7575375    .013601      .7308709    .7842041
-------------+------------------------------------------------
yes          |
       Urban |   .1118584    .009669      .0929012    .1308157
       Rural |   .2424625    .013601      .2157959    .2691291
--------------------------------------------------------------
svy command: proportion
. svy: mean poor, over(urban)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       4         Number of obs    =    3638
Number of PSUs   =     455         Population size  = 3068195
                                   Design df        =     451

        Urban: urban = Urban
        Rural: urban = Rural

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
poor         |
       Urban |   .1118584   .0108146      .0906052    .1331116
       Rural |   .2424625   .0195528      .2040366    .2808883
--------------------------------------------------------------
svy command: proportion (DEFF)
. estat effects, deff srssubpop
(in Stata 10…)

        Urban: urban = Urban
        Rural: urban = Rural

------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.       DEFF
-------------+----------------------------------
poor         |
       Urban |   .1118584   .0108146     2.35214
       Rural |   .2424625   .0195528     3.40943
------------------------------------------------
Only 1/2.35 as many observations would be needed to measure the urban poverty headcount (PHC) with the same precision if a simple random sample were used (instead of the cluster sample, whose design effect is 2.35)
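The same reasoning can be turned into a quick calculation: dividing by DEFF gives the share of the actual sample that an equally precise simple random sample would need. A minimal sketch using the two DEFF values reported above (scalar names are illustrative):

```stata
* Design effects copied from the estat effects output
scalar deff_urban = 2.35214
scalar deff_rural = 3.40943

* Share of the clustered sample an SRS would need for the same precision
scalar share_urban = 1/deff_urban
scalar share_rural = 1/deff_rural

display "urban: " %5.3f share_urban
display "rural: " %5.3f share_rural
```

So an SRS would need roughly 43% of the urban observations and 29% of the rural ones; equivalently, the effective sample size is the actual subgroup size divided by DEFF.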
proportion (test)
. ttest poor [aw=popw], by(urban)
. reg poor urban [aw=popw]
(sum of wgt is 3.0682e+06)

      Source |       SS       df       MS              Number of obs =    3638
-------------+------------------------------           F(  1,  3636) =  104.20
       Model |   15.286226     1   15.286226           Prob > F      =  0.0000
    Residual |   533.38963  3636  .146696818           R-squared     =  0.0279
-------------+------------------------------           Adj R-squared =  0.0276
       Total |  548.675856  3637   .15085946           Root MSE      =  .38301

------------------------------------------------------------------------------
        poor |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       urban |   .1306041   .0127943    10.21   0.000     .1055193    .1556888
       _cons |  -.0187456   .0209516    -0.89   0.371    -.0598237    .0223325
------------------------------------------------------------------------------
svy command: proportion (test)
. svy: mean poor, over(urban)
. lincom [poor]Urban-[poor]Rural

 ( 1)  [poor]Urban - [poor]Rural = 0

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.1306041   .0223587    -5.84   0.000    -.1745441    -.086664
------------------------------------------------------------------------------
model estimation (actual sample)
. reg TOTPCCONS TOTPCINCOME [pw=popw]

------------------------------------------------------------------------------
             |               Robust
   TOTPCCONS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 TOTPCINCOME |   .1676896   .0150931    11.11   0.000     .1380979    .1972814
       _cons |   7680.692   200.8623    38.24   0.000     7286.878    8074.506
------------------------------------------------------------------------------

. svy: reg TOTPCCONS TOTPCINCOME
(running regress on estimation sample)

------------------------------------------------------------------------------
             |             Linearized
   TOTPCCONS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 TOTPCINCOME |   .1676896   .0150648    11.13   0.000     .1380838    .1972954
       _cons |   7680.692   238.9625    32.14   0.000     7211.074     8150.31
------------------------------------------------------------------------------
model estimation (actual sample)
[Figure: Standard Errors, w/ actual sample. y-axis: S.E. of the prediction (0 to 4000); x-axis: TOTPCINCOME (0 to 300000)]
model estimation (4 times actual sample)
[Figure: Standard Errors, w/ 4 times the actual sample. y-axis: S.E. of the prediction (0 to 4000); x-axis: TOTPCINCOME (0 to 300000); series: Two-stage stratified, SRS]
model estimation (4 times actual sample)
. reg TOTPCCONS TOTPCINCOME [pw=popw]

------------------------------------------------------------------------------
             |               Robust
   TOTPCCONS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 TOTPCINCOME |   .1676896    .007545    22.23   0.000     .1529005    .1824787
       _cons |   7680.692   100.4104    76.49   0.000     7483.875    7877.509
------------------------------------------------------------------------------

. svy: reg TOTPCCONS TOTPCINCOME
(running regress on estimation sample)

------------------------------------------------------------------------------
             |             Linearized
   TOTPCCONS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 TOTPCINCOME |   .1676896   .0150648    11.13   0.000     .1380838    .1972954
       _cons |   7680.692   238.9625    32.14   0.000     7211.074     8150.31
------------------------------------------------------------------------------
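The pattern above (robust standard errors halve, linearized standard errors do not move) is what one would expect if the fourfold sample was built by replicating each household inside its own PSU, for example with expand 4. The slides do not say how the larger sample was constructed, so this is an assumption. The sketch below illustrates the mechanism on simulated data; every variable name and number in it is hypothetical and unrelated to the LSMS file.

```stata
* Hypothetical clustered data: 50 PSUs with 8 households each
clear
set seed 12345
set obs 50
gen psu = _n
* u is a shared PSU effect, so households in a PSU resemble each other
gen u = rnormal()
expand 8
gen x = rnormal() + u
gen y = 1 + 0.5*x + 2*u + rnormal()
* equal weights, for simplicity
gen w = 1
svyset psu [pw=w]

* Actual sample: weighted regression ignoring vs. respecting the design
regress y x [pw=w]
scalar se_naive1 = _se[x]
svy: regress y x
scalar se_svy1 = _se[x]

* "4 times" sample: every household replicated, same 50 PSUs
expand 4
regress y x [pw=w]
scalar se_naive4 = _se[x]
svy: regress y x
scalar se_svy4 = _se[x]

display "naive SE ratio (1x / 4x): " se_naive1/se_naive4
display "svy SE ratio   (1x / 4x): " se_svy1/se_svy4
```

The design-ignoring regression treats the copies as new independent households, so its standard error shrinks by about a factor of two; the survey estimator still sees the same PSUs and reports the same uncertainty.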
Main message
“Respondents in the same cluster are likely to be somewhat similar to one another”. As a result, in a clustered sample “selecting an additional member from the same cluster adds less new information than would a completely independent selection” (Health Survey for England: The Health of Young People '95 – 97)
Point estimates (statistics and model parameters) do not differ as long as weights are used, but standard errors do, so…
…always take the sampling design into account; otherwise inference is inaccurate or wrong