Replicate-Based Variance Estimation in a SAS® Macro Julia L. Bienias Rush-Presbyterian-St. Luke's Medical Center, Chicago, IL
ABSTRACT
I describe a program to compute design-based variances (i.e., variances adjusted for the sample design) for linear regression models, logistic regression models, and mixed models for a 2-PSUs-per-stratum complex sampling design. The program allows the user to choose balanced repeated replication or jackknife repeated replication for computing the variances. Although there are commercial packages on the market now that will compute variances using these and other methods (e.g., SUDAAN®, WesVar®), I wanted to create a program that was fully integrated into SAS, so that the user could take full advantage of the SAS System. The program is a macro that allows the user to specify the input data set, the full-sample weight, the replicates (an auxiliary program can be used to create the replicates), the independent and dependent variables, the variance estimation method, and other key variables. The program uses the SAS procedures LOGISTIC, REG, GLM, and MIXED, and takes advantage of the new Output Delivery System. In addition to customized output, the program creates output data sets that can be used for graphing, diagnostics, etc. In my applications it has run reasonably fast.
INTRODUCTION AND BACKGROUND
More and more research organizations are using complex sampling designs to identify a sample of persons or other units of analysis. A “complex” design is defined as one in which the units selected for study are chosen with unequal probabilities of selection. (This is in contrast to a “typical” study sample in which the units are chosen with simple random sampling.) Government surveys have long used complex designs. An example of such a design is the oft-copied household survey design used by the Bureau of the Census in the household surveys they conduct for themselves and for other government agencies (e.g., in the Current Population Survey, the National Crime Victimization Survey, the National Health Interview Survey). The basic design is based on “drilling down” through nested levels of geography: First counties (or their equivalents) are selected, then smaller geographic units within counties, down to blocks of households and households themselves (U. S. Bureau of the Census, 2000).
As a consequence of the growing use of complex sampling designs, particularly in research settings in which the data are used for modeling, there is a growing need for software that produces proper estimates under the conditions of unequal probabilities of selection. As mentioned earlier, there are commercially available packages, such as SUDAAN and WesVar (see Morganstein & Brick, 1996), that meet this need. In addition, SAS' new SURVEYREG procedure uses the method of Taylor series expansion to produce design-adjusted estimates for linear regression models (see SAS Institute Inc., 2000).
In this paper I describe a SAS macro program that will produce estimates and standard errors that properly account for the sampling design for a two-PSUs-per-stratum complex design, one of the most common, for three models: linear regression, logistic regression, and random coefficient mixed models. The design is described in more detail in the next section, followed by a summary of the program.
DESIGN AND VARIANCE ESTIMATION APPROACH
Types of Estimates
There are two general types of estimates that are desired by survey researchers and survey data users. The first is simple estimates, such as estimated population means and totals and estimates of change over time in these quantities. The second is parameter estimates from models. In models (e.g., linear regression models, logistic regression models), one is interested in getting correct estimates of the association between some set of predictor variables and one or more outcome variables. The program described in this paper produces estimates of regression coefficients and variances for linear regression, logistic regression (with binary or polychotomous outcomes), and random coefficient mixed models.
Variance Estimation Paradigm
There is some debate about the circumstances under which the sampling design should be taken into account when estimating quantities of interest. For example, suppose one is interested in estimating the parameters of a model associating a set of predictor variables with an outcome variable. If the model is correct (i.e., true), the parameters can be estimated properly from unweighted data without any special adjustments made for the design (e.g., see Särndal, Swensson, & Wretman, 1992). This type of approach is generally referred to as “model-based” and there are many variations on it.
Alternatively, one can take a “design-based” approach, which is the one I am taking in this paper and in the program. This is the same approach used for estimation and variance estimation in the large government surveys, as well as by many researchers. Under this paradigm, estimates that are properly adjusted for the sample design will be unbiased for the true parameters in the sampling frame. Further, they will be unbiased for the target population to the extent that the frame is an unbiased representation of the target population with respect to the quantities that are being estimated.
Under this paradigm, it is important that the estimates account for the unequal probabilities of selection and that the variance estimation procedure reflect the design. For point estimates, this means incorporating the sample weights. Loosely speaking, the weight given to a particular unit is the number of units that unit's data “represent” in the target population. In practice, these weights might be simply the inverse of the probabilities of selection, or they might also incorporate non-response, post-stratification, or other adjustments.
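The role of the weights, and of a replicate-based variance that reflects the design, can be illustrated with a small numeric sketch. It is written in Python purely for illustration and is not the SAS macro described in this paper. The toy data, the field names (`stratum`, `psu`, `prob`, `y`), and the helper functions are all hypothetical. The jackknife convention used here (in each replicate, drop the first PSU of one stratum, double the weight of the second, and sum the squared deviations from the full-sample estimate) is one common form of the two-PSUs-per-stratum jackknife and is an assumption, not necessarily the exact rule the macro applies.

```python
# Illustrative sketch only: toy data and hypothetical names, not the SAS macro.

def weighted_mean(values, weights):
    """Design-weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

def jk2_replicate_weights(records, drop_stratum):
    """Weights for one jackknife replicate (assumed convention).

    In the chosen stratum, PSU 1 gets weight 0 and PSU 2 has its weight
    doubled; weights in all other strata are left unchanged.
    """
    out = []
    for rec in records:
        w = rec["weight"]
        if rec["stratum"] == drop_stratum:
            w = 0.0 if rec["psu"] == 1 else 2.0 * w
        out.append(w)
    return out

def jk2_variance(records):
    """Return (full-sample estimate, jackknife variance) of the weighted mean."""
    ys = [rec["y"] for rec in records]
    full = weighted_mean(ys, [rec["weight"] for rec in records])
    variance = 0.0
    # One replicate per stratum; sum squared deviations from the full estimate.
    for h in sorted({rec["stratum"] for rec in records}):
        rep = weighted_mean(ys, jk2_replicate_weights(records, h))
        variance += (rep - full) ** 2
    return full, variance

# Two strata, two PSUs each, one unit per PSU (toy values).
records = [
    {"stratum": 1, "psu": 1, "prob": 0.50, "y": 10.0},
    {"stratum": 1, "psu": 2, "prob": 0.50, "y": 14.0},
    {"stratum": 2, "psu": 1, "prob": 0.25, "y": 20.0},
    {"stratum": 2, "psu": 2, "prob": 0.25, "y": 28.0},
]
# Simplest case of a sample weight: the inverse of the selection probability.
for rec in records:
    rec["weight"] = 1.0 / rec["prob"]

full, var = jk2_variance(records)
print(full, var, var ** 0.5)
```

With these toy values the unweighted mean is 18, while the weighted mean is 20, because the units in stratum 2 each represent four population units rather than two. The same replicate loop would apply to a regression coefficient in place of the mean: refit the model once per set of replicate weights and accumulate the squared deviations of each replicate estimate from the full-sample estimate.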
The Two-PSUs-per-Stratum Design
The first stage of selection in any complex design is referred to as selecting the “primary sampling units,” or PSUs. PSUs are selected either with certainty or with some probability
Figure 2. Sample output from %repl_var for proc=mixed with method=jk2.
Random Coefficients Model for Change in YVAR Subset: Variance Estimation based on Jackknife Repeated Replication Full Sample Fit -- *** Ignore Fixed Effect Standard Errors ***
The Mixed Procedure

Model Information
Data Set                     WORK._NEWDAT
Dependent Variable           yvar
Weight Variable              fullwt
Covariance Structure         Unstructured
Subject Effect               id
Estimation Method            REML
Residual Variance Method     Profile
Fixed Effects SE Method      Model-Based
Degrees of Freedom Method    Between-Within

Dimensions
Covariance Parameters        4
Columns in X                 3
Columns in Z Per Subject     2
Subjects                     483
Max Obs Per Subject          3
Observations Used            1437
Observations Not Used        0
Total Observations           1437

Iteration History
Iteration   Evaluations   -2 Res Log Like   Criterion
0           1             3898.31593541
1           4             3118.62194555     .
2           1             2961.99269268     0.47087927
3           1             2861.62713561     0.33901490
4           1             2813.16623392     0.14424044
5           1             2797.76212210     0.02560024
6           1             2795.45757519     0.00093815
7           1             2795.38064882     0.00000144
8           1             2795.38053324     0.00000000
Convergence criteria met.

Estimated G Matrix
Row   Effect      ID Variable   Col1      Col2
1     Intercept   1             0.7181    0.03729
2     LAG         1             0.03729   0.01778
Covariance Parameter Estimates Cov Parm
Subject
Ratio
UN(1,1)
id
UN(2,1) UN(2,2) Residual
Estimate
Standard Error
Z Value
Pr Z
0.2603
0.7181
0.05341
13.45