Replication-Based Variance Estimation Methods for Survey Data

SAS Global Forum 2008

Statistics and Data Analysis

Paper 367-2008

Try, Try Again: Replication-Based Variance Estimation Methods for Survey Data Analysis in SAS® 9.2

Pushpal K Mukhopadhyay, Anthony B. An, Randall D. Tobias, and Donna L. Watts
SAS Institute Inc., Cary, NC

ABSTRACT

Complex survey samples are constructed with selection schemes that affect the usual random sampling assumptions, so SAS/STAT® software provides specialized procedures to analyze them: SURVEYMEANS, SURVEYFREQ, SURVEYREG, and SURVEYLOGISTIC for means, frequencies, regression, and logistic analysis, respectively. These procedures all use the Taylor series expansion method for variance estimation, which is usually considered to be the "gold standard" when it is practical to compute. However, replication methods are also widely used in practice for variance estimation. Replication methods, such as the jackknife and balanced repeated replication (BRR), replace complex algebra with simple repeated analysis. They enable you to analyze the data without the original sample design information, which protects survey confidentiality, and they ease the task of estimating variances for nonlinear quantities.

With the release of SAS 9.2, the SAS/STAT survey analysis procedures now also implement replication methods. These include standard approaches such as the jackknife and BRR as well as customized replication methods that employ user-supplied replicate weights. This paper discusses replication methods, comparing them to the Taylor series expansion method with respect to both technical characteristics and practical utility. This paper also discusses other significant enhancements to the survey design and analysis procedures in SAS 9.2.

INTRODUCTION

Sample surveys provide information about a finite population by observing only a fraction of the population. To provide statistically valid inference, samples are typically selected through randomization techniques. Statistical agencies such as the U.S. Census Bureau, Bureau of Labor Statistics, Statistics Canada, and Statistics Sweden conduct surveys to collect information about social and economic conditions of people, households, businesses, and industries. Surveys of natural resources and opinion polls are also common. Most of these surveys collect data through complex designs that include stratification and clustering, requiring special techniques for analysis; see Lohr (1999), Särndal, Swensson, and Wretman (1992), and Chambers and Skinner (2003). SAS/STAT software provides specialized procedures to analyze survey data: SURVEYMEANS, SURVEYFREQ, SURVEYREG, and SURVEYLOGISTIC for means, frequencies, regression, and logistic analysis, respectively.

The most commonly used variance estimation method for survey data is the Taylor series expansion method. This method obtains a linear approximation of an estimator by using a Taylor series expansion. The precision of the linearized statistic is then estimated by using standard survey variance estimation methods. Taylor series expansion is often considered to be the “gold standard” for survey variance estimation, but it can be complicated to derive for some estimators that are nonlinear functions of means. It also requires you to specify variables that contain strata and cluster information (for a stratified multistage design). Strata or cluster information might not be available because of data confidentiality restrictions, so this identification requirement can be a major limitation for the Taylor series expansion method. An alternative that addresses both the complexity issue and the identification requirement is replication-based variance estimation.
A replicate sample is a subsample of the original sample. An estimate of a quantity of interest is obtained for each replicate sample. The variability of the estimated quantity among the replicate samples is then used as a replication-based estimator of variance. In order to obtain a statistically justified estimator of variance, each replicate sample should be drawn by following some specific resampling scheme. Currently SAS/STAT survey analysis procedures support the two most widely used replication variance estimation methods, the jackknife and balanced repeated replication (BRR). The SURVEYMEANS, SURVEYFREQ, SURVEYREG, and SURVEYLOGISTIC procedures now support the Taylor series expansion, jackknife, and BRR variance estimation, by using the PROC statement options VARMETHOD = TAYLOR, VARMETHOD=JACKKNIFE, and VARMETHOD=BRR, respectively. This paper illustrates the application of different variance estimation methods by using data collected through a complex national survey.


THE JACKKNIFE VARIANCE ESTIMATION METHOD

The jackknife variance estimation method in SAS is available for any survey design. For simplicity, consider a clustered stratified sample design where the first-stage clusters, or primary sampling units (PSUs), are selected by using a simple random sample with replacement. Assume $n_h$ sample PSUs are selected from $N_h$ population PSUs in stratum $h$, where $h = 1, 2, \ldots, H$. Let $\theta$ be a finite population quantity of interest and $\hat{\theta}$ be a sample-based estimator of $\theta$. The delete-1 jackknife method deletes one PSU at a time and adjusts the full sample weight for the other clusters in that stratum, repeating the process for each stratum independently. The adjusted observation weights in each replicate sample are called replicate weights. Let $\hat{\theta}_r$ denote the estimate of $\theta$ obtained from the $r$th replicate weights. Then the jackknife variance estimator of $\hat{\theta}$ is

$$\hat{V}(\hat{\theta}) = \sum_{r=1}^{R} \alpha_r \, (\hat{\theta}_r - \hat{\theta})^2 \tag{1}$$

where $\alpha_r = n_h^{-1}(n_h - 1)$, with $n_h$ taken for the stratum from which the $r$th replicate deletes a PSU, and $R$ is the total number of replicates. The quantity $\alpha_r$ is also called the jackknife coefficient. In this example, the total number of replicates is the same as the total number of clusters in the full sample. See Wolter (1985), Rust (1985), and Shao and Tu (1995) for details. Unless specified otherwise, the term ‘jackknife’ method denotes the delete-1 jackknife method throughout this paper.

THE BALANCED REPEATED REPLICATION METHOD

Another replication-based variance estimation method is balanced repeated replication (BRR). The most common form of the BRR method is suitable for sample designs that have a large number of strata with two PSUs in each stratum, where the PSUs are selected with replacement. A replicate sample (also known as a half sample) is obtained by deleting one PSU per stratum and doubling the original weight of the remaining PSU. To satisfy certain balance conditions, the PSUs are deleted according to a corresponding Hadamard matrix. See Wolter (1985) for an introduction to and construction of Hadamard matrices. The BRR variance estimator of a full sample estimator $\hat{\theta}$ is given by

$$\hat{V}(\hat{\theta}) = \frac{1}{R} \sum_{r=1}^{R} (\hat{\theta}_r - \hat{\theta})^2 \tag{2}$$

where $\hat{\theta}_r$ is an estimator of $\theta$ using the $r$th balanced half sample and $R$ is the total number of replicates. See Wolter (1985), Rust (1985), and Shao and Tu (1995) for details.

In many situations, especially for nonlinear estimators, one or more replicate estimators $\hat{\theta}_r$ might be undefined even though the full sample estimator $\hat{\theta}$ is defined. Fay’s BRR method adjusts the original weight by a coefficient $\varepsilon$ ($0 \le \varepsilon < 1$) so that the replicate estimators are defined for all replicate samples. This method is similar to the traditional BRR method, but instead of deleting one PSU per stratum, it multiplies the original weight of that PSU by the coefficient $\varepsilon$. The original weight for the remaining PSU in that stratum is multiplied by $2 - \varepsilon$. Fay’s BRR variance estimator of $\hat{\theta}$ is computed as

$$\hat{V}(\hat{\theta}) = \frac{1}{R (1 - \varepsilon)^2} \sum_{r=1}^{R} (\hat{\theta}_r - \hat{\theta})^2 \tag{3}$$

where $0 \le \varepsilon < 1$. See Dippo, Fay, and Morganstein (1984), Fay (1989), Judkins (1990), and Rao and Shao (1999) for more information. Note that when $\varepsilon = 0$, Fay’s BRR method becomes the traditional BRR method.

SYNTAX FOR REPLICATION METHODS

The new syntax for specifying replication-based variance estimation consists of the VARMETHOD= option in the PROC statement:

    VARMETHOD=BRR
    VARMETHOD=JACKKNIFE | JK
    VARMETHOD=TAYLOR

and the REPWEIGHTS statement for specifying user-defined replication weights:

    REPWEIGHTS variables;

These options and statement are available in all four survey analysis procedures. The VARMETHOD=BRR and VARMETHOD=JACKKNIFE options and the REPWEIGHTS statement have further sub-options, whose complete details are described in the following chapters of the SAS/STAT User’s Guide: “The SURVEYMEANS Procedure,” “The SURVEYFREQ Procedure,” “The SURVEYREG Procedure,” and “The SURVEYLOGISTIC Procedure.”
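Before turning to the survey data, equations (1) to (3) can be checked numerically on a toy design. The following Python sketch is not part of SAS and does not reproduce how the procedures are implemented; it assumes a made-up design with three strata and two PSUs per stratum, the usual inflation factor n_h/(n_h - 1) for the PSUs that remain after a deletion, and three columns of a 4x4 Hadamard matrix for the half samples. The function names and the PSU totals are hypothetical.

```python
# Toy design: 3 strata, 2 PSUs per stratum. Each PSU is summarized by its
# weighted total (sum of weight * y over its observations). Numbers are made up.
psu_totals = {
    1: [120.0, 95.0],
    2: [210.0, 260.0],
    3: [80.0, 70.0],
}


def estimate_total(factors):
    """Weighted total when the weight of PSU (h, i) is multiplied by factors[h][i]."""
    return sum(f * t
               for h, totals in psu_totals.items()
               for f, t in zip(factors[h], totals))


# Full-sample estimate: every weight factor equals 1.
full = estimate_total({h: [1.0] * len(t) for h, t in psu_totals.items()})


def jackknife_variance():
    """Delete-1 jackknife, equation (1), with alpha_r = (n_h - 1) / n_h."""
    variance = 0.0
    for h, totals in psu_totals.items():
        n_h = len(totals)
        for deleted in range(n_h):
            factors = {k: [1.0] * len(t) for k, t in psu_totals.items()}
            # The deleted PSU gets weight 0; the others in stratum h are
            # inflated by n_h / (n_h - 1) (assumed standard adjustment).
            factors[h] = [0.0 if i == deleted else n_h / (n_h - 1)
                          for i in range(n_h)]
            alpha_r = (n_h - 1) / n_h
            variance += alpha_r * (estimate_total(factors) - full) ** 2
    return variance


# Rows are replicates, columns are strata: columns 2 to 4 of a 4x4 Hadamard matrix.
HADAMARD = [
    [1, 1, 1],
    [-1, 1, -1],
    [1, -1, -1],
    [-1, -1, 1],
]


def fay_brr_variance(eps=0.0):
    """Fay's BRR, equation (3); eps = 0 gives traditional BRR, equation (2)."""
    strata = sorted(psu_totals)
    squared = 0.0
    for row in HADAMARD:
        factors = {}
        for h, sign in zip(strata, row):
            # +1 up-weights the first PSU (2 - eps) and down-weights the second (eps).
            factors[h] = [2.0 - eps, eps] if sign == 1 else [eps, 2.0 - eps]
        squared += (estimate_total(factors) - full) ** 2
    return squared / (len(HADAMARD) * (1.0 - eps) ** 2)


print(jackknife_variance(), fay_brr_variance(0.0), fay_brr_variance(0.5))
```

For a linear estimator such as this weighted total, all three estimators evaluate to the same value (3225 for these numbers), which is the sum over strata of the squared difference between the two PSU totals. For nonlinear estimators the methods generally give slightly different answers.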


THE MEPS SURVEY

The Medical Expenditure Panel Survey (MEPS) is a nationally representative survey of the U.S. civilian noninstitutionalized population, conducted annually by the U.S. Department of Health and Human Services. The main objectives are to determine the cost of specific health services and the quality of health insurance available to U.S. workers. The household component (HC) of the MEPS collects demographic characteristics, health conditions, health insurance coverage, and health care expenditure information through household interviews. Data are collected by using a sample of families and individuals through a multistage overlapping panel design. The primary sampling units are defined by geographic locations such as counties, small groups of counties, or metropolitan statistical areas. Within a PSU, area segments and permit area segments are used as second-stage units. See the Web site at http://www.meps.ahrq.gov/mepsweb/about_meps/survey_back.jsp for a detailed description of the survey design.

The 1999 full-year consolidated data file contains 24,618 individuals who are divided into 143 strata and 460 PSUs. The strata and PSU information are useful for variance estimation purposes. The 1999 full-year consolidated data file HC-038 (MEPS HC-038, 2002) from the MEPS is used to illustrate different variance estimation methods that are available in the SURVEYMEANS, SURVEYFREQ, SURVEYREG, and SURVEYLOGISTIC procedures. The data can be downloaded directly from the Agency for Healthcare Research and Quality (AHRQ) Web site at http://www.meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-038 in either ASCII format or SAS transport format.
For the examples used in this paper, the analysis variables are the following individual level items for 1999:

    expenditure: total health care expenditure
    insuranceType: type of insurance coverage
    poverty: poverty category
    totalIncome: total income

In addition, the following demographic variables are used as covariates and for classifications:

    sex: gender
    age99x: age
    region99: census region of residence

Finally, the following variables are used for design specifications:

    varianceStrata: strata identification
    variancePSU: PSU identification
    personWeight: person level weights

The following SAS statements create a data set for illustration. The input data set meps.h38 has been downloaded from the MEPS Web site. Note that this data set is used only to demonstrate different variance estimation methods available in SAS/STAT 9.2. Neither the data set nor the SAS statements should be used for inferential purposes.


    libname meps '';

    data exampledata;
       set meps.h38;

       /* Fall back to age42x, then age31x, when age99x is coded -1 */
       age = age99x;
       if age99x = -1 then do;
          if age42x = -1 then age = age31x;
          else age = age42x;
       end;

       /* Same fallback for census region */
       region = region99;
       if region99 = -1 then do;
          if region42 = -1 then region = region31;
          else region = region42;
       end;

       /* RENAME takes effect only in the output data set, so programming
          statements in this step must use the original name totexp99 */
       totexpnonzero = totexp99 + 1;
       logExpenditure = log(totexpnonzero);

       if inscov99 = 3 then insured = 'NO ';
       else insured = 'YES';

       rename totexp99 = expenditure
              inscov99 = insuranceType
              povcat99 = poverty
              ttlp99x  = totalIncome
              varstr99 = varianceStrata
              varpsu99 = variancePSU
              perwt99f = personWeight;
    run;

THE JACKKNIFE VARIANCE ESTIMATION METHOD SYNTAX

Use the VARMETHOD = JACKKNIFE | JK < method-options > option in the PROC statement to request the jackknife variance estimation method.

You can specify the following method-options in parentheses after the VARMETHOD=JACKKNIFE option:

OUTJKCOEFS=SAS-data-set
    names a SAS data set to store the jackknife coefficients.

OUTWEIGHTS=SAS-data-set
    names a SAS data set to store the replicate weights that the procedure creates for jackknife variance estimation. The OUTWEIGHTS= method-option is not available when you provide replicate weights with a REPWEIGHTS statement.

You can specify the optional STRATA or CLUSTER statement with the VARMETHOD = JACKKNIFE option. The only requirement is at least two PSUs/observations per stratum for a stratified design or at least two PSUs/observations in the data set. The JACKKNIFE syntax is identical for all procedures. For details, see the following chapters of the SAS/STAT User’s Guide: “The SURVEYMEANS Procedure,” “The SURVEYFREQ Procedure,” “The SURVEYREG Procedure,” and “The SURVEYLOGISTIC Procedure.”

ESTIMATES FOR POPULATION MEANS

Suppose you want to estimate the mean of total health care expenditure of a person for the 1999 population. You can use the SURVEYMEANS procedure with expenditure as the analysis variable. The following SAS statements use the Taylor series expansion method to estimate the variance of the estimated mean. The STRATA statement specifies the stratification variable, the CLUSTER statement specifies the PSU identification, and the WEIGHT statement specifies the individual level survey weights. The VARMETHOD = TAYLOR option in the PROC statement requests the Taylor series expansion method.

    proc surveymeans data = exampledata varmethod = taylor;
       strata varianceStrata;
       cluster variancePSU;
       weight personWeight;
       var expenditure;
    run;


Figure 1 displays the data summary and estimated values produced by PROC SURVEYMEANS. 1053 observations with nonpositive weights are not used for the subsequent analysis. The estimated mean health care expenditure using 23565 individuals is 2156.47 with a standard error of 62.72. The 95% confidence interval for the mean expenditure is (2033.06, 2279.88).

Figure 1 The Taylor Series Expansion Method for the SURVEYMEANS Procedure

The SURVEYMEANS Procedure

Data Summary

Number of Strata                                 143
Number of Clusters                               460
Number of Observations                         24618
Number of Observations Used                    23565
Number of Obs with Nonpositive Weights          1053
Sum of Weights                             276410767

Statistics

                                                                Std Error
Variable     Label                         N          Mean        of Mean        95% CL for Mean
-----------------------------------------------------------------------------------------------------
expenditure  TOTAL HEALTH CARE EXP 99  23565   2156.468447      62.723462   2033.06156   2279.87533
-----------------------------------------------------------------------------------------------------

To estimate the variance of the estimated mean by using the jackknife variance estimation method, you simply need to specify VARMETHOD = JACKKNIFE as a PROC statement option. The following SAS statements obtain an estimate of the variance by using the jackknife variance estimation method.

    proc surveymeans data = exampledata varmethod = jackknife;
       strata varianceStrata;
       cluster variancePSU;
       weight personWeight;
       var expenditure;
    run;

There are a total of 460 PSUs. The jackknife variance estimation method creates 460 replicate samples by deleting one PSU at a time. Since most surveys contain a large number of PSUs, it is computationally efficient to save the replicate weights and the jackknife coefficients in SAS data sets and use the saved values for subsequent analyses. The replicate weights and the jackknife coefficients can be saved in SAS data sets by using the OUTWEIGHTS = and the OUTJKCOEFS = method-options for VARMETHOD = JACKKNIFE. The following SAS statements use the jackknife variance estimation method and save replicate weights and jackknife coefficients in SAS data sets jkrepweights and jkcoefficients, respectively. The data set jkrepweights contains all the variables in the data set exampledata, in addition to the replicate weight variables named RepWt_1 to RepWt_460.

    proc surveymeans data = exampledata
                     varmethod = jackknife (outweights = jkrepweights
                                            outjkcoefs = jkcoefficients);
       strata varianceStrata;
       cluster variancePSU;
       weight personWeight;
       var expenditure;
    run;

Figure 2 displays the variance estimation method and Figure 3 displays the estimated values produced by PROC SURVEYMEANS. There are a total of 460 replicates generated by the procedure. The estimated mean health care expenditure is 2156.47 with a standard error of 62.74. The 95% confidence interval for the mean expenditure is (2033.03, 2279.90).


Figure 2 The Jackknife Variance Estimation Method for the SURVEYMEANS Procedure

The SURVEYMEANS Procedure

Variance Estimation

Method                   Jackknife
Number of Replicates           460

Figure 3 The Jackknife Variance Estimation Method for the SURVEYMEANS Procedure, Estimated Values

Statistics

                                                                Std Error
Variable     Label                         N          Mean        of Mean        95% CL for Mean
-----------------------------------------------------------------------------------------------------
expenditure  TOTAL HEALTH CARE EXP 99  23565   2156.468447      62.737868   2033.03322   2279.90367
-----------------------------------------------------------------------------------------------------
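As a quick consistency check on the reported output, the following Python sketch backs out the critical value implied by the jackknife confidence limits and compares the two standard errors. It uses only numbers printed in Figures 1 to 3. The degrees of freedom are not stated in this section; the sketch assumes they equal the number of clusters minus the number of strata, and that assumption is labeled in the code.

```python
# Values copied from Figures 1 to 3.
n_clusters, n_strata = 460, 143
mean = 2156.468447
se_taylor = 62.723462
se_jackknife = 62.737868
lower, upper = 2033.03322, 2279.90367  # jackknife 95% confidence limits

# The interval is symmetric about the mean, so its midpoint should match.
midpoint = (lower + upper) / 2

# Half-width divided by the standard error gives the implied critical value.
half_width = (upper - lower) / 2
multiplier = half_width / se_jackknife

# Assumption (not stated in the text): df = clusters - strata.
df_assumed = n_clusters - n_strata

# Relative difference between the jackknife and Taylor standard errors.
rel_diff = (se_jackknife - se_taylor) / se_taylor

print(f"midpoint of limits  : {midpoint:.4f}")
print(f"implied multiplier  : {multiplier:.4f}")
print(f"assumed df          : {df_assumed}")
print(f"relative SE change  : {100 * rel_diff:.3f}%")
```

The implied multiplier is slightly larger than the normal value 1.96, which is what a t distribution with a few hundred degrees of freedom would give, and the two standard errors differ by only a few hundredths of one percent for this estimator.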

ESTIMATES FOR A CONTINGENCY TABLE

Suppose you want to estimate the percentage of individuals within each category of health insurance coverage and the percentage of individuals within each cross-classification of poverty categories and insurance categories for the 1999 population. You can use the SURVEYFREQ procedure with insuranceType and poverty as analysis variables. The following SAS statements request a two-way table for insurance coverage by poverty categories and use the jackknife variance estimation method to calculate standard errors.

    proc surveyfreq data = jkrepweights varmethod = jackknife;
       weight personWeight;
       repweights RepWt_1-RepWt_460 / jkcoefs = jkcoefficients;
       tables insuranceType*poverty;
    run;

The data set jkrepweights obtained from the previous PROC SURVEYMEANS statements contains all the variables in the data set exampledata, in addition to the replicate weight variables RepWt_1 to RepWt_460. The REPWEIGHTS statement specifies the names of the replicate weight variables. The REPWEIGHTS statement option JKCOEFS specifies the SAS data set that contains the jackknife coefficient for each replicate. In this particular example, the SURVEYFREQ procedure does not generate the replicate weights and the jackknife coefficients. The input replicate weights and jackknife coefficients contain all the necessary information for variance estimation, so when they are specified, the strata and the cluster identifications are not required. The VARMETHOD = JACKKNIFE option in the PROC statement requests the jackknife variance estimation method. Figure 4 displays the two-way table of insuranceType and poverty. Standard errors of estimated quantities are calculated by using the jackknife method.


Figure 4 The Jackknife Variance Estimation Method for the SURVEYFREQ Procedure

The SURVEYFREQ Procedure

Table of insuranceType by poverty

                                     Weighted  Std Dev of            Std Err of
insuranceType  poverty  Frequency   Frequency    Wgt Freq   Percent     Percent
-------------------------------------------------------------------------------
1              1              936     9689015      624759    3.5053      0.2066
               2              498     5032074      464129    1.8205      0.1649
               3             2040    22442179     1129017    8.1191      0.3302
               4             5857    69835693     2816150   25.2652      0.6174
               5             6799    97405036     3721701   35.2392      0.7943
               Total        16130   204403997     6554617   73.9494      0.6889
-------------------------------------------------------------------------------
2              1             1920    16708348     1034325    6.0448      0.3676
               2              487     4581804      420803    1.6576      0.1487
               3              844     9183636      574957    3.3225      0.2125
               4              672     7758261      486381    2.8068      0.1815
               5              318     3577523      284380    1.2943      0.0994
               Total         4241    41809572     1559405   15.1259      0.5543
-------------------------------------------------------------------------------
3              1              806     6397775      500629    2.3146      0.1714
               2              338     2708118      270952    0.9797      0.0950
               3              773     7302179      503540    2.6418      0.1716
               4              877     8587833      575319    3.1069      0.1734
               5              400     5201293      380308    1.8817      0.1284
               Total         3194    30197198     1319234   10.9248      0.3587
-------------------------------------------------------------------------------
Total          1             3662    32795137     1527870   11.8646      0.5000
               2             1323    12321996      774817    4.4579      0.2656
               3             3657    38927994     1536394   14.0834      0.4435
               4             7406    86181788     3269175   31.1789      0.6907
               5             7517   106183852     3946498   38.4152      0.8241
               Total        23565   276410767     7790639   100.000
-------------------------------------------------------------------------------
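The Percent column in Figure 4 is the weighted frequency of a cell expressed as a share of the total weighted frequency. The following Python sketch, which uses only numbers copied from the table, recomputes the row totals and row percentages as an illustration. The variable and function names are hypothetical, and the check says nothing about how the standard errors are obtained.

```python
# Weighted frequencies from Figure 4, keyed by insuranceType; each list is
# ordered by poverty category 1 to 5.
weighted_freq = {
    1: [9689015, 5032074, 22442179, 69835693, 97405036],
    2: [16708348, 4581804, 9183636, 7758261, 3577523],
    3: [6397775, 2708118, 7302179, 8587833, 5201293],
}

# Row totals and percentages as printed in Figure 4.
reported_row_total = {1: 204403997, 2: 41809572, 3: 30197198}
reported_row_percent = {1: 73.9494, 2: 15.1259, 3: 10.9248}

# Sum of all cells; this equals the sum of weights of the observations used.
grand_total = sum(sum(cells) for cells in weighted_freq.values())


def percent(weighted_count):
    """Share of the total weighted frequency, in percent."""
    return 100.0 * weighted_count / grand_total


row_total = {k: sum(cells) for k, cells in weighted_freq.items()}
row_percent = {k: percent(total) for k, total in row_total.items()}

for k in sorted(row_total):
    print(k, row_total[k], f"{row_percent[k]:.4f}")
print("grand total:", grand_total)
```

The cell counts add up to the printed row totals, and the grand total matches the sum of weights reported in the data summary of Figure 1.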

ESTIMATES FOR REGRESSION COEFFICIENTS

Suppose you want to estimate the regression coefficients for the 1999 population when the natural log of total health care expenditure (logExpenditure) is regressed on age (age) and gender (sex). You can use the SURVEYREG procedure with logExpenditure as the dependent variable, as shown in the following statements.

    proc surveyreg data = exampledata varmethod = taylor;
       strata varianceStrata;
       cluster variancePSU;
       weight personWeight;
       class sex;
       model logExpenditure = age sex / solution;
    run;

The STRATA statement specifies the stratum identification, the CLUSTER statement specifies the PSU identification, the WEIGHT statement specifies individual level survey weights, and the CLASS statement specifies the classification variable sex. The dependent variable and the independent variables are specified in the MODEL statement. The SOLUTION option in the MODEL statement displays the parameter estimates. The standard errors of the estimated regression coefficients are calculated by using the Taylor series expansion method. The VARMETHOD = TAYLOR option in the PROC statement requests the Taylor series expansion method. Figure 5 displays the parameter estimates along with their standard errors produced by PROC SURVEYREG.


Figure 5 The Taylor Series Expansion Method for the SURVEYREG Procedure

The SURVEYREG Procedure

Regression Analysis for Dependent Variable logExpenditure

Estimated Regression Coefficients

                               Standard
Parameter       Estimate          Error    t Value    Pr > |t|
Intercept      4.5501064     0.05728676      79.43
age            0.0376353     0.00096827      38.87
SEX 1         -0.7890617     0.03962576     -19.91
SEX 2          0.0000000     0.00000000        .
