Goodness-of-Fit Tests for Logistic Regression with Complex Survey Data

Amang Sukasih, Donsig Jang, Haixia Xu
2007 Joint Statistical Meeting, Salt Lake City, UT, July 31, 2007

Overview
- Background on weighting adjustment for nonresponse using response propensity scores
- Methodology on goodness-of-fit test for response propensity model
- Simulation study to investigate model misspecification
- Conclusion

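Background: Weighting adjustment

\[ \hat{\theta}_A = \sum_{i \in R} \frac{w_i}{a_i} Y_i \]

Calculating adjustment factor:

- within weighting cell c:
\[ a_{i(c)} = \frac{\sum_{i \in R_c} w_{i(c)}}{\sum_{i \in S_c} w_{i(c)}} \]

- from a response propensity model:
\[ a_i = \hat{\pi}_i, \qquad \log\left[ \frac{\pi_i}{1 - \pi_i} \right] = X_i \beta \]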
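The weighting-cell version of the adjustment can be illustrated with a small sketch. It assumes the adjusted estimator divides each respondent's weight by its adjustment factor, which is the reading of the formulas above used here; the function names and the toy data are hypothetical and not from the presentation.

```python
from collections import defaultdict

def cell_adjustment_factors(weights, responded, cells):
    """Per-cell factor: weighted respondents / weighted sample in the cell."""
    resp_total = defaultdict(float)
    samp_total = defaultdict(float)
    for w, r, c in zip(weights, responded, cells):
        samp_total[c] += w
        if r:
            resp_total[c] += w
    return {c: resp_total[c] / samp_total[c] for c in samp_total}

def adjusted_total(y, weights, responded, factors):
    """Sum over respondents of (w_i / a_i) * y_i."""
    return sum(w / a * yi
               for yi, w, r, a in zip(y, weights, responded, factors) if r)

# Toy data, made up for illustration: two cells with equal base weights.
weights = [10.0] * 6
responded = [1, 1, 0, 1, 0, 0]
cells = ["a", "a", "a", "b", "b", "b"]
y = [2.0, 4.0, 0.0, 6.0, 0.0, 0.0]

a_by_cell = cell_adjustment_factors(weights, responded, cells)
a_i = [a_by_cell[c] for c in cells]
theta_hat = adjusted_total(y, weights, responded, a_i)
print(a_by_cell, theta_hat)
```

Cells with lower weighted response rates get larger adjusted weights, so the single respondent in cell "b" stands in for the two nonrespondents there.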

Background: Weighting adjustment for nonresponse using response propensity model
- individual estimate of propensity score as adjustment factor
- weighting cells based on estimated propensity scores

Both techniques depend on estimates of propensity scores.
Estimated propensity scores depend on goodness-of-fit of model.
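Forming weighting cells from estimated propensity scores can be sketched as a rank-based grouping. The choice of five equal-size cells is an assumption made for illustration (the slides do not state how many cells were used), and `propensity_cells` is a hypothetical helper name.

```python
def propensity_cells(p_hat, n_cells=5):
    """Assign each unit to one of n_cells equal-size groups by score rank."""
    order = sorted(range(len(p_hat)), key=lambda i: p_hat[i])
    cells = [0] * len(p_hat)
    for rank, i in enumerate(order):
        # Integer arithmetic maps ranks 0..n-1 onto cells 0..n_cells-1.
        cells[i] = rank * n_cells // len(p_hat)
    return cells

# Made-up propensity scores for ten sampled units.
scores = [0.91, 0.15, 0.42, 0.77, 0.30, 0.58, 0.66, 0.08, 0.49, 0.83]
cells = propensity_cells(scores, n_cells=5)
print(cells)
```

Units with similar estimated propensities land in the same cell, so a poorly fitting model changes cell membership and, through it, the adjustment factors.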

Background: Goodness-of-fit test
- Hosmer-Lemeshow (H-L) test
  - for simple random samples, available in SAS (unweighted)
  - for complex samples, available in SUDAAN and STATA (design-based, different in rejection regions)
- Effect of model misspecification on
  - goodness-of-fit test
  - distribution of propensity scores
  - weighting cells

Methodology: Goodness-of-fit test
- H0: model fit is good
- H1: model fit is not good

Test for simple random sample:

\[ HL = \sum_{c=1}^{G} \frac{(O_c - N_c \pi_c)^2}{N_c \pi_c (1 - \pi_c)} \sim \chi^2_{G-2} \]
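As an illustration of the simple-random-sample statistic, the sketch below groups units by rank of estimated propensity and sums the squared standardized differences between observed and expected respondent counts. It assumes that the groups are equal-size rank groups and that the group-level probability is the mean estimated propensity in the group; the function name and toy data are hypothetical.

```python
def hosmer_lemeshow(responded, p_hat, groups=10):
    """Return (HL statistic, degrees of freedom G - 2), unweighted."""
    order = sorted(range(len(p_hat)), key=lambda i: p_hat[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue  # skip empty groups when n < groups
        n_c = len(idx)                              # N_c
        o_c = sum(responded[i] for i in idx)        # O_c
        pi_c = sum(p_hat[i] for i in idx) / n_c     # mean propensity in group
        stat += (o_c - n_c * pi_c) ** 2 / (n_c * pi_c * (1.0 - pi_c))
    return stat, groups - 2

# Made-up data: 12 units in 4 groups of 3.
p_hat = [0.2] * 3 + [0.4] * 3 + [0.6] * 3 + [0.8] * 3
responded = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1]
stat, df = hosmer_lemeshow(responded, p_hat, groups=4)
print(stat, df)
```

The statistic would then be compared with a chi-square distribution on `df` degrees of freedom; that step is left out to keep the sketch free of external libraries.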

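Methodology: Test for complex sample

\[ \hat{\theta} = (O_1 - E_1, \ldots, O_G - E_G) \]

\[ O_c - E_c = \frac{\sum_{i \in S_c} w_i (R_i - \hat{\pi}_i)}{\sum_{i \in S_c} w_i} \quad \text{(STATA)} \]

\[ O_c - E_c = \sum_{i \in S_c} w_i (R_i - \hat{\pi}_i) \quad \text{(SUDAAN)} \]

\[ Q = \hat{\theta}' \left[ \mathrm{var}(\hat{\theta}) \right]^{-1} \hat{\theta} \]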
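The components of the residual vector can be sketched in both forms that appear on this slide: a weighted sum of residuals per group, and the same sum divided by the group's total weight. The sketch deliberately does not say which package uses which form, and it stops short of the quadratic form Q, which needs a design-based estimate of the covariance matrix. The function name and toy data are hypothetical.

```python
def residual_components(weights, responded, p_hat, cells):
    """Per-group weighted residual totals and weighted mean residuals."""
    totals, weight_sums = {}, {}
    for w, r, p, c in zip(weights, responded, p_hat, cells):
        totals[c] = totals.get(c, 0.0) + w * (r - p)     # sum of w_i (R_i - pi_i)
        weight_sums[c] = weight_sums.get(c, 0.0) + w     # sum of w_i
    means = {c: totals[c] / weight_sums[c] for c in totals}
    return totals, means

# Made-up data: two groups with different sampling weights.
weights = [2.0, 2.0, 4.0, 4.0]
responded = [1, 0, 1, 1]
p_hat = [0.5, 0.5, 0.75, 0.75]
cells = [1, 1, 2, 2]

totals, means = residual_components(weights, responded, p_hat, cells)
print(totals, means)
```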

Methodology: Rejection regions

SUDAAN Chi-square

\[ Q \sim \chi^2_{G-2} \]

SUDAAN Satterthwaite Adjusted F G * G /(1+ a 2 ) Q ∼ FK − G + 2 , where K = #PSUs − #strata 2 2 λ (1 + a ) SUDAAN Wald F

Q ∼ FKG −1 G

STATA Wald F

K −G +2 Q ∼ FKG− G + 2 GK

Simulation Study: Goals
- to understand available GOF tests for complex survey data
- to understand effect of model misspecification on GOF test
- to understand effect of model misspecification on weighting adjustment

Simulation Study: Data generation
- 5 covariates (categorical): X, Y, Z, U, V
- X, Y, U, V are independent; Z is highly correlated with Y but not with others

Covariate   Value            Probability
X           1, 2, 3          .5, .3, .2
Y           1, 2, 3, 4       .25, .25, .25, .25
Z           1, 2, 3, 4, 5    Y + Bernoulli(.5)
U           1, 2             .6, .4
V           1, 2, 3          .65, .25, .1

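Simulation Study: Data generation (continued)

Regression coefficients:

\[ \log\left[ \frac{\pi}{1 - \pi} \right] = \beta_0 + \beta_1 X + \beta_2 Y + \beta_3 Z + \beta_4 U + \beta_5 V \]

\[ \beta_0 = -6, \quad \beta_1 = 2.5, \quad \beta_2 = 0.05, \quad \beta_3 = 0, \quad \beta_4 = 1.5, \quad \beta_5 = -1.5 \]

Response propensity and respondent indicator:

\[ \pi = \text{response propensity} = \frac{e^{X\beta}}{1 + e^{X\beta}} \]

\[ R = \text{respondent indicator} \sim \mathrm{Bernoulli}(\pi) \]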

Simulation Study: Data generation (continued)
- Sampling strata (S): Corr(S, X) is high
- Sampling weight (W) depends on X
- Sample size = 50,000; replicates = 1,000

X    1     2     3
W    66    79    406
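The data-generation steps on these slides can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the coefficients are applied to the numeric category codes, the stratum variable S is omitted because the slides give only its correlation with X, and the sample here is far smaller than the 50,000 units used in the study. The names `draw_unit` and `simulate` are hypothetical.

```python
import math
import random

# Covariate distributions and sampling weights as listed on the slides.
CATEGORICAL = {
    "X": ([1, 2, 3], [0.5, 0.3, 0.2]),
    "Y": ([1, 2, 3, 4], [0.25, 0.25, 0.25, 0.25]),
    "U": ([1, 2], [0.6, 0.4]),
    "V": ([1, 2, 3], [0.65, 0.25, 0.1]),
}
WEIGHT_BY_X = {1: 66, 2: 79, 3: 406}
# Coefficients from the slide, applied to numeric codes (an assumption).
BETA = {"const": -6.0, "X": 2.5, "Y": 0.05, "Z": 0.0, "U": 1.5, "V": -1.5}

def draw_unit(rng):
    """Draw covariates, response propensity, respondent indicator, weight."""
    unit = {name: rng.choices(vals, weights=probs)[0]
            for name, (vals, probs) in CATEGORICAL.items()}
    unit["Z"] = unit["Y"] + (1 if rng.random() < 0.5 else 0)  # Y + Bernoulli(.5)
    logit = BETA["const"] + sum(BETA[k] * unit[k] for k in "XYZUV")
    unit["pi"] = 1.0 / (1.0 + math.exp(-logit))               # inverse logit
    unit["R"] = 1 if rng.random() < unit["pi"] else 0         # Bernoulli(pi)
    unit["W"] = WEIGHT_BY_X[unit["X"]]
    return unit

def simulate(n, seed=0):
    rng = random.Random(seed)
    return [draw_unit(rng) for _ in range(n)]

sample = simulate(1000, seed=2007)
print(sum(u["R"] for u in sample) / len(sample))  # unweighted response rate
```

Because Z is built from Y plus an independent coin flip, it carries no information about response beyond Y, which is what makes adding Z an overspecification rather than a correction.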

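Simulation Study: Model misspecification

true model:

\[ \log\left( \frac{\pi}{1 - \pi} \right) = \beta_0 + \beta_1 X + \beta_2 Y + \beta_4 U + \beta_5 V \]

overspecified model:

\[ \log\left( \frac{\pi}{1 - \pi} \right) = \beta_0 + \beta_1 X + \beta_2 Y + \beta_3 Z + \beta_4 U + \beta_5 V \]

underspecified model:

\[ \log\left( \frac{\pi}{1 - \pi} \right) = \beta_0 + \beta_2 Y + \beta_4 U + \beta_5 V \]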

Simulation Study: Evaluation
- estimates of regression coefficients
- goodness-of-fit test's p-values
- response propensity score distribution and weighting cells construction

Software: SAS, SUDAAN, STATA

Simulation Study: Means of Regression Coefficients

(-- = term not in the fitted model)

True model

Coefficient   True BETA   SAS        SUDAAN     STATA
Beta_0        -3.5        -3.50163   -3.50112   -3.50101
X=2           1           0.9995     0.99938    0.99938
X=3           2           2.00074    2.00052    2.00052
Y=2           0.5         0.50164    0.50003    0.50003
Y=3           1           1.00052    0.99913    0.99913
Y=4           1.5         1.49982    1.49933    1.49933
Z=2           0           --         --         --
Z=3           0           --         --         --
Z=4           0           --         --         --
Z=5           0           --         --         --
U=2           2.5         2.50145    2.50128    2.50128
V=2           -1.5        -1.50051   -1.49863   -1.49863
V=3           -3          -3.00051   -2.9975    -2.99746

Overspecified model

Coefficient   True BETA   SAS        SUDAAN     STATA
Beta_0        -3.5        -3.50178   -3.50137   -3.50137
X=2           1           0.99964    0.99963    0.99963
X=3           2           2.00097    2.00095    2.00095
Y=2           0.5         0.50749    0.51145    0.51145
Y=3           1           1.00899    1.01524    1.01524
Y=4           1.5         1.50556    1.51638    1.51638
Z=2           0           -0.00591   -0.01161   -0.01161
Z=3           0           -0.00877   -0.01639   -0.01639
Z=4           0           -0.00563   -0.01697   -0.01697
Z=5           0           -0.00963   -0.02295   -0.02295
U=2           2.5         2.50173    2.50185    2.50185
V=2           -1.5        -1.50068   -1.49895   -1.49895
V=3           -3          -3.00082   -2.99822   -2.99822

Underspecified model

Coefficient   True BETA   SAS        SUDAAN     STATA
Beta_0        -3.5        -2.5162    -1.90706   -1.90706
X=2           1           --         --         --
X=3           2           --         --         --
Y=2           0.5         0.44943    0.44756    0.44756
Y=3           1           0.90045    0.8982     0.8982
Y=4           1.5         1.35128    1.34807    1.34807
Z=2           0           --         --         --
Z=3           0           --         --         --
Z=4           0           --         --         --
Z=5           0           --         --         --
U=2           2.5         2.23122    2.22006    2.22006
V=2           -1.5        -1.34454   -1.33848   -1.33848
V=3           -3          -2.70838   -2.70305   -2.70305

Simulation Study: Percentiles for p-values of True Model

[Figure: percentiles of p-values (vertical axis) against percentiles (horizontal axis, 0 to 100), with the reference line Y = X. Series: SAS, SUDAAN_Chi, SUDAAN_Sath, SUDAAN_Wald, STATA.]

Simulation Study: Percentiles for p-values of Overspecified Model

[Figure: percentiles of p-values (vertical axis) against percentiles (horizontal axis, 0 to 100), with the reference line Y = X. Series: SAS, SUDAAN_Chi, SUDAAN_Sath, SUDAAN_Wald, STATA.]

Simulation Study: Percentiles for p-values of Underspecified Model

[Figure: percentiles of p-values (vertical axis, 0 to 100) against percentiles (horizontal axis, 0 to 100), with the reference line Y = X. Series: SAS, SUDAAN_Chi, SUDAAN_Sath, SUDAAN_Wald, STATA.]

Simulation Study: Means of Propensity Score Percentiles

[Figure: mean of propensity scores (vertical axis, 0 to 1) at the min, p10 through p90, and max (horizontal axis). Series: SAS, SUDAAN, and STATA, each under the true, overspecified, and underspecified models.]

Conclusion
- Simple random sample H-L and STATA GOF tests overstate the true p-value
- Chi-square rejection region works poorly for GOF test for complex survey data
- Underspecified model can be detected by low p-value in the GOF test
- Minor model overspecification does not affect distribution of propensity scores, nor the weighting adjustment factors
