Logistic regression for longitudinal case-control studies

Thierry Duchesne¹
Department of Mathematics and Statistics, Laval University
[email protected]
Joint work with Radu Craiu (Statistics, Toronto) and Daniel Fortin (Biology, Laval)

Biostatistics seminar McGill University, May 1, 2006

¹ Research supported by NSERC and FQRNT

Outline

1. Introduction
   - Conditional logistic regression
   - Problem: What if several matched sets per individual/cluster?
2. Generalized estimating equations (GEE)
   - Introduction: A review of GEE
3. GEE for conditional logistic regression
   - Conditional mean and variance
   - Working correlation structures
   - Generalized estimating equations
   - Model selection: the QIC criterion
4. Application: Elk travel in Yellowstone
   - Example on elk travel in Yellowstone
5. Conclusion
   - Current/future research


Conditional logistic regression

Type of data to be analyzed
Dataset of the form (Y_{si}, x_{si}), i = 1, ..., n_s, s = 1, ..., S, where x_{si}^⊤ = (x_{si1}, ..., x_{sip}) are covariates and Y_{si} are binary (0 or 1) responses.

We suppose that ∑_{i=1}^{n_s} Y_{si} = m_s is fixed by study design in each stratum (e.g., case-control: n_s = 2, m_s = 1).

Example: cardiac arrest blood study (Arbogast & Lin, CJS, 2004), with 1 to 3 controls per case in a study of the effects of alcohol consumption.

To estimate the effects of the x_{sij}'s on the distribution of Y_{si}, we use conditional logistic regression.


Conditional logistic regression

Conditional logistic regression model (Hosmer & Lemeshow, 1989)
For each stratum/matched set s, we assume a stratum-level random effect θ_s; given θ_s, the (Y_{s1}|x_{s1}, θ_s), ..., (Y_{sn_s}|x_{sn_s}, θ_s) are conditionally independent Bernoulli with

P[Y_{si} = 1 | x_{si}, θ_s] = exp{θ_s + β^⊤ x_{si}} / (1 + exp{θ_s + β^⊤ x_{si}}),  i = 1, ..., n_s,

where β^⊤ = (β_1, ..., β_p) is the parameter of interest.


Conditional logistic regression

Distribution of Y_{s1}, ..., Y_{sn_s} given their sum
Given ∑_{i=1}^{n_s} Y_{si} = m_s (denoted "|m_s"), we have that

P[Y_{s1} = y_{s1}, ..., Y_{sn_s} = y_{sn_s} | m_s, X_s] = exp{∑_{i=1}^{n_s} β^⊤ x_{si} y_{si}} / ∑_{l=1}^{C(n_s,m_s)} exp{∑_{i=1}^{n_s} β^⊤ x_{si} v_{li}},

where ∑_{l=1}^{C(n_s,m_s)} stands for a sum over all C(n_s, m_s) vectors of size n_s consisting of m_s ones and n_s − m_s zeros, and v_{li} is the ith element of the lth such vector, v_l.

The random effect θ_s vanishes by conditioning on ∑_i Y_{si} = m_s!
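This conditional probability can be checked numerically by brute-force enumeration of the C(n_s, m_s) admissible response vectors. A minimal numpy sketch (illustrative only, not the authors' code; the toy X and β values are made up). With m_s = 1, the formula reduces to a softmax over the stratum's linear predictors, which the sketch verifies:

```python
import itertools
import numpy as np

def cond_prob(y, X, beta):
    """P[Y_s = y | m_s, X_s] for one stratum: exp{sum_i beta'x_si y_si}
    normalized over all response vectors v_l with the same number of
    ones (the stratum random effect theta_s has already cancelled)."""
    n, m = len(y), int(np.sum(y))
    num = np.exp(y @ X @ beta)
    denom = 0.0
    for ones in itertools.combinations(range(n), m):
        v = np.zeros(n)
        v[list(ones)] = 1.0
        denom += np.exp(v @ X @ beta)
    return num / denom

# toy matched set with n_s = 3, m_s = 1 and two covariates
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 0.2]])
beta = np.array([0.8, -0.3])
probs = np.array([cond_prob(np.eye(3)[k], X, beta) for k in range(3)])

# probabilities over all admissible outcomes sum to 1, and with
# m_s = 1 they reduce to a softmax of the linear predictors
eta = X @ beta
assert abs(probs.sum() - 1.0) < 1e-12
assert np.allclose(probs, np.exp(eta) / np.exp(eta).sum())
```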


Conditional logistic regression

Maximum likelihood inference
If the strata/matched sets are independent, we have that L(β) = ∏_{s=1}^{S} L^{(s)}(β), where L^{(s)}(β) is P[Y_{s1} = y_{s1}, ..., Y_{sn_s} = y_{sn_s} | m_s, X_s] from the previous slide. Hence

L(β) = ∏_{s=1}^{S} exp{∑_{i=1}^{n_s} β^⊤ x_{si} y_{si}} / ∑_{l=1}^{C(n_s,m_s)} exp{∑_{i=1}^{n_s} β^⊤ x_{si} v_{li}}

ℓ(β) = ∑_{s=1}^{S} [ ∑_{i=1}^{n_s} β^⊤ x_{si} y_{si} − ln ∑_{l=1}^{C(n_s,m_s)} exp{∑_{i=1}^{n_s} β^⊤ x_{si} v_{li}} ]

U(β) = ∑_{s=1}^{S} [ ∑_{i=1}^{n_s} x_{si} y_{si} − ( ∑_{l=1}^{C(n_s,m_s)} (∑_{i=1}^{n_s} x_{si} v_{li}) exp{∑_{i=1}^{n_s} β^⊤ x_{si} v_{li}} ) / ( ∑_{l=1}^{C(n_s,m_s)} exp{∑_{i=1}^{n_s} β^⊤ x_{si} v_{li}} ) ].
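The analytic score U(β) can be validated per stratum against a central-difference gradient of ℓ(β). A numpy sketch with made-up toy data (hypothetical values, for illustration only):

```python
import itertools
import numpy as np

def _vectors(n, m):
    """All C(n, m) binary vectors of length n with m ones."""
    V = []
    for ones in itertools.combinations(range(n), m):
        v = np.zeros(n)
        v[list(ones)] = 1.0
        V.append(v)
    return np.array(V)

def stratum_loglik(y, X, beta):
    # l_s(beta) = beta'X'y - log sum_l exp(beta'X'v_l)
    V = _vectors(len(y), int(np.sum(y)))
    return y @ X @ beta - np.log(np.exp(V @ X @ beta).sum())

def stratum_score(y, X, beta):
    # U_s(beta) = X'y - sum_l w_l X'v_l, with softmax weights w_l
    V = _vectors(len(y), int(np.sum(y)))
    w = np.exp(V @ X @ beta)
    w /= w.sum()
    return X.T @ (y - V.T @ w)

y = np.array([1.0, 0.0, 1.0, 0.0])            # n_s = 4, m_s = 2
X = np.array([[0.2, -0.5], [1.0, 0.3], [-0.4, 0.8], [0.6, -1.1]])
beta = np.array([0.4, -0.7])

# the analytic score matches the numerical gradient of l_s(beta)
eps = 1e-6
num_grad = np.array([
    (stratum_loglik(y, X, beta + eps * e) -
     stratum_loglik(y, X, beta - eps * e)) / (2 * eps)
    for e in np.eye(2)])
assert np.allclose(num_grad, stratum_score(y, X, beta), atol=1e-6)
```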


Conditional logistic regression

Similarity with Cox partial likelihood
We can fit such a model with our favorite stratified Cox regression software (e.g., coxph(), PROC PHREG)! Indeed,

L(β) = ∏_{s=1}^{S} ( ∏_{i=1}^{n_s} exp{β^⊤ x_{si}}^{y_{si}} ) / ( ∑_{l=1}^{C(n_s,m_s)} exp{∑_{i=1}^{n_s} β^⊤ x_{si} v_{li}} ),

L_Cox(β) = ∏_{s=1}^{S} ∏_{i=1}^{n_s} [ exp{β^⊤ z_i} / ∑_{q∈Q_i*} exp{β^⊤ z_q} ]^{δ_i}.

Cases: t_i = 1 and δ_i = 1; controls: t_i = 2 and δ_i = 0.

coxph(Surv(ti, di) ~ x + strata(s), method = "exact")


Problem: What if several matched sets per individual/cluster?

What if there is correlation among matched sets?
The likelihood function L(β) assumes Cov(Y_{si}, Y_{s'i'} | m_s, m_{s'}, x_{si}, x_{s'i'}) = 0 for s ≠ s', i.e., responses from different strata are uncorrelated. What can we do if this is not the case?

Elk example: each stratum corresponds to 201 possible step choices for the travel of an elk. Several strata are obtained for each elk ⇒ strata for the same animal might be "correlated".


Introduction: A review of GEE

Estimating equations

In most statistical analyses, parameter estimates are obtained by solving estimating equations.

Linear regression:
β̂ = argmin_β ∑_{i=1}^{n} (Y_i − β^⊤ x_i)²  ⇔  U(β̂) ≡ ∑_{i=1}^{n} x_i (Y_i − β̂^⊤ x_i) = 0.
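For instance, the least-squares estimating equation is just the normal equations in disguise; a quick numpy sketch on simulated toy data (values made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=50)

# solving U(beta) = sum_i x_i (Y_i - beta'x_i) = 0 is exactly the
# normal-equations linear system X'X beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

U = X.T @ (y - X @ beta_hat)   # estimating function at the solution
assert np.allclose(U, 0.0, atol=1e-8)
```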


Maximum likelihood estimation:
θ̂ = argmax_θ ∏_{i=1}^{n} L_i(θ; Y_i, x_i)  ⇔ (usually)  U(θ̂) ≡ ∑_{i=1}^{n} ∂ ln L_i(θ; Y_i, x_i)/∂θ |_{θ=θ̂} = 0.


Generalized estimating equations
Data: Y_i = (Y_{i1}, ..., Y_{in_i})^⊤, i = 1, ..., I, with Y_i ⊥ Y_{i'}. We let µ_{ij}(β) = E[Y_{ij} | x_{ij}] and g{µ_{ij}(β)} = β^⊤ x_{ij}, where g is a known link function. We choose a working correlation structure R_i(α) ≈ Corr[Y_i | X_i] and set A_i = diag(Var[Y_{ij} | x_{ij}], j = 1, ..., n_i). We estimate β by the β̂_GEE that solves

U_GEE(β̂_GEE) ≡ ∑_{i=1}^{I} D_i^⊤ V_i^{-1} {Y_i − µ_i(β̂_GEE)} = 0,

where D_i = A_i X_i and V_i = A_i^{1/2} R_i(α) A_i^{1/2}.
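Under working independence with a logit link, D_i = A_i X_i and V_i = A_i, so D_i^⊤ V_i^{-1} = X_i^⊤ and the GEE reduces to ordinary logistic-regression Fisher scoring on the pooled data. A numpy sketch on simulated clustered binary data (all values illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
I, ni, p = 40, 5, 2                      # I clusters of size ni
X = rng.normal(size=(I * ni, p))
beta_true = np.array([0.5, -1.0])
# cluster-specific random intercepts induce within-cluster correlation
b = np.repeat(rng.normal(scale=0.5, size=I), ni)
mu = 1 / (1 + np.exp(-(X @ beta_true + b)))
y = rng.binomial(1, mu).astype(float)

# Fisher scoring: beta <- beta + (X'WX)^{-1} U(beta), W = diag(mu(1-mu))
beta = np.zeros(p)
for _ in range(25):
    mu = 1 / (1 + np.exp(-(X @ beta)))
    U = X.T @ (y - mu)                   # working-independence GEE
    W = mu * (1 - mu)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), U)

# at convergence the estimating function is numerically zero
assert np.linalg.norm(X.T @ (y - 1 / (1 + np.exp(-(X @ beta))))) < 1e-6
```

The point estimate stays consistent even though the working correlation is wrong; only the naive variance is off, which is what motivates the robust sandwich variance.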


Properties of β̂_GEE
The estimator β̂_GEE that solves U_GEE(β̂_GEE) = 0 has the following properties, even if our choice of R_i(α) is not perfect: β̂_GEE ≈ N(β, Σ), and Σ is consistently estimated by the robust sandwich variance V̂_S = V̂_T Ĉ_E V̂_T, where

V̂_T = ( ∑_{i=1}^{I} D_i^⊤ V_i^{-1} D_i )^{-1} |_{α=α̂, β=β̂}  and

Ĉ_E = ∑_{i=1}^{I} D_i^⊤ V_i^{-1} {Y_i − µ_i(β)} {Y_i − µ_i(β)}^⊤ V_i^{-1} D_i |_{α=α̂, β=β̂}.
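The sandwich can be sketched for the working-independence case, where D_i^⊤ V_i^{-1} reduces to X_i^⊤ and Ĉ_E sums cluster-level score outer products. Simulated toy data; this is an illustrative numpy sketch, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
I, ni, p = 60, 4, 2
X = rng.normal(size=(I, ni, p))
b = rng.normal(scale=0.7, size=(I, 1))   # shared cluster effect
beta_true = np.array([0.8, -0.4])
mu = 1 / (1 + np.exp(-(X @ beta_true + b)))
y = rng.binomial(1, mu).astype(float)

# fit working-independence GEE: logistic Fisher scoring on pooled data
Xf, yf = X.reshape(-1, p), y.reshape(-1)
beta = np.zeros(p)
for _ in range(25):
    pr = 1 / (1 + np.exp(-(Xf @ beta)))
    beta += np.linalg.solve(Xf.T @ ((pr * (1 - pr))[:, None] * Xf),
                            Xf.T @ (yf - pr))

# sandwich V_S = V_T C_E V_T, with C_E built from per-cluster scores
mu_hat = 1 / (1 + np.exp(-(X @ beta)))
W = mu_hat * (1 - mu_hat)
VT = np.linalg.inv(sum(X[i].T @ (W[i][:, None] * X[i]) for i in range(I)))
S = np.stack([X[i].T @ (y[i] - mu_hat[i]) for i in range(I)])
V_sandwich = VT @ (S.T @ S) @ VT

assert np.allclose(V_sandwich, V_sandwich.T)
assert np.all(np.linalg.eigvalsh(V_sandwich) > 0)
```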


Conditional mean and variance

Objective
We wish to use GEE with conditional logistic regression, i.e., when

- we observe (Y_{si}^{(g)}, x_{si}^{(g)}), g = 1, ..., G (clusters), s = 1, ..., S^{(g)} (strata/matched sets), i = 1, ..., n_s^{(g)} (individual observations);
- we know before sampling the data that ∑_{i=1}^{n_s^{(g)}} Y_{si}^{(g)} = m_s^{(g)};
- we suppose that Corr*(Y_{si}^{(g)}, Y_{s'i'}^{(g')}) = 0 for g ≠ g', but that Corr*(Y_{si}^{(g)}, Y_{s'i'}^{(g)}) may not be 0.

Note: henceforth, * on E, Var, Cov or Corr will denote an operator conditional on the covariates and the stratum sums of the Y's.


Conditional mean
We will need µ_{si}^{(g)} ≡ E[Y_{si}^{(g)} | m_s^{(g)}, x_{si}^{(g)}] and µ_{si,sj}^{(g)} ≡ E[Y_{si}^{(g)} Y_{sj}^{(g)} | m_s^{(g)}, x_{si}^{(g)}, x_{sj}^{(g)}]:

Lemma (omitting sub/superscripts (g) and s)

µ_i = ( ∑_{l=1}^{C(n,m)} v_{li} exp{∑_{k=2}^{n} β^⊤ x̃_k v_{lk}} ) / ( ∑_{l=1}^{C(n,m)} exp{∑_{k=2}^{n} β^⊤ x̃_k v_{lk}} ),

µ_{i,j} = ( ∑_{l=1}^{C(n,m)} v_{li} v_{lj} exp{∑_{k=2}^{n} β^⊤ x̃_k v_{lk}} ) / ( ∑_{l=1}^{C(n,m)} exp{∑_{k=2}^{n} β^⊤ x̃_k v_{lk}} ).
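The lemma can be checked by enumeration. The sketch below uses the equivalent form that sums over all k with the raw covariates x_k rather than the transformed x̃_k from k = 2 (an assumption made for illustration; the two forms give the same weights up to normalization). The conditional means must add up to the fixed stratum total m:

```python
import itertools
import numpy as np

def cond_means(X, beta, m):
    """mu_i = E[Y_i | sum Y = m, X]: average of v_li over all C(n, m)
    admissible response vectors, weighted by softmax weights
    proportional to exp{sum_k beta' x_k v_lk}."""
    n = X.shape[0]
    V = np.array([[1.0 if k in ones else 0.0 for k in range(n)]
                  for ones in itertools.combinations(range(n), m)])
    w = np.exp(V @ X @ beta)
    w /= w.sum()
    return V.T @ w                      # mu_i for i = 1..n

X = np.array([[0.3, -1.2], [1.1, 0.4], [-0.7, 0.9], [0.2, 0.2]])
mu = cond_means(X, np.array([0.6, -0.4]), m=2)

# the conditional means respect the fixed stratum total m
assert abs(mu.sum() - 2.0) < 1e-12
assert np.all((mu > 0) & (mu < 1))
```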


Working correlation structures

Variance-covariance matrix of the Y's
From the assumptions made earlier:

Cov*(Y_{si}^{(g)}, Y_{s'i'}^{(g')}) =
- 0, if g ≠ g';
- µ_{si,si'}^{(g)} − µ_{si}^{(g)} µ_{si'}^{(g)}, if g = g', s = s';
- ρ*(Y_{si}^{(g)}, Y_{s'i'}^{(g)}) √( µ_{si}^{(g)} (1 − µ_{si}^{(g)}) µ_{s'i'}^{(g)} (1 − µ_{s'i'}^{(g)}) ), if g = g', s ≠ s',

where ρ*(Y_{si}^{(g)}, Y_{s'i'}^{(g)}) = Corr*(Y_{si}^{(g)}, Y_{s'i'}^{(g)}).


Correlation structures
If we set ρ*(Y_{si}^{(g)}, Y_{s'i'}^{(g)}) = 0, we get V^{(g)Indep} ≡ Var*[Y^{(g)}], a block-diagonal matrix with blocks B_1^{(g)}, ..., B_{S^{(g)}}^{(g)} on the diagonal and 0 elsewhere. If we then put (A^{(g)})^{1/2} = diag( (B_s^{(g)})^{1/2}, s = 1, ..., S^{(g)} ), we have

V^{(g)Indep} = (A^{(g)})^{1/2} I (A^{(g)})^{1/2}.

⇒ Replace I by R^{(g)}(α) ...


Correlation structures: a few remarks
- If the correlation is due to a cluster-specific random effect, then the working-independence GEE is actually the maximum likelihood score.
- If a correlation structure R^{(g)}(α) other than independence is used, estimating the parameter α is problematic.
- We therefore suggest using working independence, with naive variance estimates (i.e., conditional maximum likelihood) if the correlation is only due to cluster-level random effects, and robust sandwich variance estimates if the correlation scheme is likely more complex.


Generalized estimating equations
Put Y^{(g)⊤} = (Y_1^{(g)⊤}, ..., Y_{S^{(g)}}^{(g)⊤}), µ^{(g)}(β)^⊤ = (µ_1^{(g)⊤}, ..., µ_{S^{(g)}}^{(g)⊤}) and D^{(g)} = ∂µ^{(g)}(β)/∂β^⊤.

GEE for conditional logistic regression:

U(β) = ∑_{g=1}^{G} D^{(g)⊤} (V^{(g)})^{-1} {Y^{(g)} − µ^{(g)}(β)} = 0.

Classical results (asymptotic normality and consistent variance estimation with the robust sandwich variance estimator) are still valid.


Model selection: the QIC criterion

Covariate selection
1. Backward selection based on robust sandwich standard errors ⇒ a valid approach, but an approach that compares all possible sub-models would be preferable.
2. Fit all sub-models and pick the model with the best AIC (the preferred approach in biology/ecology) ⇒ not valid if working independence is not the true model.
3. As in 2, but replace AIC by a valid criterion ⇒ Pan's QIC criterion (Biometrics, 2001)?


The QIC criterion
Let Ω_I = ∑_{g=1}^{G} D^{(g)⊤} (V^{(g)Indep})^{-1} D^{(g)}, let β̂(R) be the solution of the GEE under a working correlation R, and let V̂(R) be the corresponding robust sandwich variance estimate.

QIC (Pan, Biometrics, 2001)
In the case of conditional logistic regression, the quasi-likelihood under independence criterion (QIC) is defined as

QIC = −2 Q{β̂(R)} + 2 trace{Ω_I V̂(R)},

where Q{β̂(R)} is the log-likelihood (under R = I) evaluated at β = β̂(R). We choose the model with the smallest QIC.
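The QIC computation itself is immediate once Ω_I and V̂(R) are available. Note that when V̂(R) = Ω_I^{-1} (the model-based case), the penalty reduces to 2p, mirroring AIC. The toy matrices below are made up for illustration:

```python
import numpy as np

def qic(Q_indep, omega_I, V_robust):
    """QIC = -2 Q{beta(R)} + 2 trace(Omega_I V_robust)  (Pan, 2001)."""
    return -2.0 * Q_indep + 2.0 * np.trace(omega_I @ V_robust)

# hypothetical 2-parameter fit: if V_robust = Omega_I^{-1}, the
# penalty is 2 * trace(I_2) = 2p = 4, so QIC = -2Q + 4
omega = np.array([[2.0, 0.3], [0.3, 1.5]])
penalty = 2.0 * np.trace(omega @ np.linalg.inv(omega))
assert abs(penalty - 4.0) < 1e-12
assert abs(qic(-100.0, omega, np.linalg.inv(omega)) - 204.0) < 1e-10
```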


Example on elk travel in Yellowstone

Where is Yellowstone?
[Map: Yellowstone National Park, on the Montana/Wyoming border.]


Purpose of the analysis
Objective: determine whether the trophic cascade observed in the park could be caused by the influence of wolves on elk movement patterns.


What we are trying to show
Prediction: an increase in the risk of encountering wolves decreases the probability that elk visit trembling aspen stands.


Data collection
[Figure: GPS locations of an elk recorded at 5-hour intervals (00h00, 5h00, 10h00, 15h00); each pair of consecutive locations defines a step.]
Are the steps placed randomly in the landscape?


The strata/matched sets
Step Selection Functions (Fortin et al. 2005, Ecology 86(5): 1320-1330).



Results, GEE with backward elimination and QIC

Variable             β
Drtmin               0.744
Drtmin2             -0.056
Aspenend             0.338
Forestend           -0.289
Forestprop          -0.770
Sslope              -2.189
Aspenend × Wavg3    -0.885
Forestend × Wavg3    0.313
Wavg3                0.240

[Figure: elk step selection function during winter in Yellowstone; relative probability of selecting aspen, forest and open habitat as a function of an index of wolf presence.]


Current/future research

- Simulations to check when QIC is better than backward elimination.
- Link between prospective and retrospective correlations.
- Estimation of correlation matrix parameters.
