Logistic regression for longitudinal case-control studies
Thierry Duchesne, Department of Mathematics and Statistics, Laval University ([email protected])
Joint work with Radu Craiu (Statistics, Toronto) and Daniel Fortin (Biology, Laval)
Biostatistics seminar, McGill University, May 1, 2006
Research supported by NSERC and FQRNT
Outline
1. Introduction: conditional logistic regression; problem: what if several matched sets per individual/cluster?
2. Generalized estimating equations (GEE): a review
3. GEE for conditional logistic regression: conditional mean and variance; working correlation structures; generalized estimating equations; model selection: the QIC criterion
4. Application: example on elk travel in Yellowstone
5. Conclusion: current/future research
Conditional logistic regression

Type of data to be analyzed
Dataset of the form (Y_{si}, x_{si}), i = 1, ..., n_s, s = 1, ..., S, where x_{si}^T = (x_{si1}, ..., x_{sip}) are covariates and the Y_{si} are binary (0 or 1) responses.
We suppose that \sum_{i=1}^{n_s} Y_{si} = m_s is fixed by study design in each stratum (e.g., case-control: n_s = 2, m_s = 1).
Example: cardiac arrest blood study (Arbogast & Lin, CJS, 2004), with 1 to 3 controls per case in a study of the effects of alcohol consumption.
To estimate the effects of the x_{sij} on the distribution of Y_{si}, we use conditional logistic regression.
Conditional logistic regression

Conditional logistic regression model (Hosmer & Lemeshow, 1989)
For each stratum/matched set s, we assume a stratum-level random effect \theta_s; (Y_{s1} | x_{s1}, \theta_s), ..., (Y_{sn_s} | x_{sn_s}, \theta_s) are conditionally independent (given \theta_s) Bernoulli with

P[Y_{si} = 1 | x_{si}, \theta_s] = \frac{\exp\{\theta_s + \beta^T x_{si}\}}{1 + \exp\{\theta_s + \beta^T x_{si}\}}, \quad i = 1, ..., n_s,

where \beta^T = (\beta_1, ..., \beta_p) is the parameter of interest.
Conditional logistic regression

Distribution of Y_{s1}, ..., Y_{sn_s} given their sum
Given \sum_{i=1}^{n_s} Y_{si} = m_s (denoted "| m_s"), we have

P[Y_{s1} = y_{s1}, ..., Y_{sn_s} = y_{sn_s} | m_s, X_s] = \frac{\exp\{\sum_{i=1}^{n_s} \beta^T x_{si} y_{si}\}}{\sum_{l=1}^{\binom{n_s}{m_s}} \exp\{\sum_{i=1}^{n_s} \beta^T x_{si} v_{li}\}},

where \sum_{l=1}^{\binom{n_s}{m_s}} stands for a sum over all vectors of size n_s consisting of m_s ones and n_s - m_s zeros, and v_{li} is the i-th element of the l-th such vector, v_l.
The random effect \theta_s vanishes by conditioning on \sum_i Y_{si} = m_s!
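The conditional probability above can be computed directly by enumerating the \binom{n_s}{m_s} assignment vectors. A minimal numerical sketch (simulated data; the function name is hypothetical):

```python
# Compute P[Y = y | sum(Y) = m, X] for one stratum by enumerating all
# binomial(n, m) vectors v_l with m ones, exactly as in the formula above.
from itertools import combinations

import numpy as np


def conditional_prob(y, X, beta):
    """P[Y = y | sum(Y) = m, X] under conditional logistic regression."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, m = len(y), int(y.sum())
    denom = 0.0
    for ones in combinations(range(n), m):
        v = np.zeros(n)
        v[list(ones)] = 1.0
        denom += np.exp(np.sum((X @ beta) * v))
    num = np.exp(np.sum((X @ beta) * y))
    return num / denom


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))          # one stratum: n_s = 4 choices, p = 2 covariates
beta = np.array([0.5, -1.0])
p = conditional_prob([1, 0, 1, 0], X, beta)   # one outcome with m_s = 2

# Sanity check: the probabilities of all vectors with the given sum add to 1.
total = sum(conditional_prob(np.eye(4)[list(ones)].sum(axis=0), X, beta)
            for ones in combinations(range(4), 2))
```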
Conditional logistic regression

Maximum likelihood inference
If the strata/matched sets are independent, we have L(\beta) = \prod_{s=1}^S L^{(s)}(\beta), where L^{(s)}(\beta) is P[Y_{s1} = y_{s1}, ..., Y_{sn_s} = y_{sn_s} | m_s, X_s] from the previous page. Hence

L(\beta) = \prod_{s=1}^S \frac{\exp\{\sum_{i=1}^{n_s} \beta^T x_{si} y_{si}\}}{\sum_{l=1}^{\binom{n_s}{m_s}} \exp\{\sum_{i=1}^{n_s} \beta^T x_{si} v_{li}\}},

l(\beta) = \sum_{s=1}^S \left( \sum_{i=1}^{n_s} \beta^T x_{si} y_{si} - \ln \sum_{l=1}^{\binom{n_s}{m_s}} \exp\{\sum_{i=1}^{n_s} \beta^T x_{si} v_{li}\} \right),

U(\beta) = \sum_{s=1}^S \left( \sum_{i=1}^{n_s} x_{si} y_{si} - \frac{\sum_{l=1}^{\binom{n_s}{m_s}} \left( \sum_{i=1}^{n_s} v_{li} x_{si} \right) \exp\{\sum_{i=1}^{n_s} \beta^T x_{si} v_{li}\}}{\sum_{l=1}^{\binom{n_s}{m_s}} \exp\{\sum_{i=1}^{n_s} \beta^T x_{si} v_{li}\}} \right).
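A numerical sanity check (not from the talk; all data simulated) that the score U(\beta) above is indeed the gradient of the conditional log-likelihood l(\beta):

```python
# Verify U(beta) = grad l(beta) by comparing the analytic score (written as
# stratum-wise x-weighted residuals around the conditional mean) with a
# central finite-difference gradient of the log-likelihood.
from itertools import combinations

import numpy as np


def _patterns(n, m):
    """All 0/1 vectors of length n with exactly m ones."""
    for ones in combinations(range(n), m):
        v = np.zeros(n)
        v[list(ones)] = 1.0
        yield v


def loglik(beta, strata):
    out = 0.0
    for y, X in strata:
        eta = X @ beta
        m = int(y.sum())
        out += eta @ y - np.log(sum(np.exp(eta @ v) for v in _patterns(len(y), m)))
    return out


def score(beta, strata):
    out = np.zeros_like(beta)
    for y, X in strata:
        eta = X @ beta
        m = int(y.sum())
        vs = np.array(list(_patterns(len(y), m)))
        ws = np.exp(vs @ eta)
        mu = (ws[:, None] * vs).sum(axis=0) / ws.sum()   # E[Y | m, X]
        out += X.T @ (y - mu)
    return out


rng = np.random.default_rng(1)
strata = [(np.array([1.0, 0.0, 0.0]), rng.normal(size=(3, 2))) for _ in range(5)]
beta = np.array([0.3, -0.7])

eps = 1e-6
num_grad = np.array([
    (loglik(beta + eps * e, strata) - loglik(beta - eps * e, strata)) / (2 * eps)
    for e in np.eye(2)])
analytic = score(beta, strata)
```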
Conditional logistic regression

Similarity with Cox partial likelihood
We can fit such a model with our favorite stratified Cox regression software (e.g., coxph(), PROC PHREG)!

L(\beta) = \prod_{s=1}^S \frac{\exp\{\sum_{i=1}^{n_s} \beta^T x_{si} y_{si}\}}{\sum_{l=1}^{\binom{n_s}{m_s}} \exp\{\sum_{i=1}^{n_s} \beta^T x_{si} v_{li}\}},

L_{Cox}(\beta) = \prod_{s=1}^S \prod_{i=1}^{n_s} \left( \frac{\exp\{\beta^T z_i\}}{\sum_{q \in Q_i^*} \exp\{\beta^T z_q\}} \right)^{\delta_i}.

Cases: t_i = 1 and \delta_i = 1; controls: t_i = 2 and \delta_i = 0.
coxph(Surv(ti, di) ~ x + strata(s), method = "exact")
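A useful special case (not stated on the slide, but standard): with 1:1 matching (n_s = 2, m_s = 1), the conditional likelihood contribution reduces to an intercept-free logistic regression on the within-pair covariate difference. A quick numerical check with hypothetical function names and simulated covariates:

```python
# With one case and one control per stratum, the stratum-level conditional
# probability equals a logistic function of the covariate difference.
import numpy as np


def pair_prob_conditional(x_case, x_control, beta):
    """P[Y_1 = 1, Y_2 = 0 | Y_1 + Y_2 = 1] from the stratum-level formula."""
    a, b = np.exp(beta @ x_case), np.exp(beta @ x_control)
    return a / (a + b)


def pair_prob_difference(x_case, x_control, beta):
    """The same probability, written as expit(beta'(x_case - x_control))."""
    return 1.0 / (1.0 + np.exp(-beta @ (x_case - x_control)))


rng = np.random.default_rng(2)
beta = np.array([0.8, -0.3])
x1, x2 = rng.normal(size=2), rng.normal(size=2)
p_cond = pair_prob_conditional(x1, x2, beta)
p_diff = pair_prob_difference(x1, x2, beta)
```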
Problem: what if several matched sets per individual/cluster?

What if there is correlation among matched sets?
The likelihood function L(\beta) assumes Cov(Y_{si}, Y_{s'i'} | m_s, m_{s'}, x_{si}, x_{s'i'}) = 0 for s \neq s', i.e., responses from different strata are uncorrelated. What can we do if this is not the case?
Elk example: each stratum corresponds to 201 possible step choices for the travel of an elk. Several strata are obtained for each elk, so strata for the same animal might be "correlated".
Introduction: A review of GEE

Estimating equations
In most statistical analyses, parameter estimates are obtained by solving estimating equations.

Linear regression:

\hat{\beta} = \arg\min_\beta \sum_{i=1}^n (Y_i - \beta^T x_i)^2 \iff U(\hat{\beta}) \equiv \sum_{i=1}^n x_i (Y_i - \hat{\beta}^T x_i) = 0.
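Since the least-squares estimating equation above is linear in \beta, it is solved directly by the normal equations (X^T X)\beta = X^T Y. A small sketch on simulated data:

```python
# Solve U(beta) = sum_i x_i (Y_i - beta'x_i) = 0 via the normal equations
# and confirm that the estimating equation vanishes at the solution.
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # root of U(beta)
U = X.T @ (Y - X @ beta_hat)                   # estimating equation at beta_hat
```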
Maximum likelihood estimation:

\hat{\theta} = \arg\max_\theta \prod_{i=1}^n L_i(\theta; Y_i, x_i) \iff \text{(usually)} \quad U(\hat{\theta}) \equiv \sum_{i=1}^n \left. \frac{\partial}{\partial \theta} \ln L_i(\theta; Y_i, x_i) \right|_{\theta = \hat{\theta}} = 0.
Generalized estimating equations
Data Y_i = (Y_{i1}, ..., Y_{in_i})^T, i = 1, ..., I, with Y_i \perp Y_{i'}. We let \mu_{ij}(\beta) = E[Y_{ij} | x_{ij}] and g\{\mu_{ij}(\beta)\} = \beta^T x_{ij}, where g is a known link function. We choose a working correlation structure R_i(\alpha) \approx Corr[Y_i | X_i] and set A_i = diag(Var[Y_{ij} | x_{ij}], j = 1, ..., n_i). We estimate \beta by the \hat{\beta}_{GEE} that solves

U_{GEE}(\hat{\beta}_{GEE}) \equiv \sum_{i=1}^I D_i^T V_i^{-1} \{Y_i - \mu_i(\hat{\beta}_{GEE})\} = 0,

where D_i = A_i X_i and V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2}.
Properties of \hat{\beta}_{GEE}
The estimator \hat{\beta}_{GEE} that solves U_{GEE}(\hat{\beta}_{GEE}) = 0 has the following properties, even if our choice of R_i(\alpha) is not perfect:
- \hat{\beta}_{GEE} \approx N(\beta, \Sigma);
- \Sigma is consistently estimated by the robust sandwich variance \hat{V}_S = \hat{V}_T \hat{C}_E \hat{V}_T, where

\hat{V}_T = \left( \sum_{i=1}^I D_i^T V_i^{-1} D_i \right)^{-1} \bigg|_{\alpha = \hat{\alpha}, \beta = \hat{\beta}} \quad and \quad \hat{C}_E = \sum_{i=1}^I D_i^T V_i^{-1} \{Y_i - \mu_i(\beta)\} \{Y_i - \mu_i(\beta)\}^T V_i^{-1} D_i \bigg|_{\alpha = \hat{\alpha}, \beta = \hat{\beta}}.
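A minimal sketch (simulated clusters, working independence, true \beta plugged in instead of \hat{\beta} for brevity) of assembling the sandwich V_S = V_T C_E V_T for a logistic GEE. Under working independence, V_i = A_i and D_i = A_i X_i, so D_i^T V_i^{-1} = X_i^T A_i and D_i^T V_i^{-1} (Y_i - \mu_i) = X_i^T (Y_i - \mu_i):

```python
# Robust sandwich variance for a logistic-regression GEE under working
# independence: "bread" = sum_i D_i'V_i^{-1}D_i, "meat" = the empirical
# outer-product term C_E, V_S = bread^{-1} meat bread^{-1}.
import numpy as np

rng = np.random.default_rng(4)
I, n_i, p = 40, 5, 2
beta = np.array([0.5, -0.5])
clusters = []
for _ in range(I):
    X = rng.normal(size=(n_i, p))
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    clusters.append((rng.binomial(1, mu).astype(float), X))

bread = np.zeros((p, p))
meat = np.zeros((p, p))
for y, X in clusters:
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    A = np.diag(mu * (1 - mu))
    bread += X.T @ A @ X          # D_i' V_i^{-1} D_i = X_i' A_i X_i
    g = X.T @ (y - mu)            # D_i' V_i^{-1} (y_i - mu_i)
    meat += np.outer(g, g)

V_T = np.linalg.inv(bread)
V_S = V_T @ meat @ V_T            # robust sandwich variance estimate
```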
Conditional mean and variance

Objective
We wish to use GEE with conditional logistic regression, i.e., when
- we observe (Y_{si}^{(g)}, x_{si}^{(g)}), g = 1, ..., G (clusters), s = 1, ..., S^{(g)} (strata/matched sets), i = 1, ..., n_s^{(g)} (individual observations);
- we know before sampling the data that \sum_{i=1}^{n_s^{(g)}} Y_{si}^{(g)} = m_s^{(g)};
- we suppose that Corr^*(Y_{si}^{(g)}, Y_{s'i'}^{(g')}) = 0 for g \neq g', but that Corr^*(Y_{si}^{(g)}, Y_{s'i'}^{(g)}) may not be 0.
Note: henceforth, a * on E, Var, Cov or Corr denotes an operator conditional on the covariates and the stratum sums of the Y's.
Conditional mean
We will need \mu_{si}^{(g)} \equiv E[Y_{si}^{(g)} | m_s^{(g)}, x_{si}^{(g)}] and \mu_{si,sj}^{(g)} \equiv E[Y_{si}^{(g)} Y_{sj}^{(g)} | m_s^{(g)}, x_{si}^{(g)}, x_{sj}^{(g)}]:

Lemma (omitting the sub/superscripts (g) and s)

\mu_i = \frac{\sum_{l=1}^{\binom{n}{m}} v_{li} \exp\{\sum_{k=1}^{n} \beta^T x_k v_{lk}\}}{\sum_{l=1}^{\binom{n}{m}} \exp\{\sum_{k=1}^{n} \beta^T x_k v_{lk}\}},

\mu_{i,j} = \frac{\sum_{l=1}^{\binom{n}{m}} v_{li} v_{lj} \exp\{\sum_{k=1}^{n} \beta^T x_k v_{lk}\}}{\sum_{l=1}^{\binom{n}{m}} \exp\{\sum_{k=1}^{n} \beta^T x_k v_{lk}\}}.
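The Lemma can be checked by direct enumeration (simulated data). Two properties follow immediately from the definitions and serve as assertions: the conditional means sum to m (the stratum total is fixed), and \mu_{i,i} = \mu_i since Y_i is binary:

```python
# Compute mu_i and mu_{i,j} for one stratum by enumerating the
# binomial(n, m) assignment vectors and weighting each by its probability.
from itertools import combinations

import numpy as np

rng = np.random.default_rng(5)
n, m, p = 5, 2, 2
X = rng.normal(size=(n, p))
beta = np.array([0.4, -0.6])

vs = []
for ones in combinations(range(n), m):
    v = np.zeros(n)
    v[list(ones)] = 1.0
    vs.append(v)
vs = np.array(vs)                    # all vectors with m ones
w = np.exp(vs @ (X @ beta))          # exp{sum_k beta'x_k v_lk}, one per vector
w = w / w.sum()                      # conditional probabilities of the vectors

mu = w @ vs                          # mu_i = E*[Y_i | m, X]
mu2 = (vs * w[:, None]).T @ vs       # mu_{i,j} = E*[Y_i Y_j | m, X]
```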
Working correlation structures

Variance-covariance matrix of the Y's
From the assumptions made earlier:

Cov^*(Y_{si}^{(g)}, Y_{s'i'}^{(g')}) =
  0, if g \neq g';
  \mu_{si,si'}^{(g)} - \mu_{si}^{(g)} \mu_{si'}^{(g)}, if g = g', s = s';
  \rho^*(Y_{si}^{(g)}, Y_{s'i'}^{(g)}) \sqrt{\mu_{si}^{(g)}(1 - \mu_{si}^{(g)}) \mu_{s'i'}^{(g)}(1 - \mu_{s'i'}^{(g)})}, if g = g', s \neq s',

where \rho^*(Y_{si}^{(g)}, Y_{s'i'}^{(g)}) = Corr^*(Y_{si}^{(g)}, Y_{s'i'}^{(g)}).
Correlation structures
If we set \rho^*(Y_{si}^{(g)}, Y_{s'i'}^{(g)}) = 0, we get V^{(g)Indep} \equiv Var^*[Y^{(g)}], a block-diagonal matrix:

V^{(g)Indep} = diag(B_1^{(g)}, B_2^{(g)}, ..., B_{S^{(g)}}^{(g)}).

Then put A_s^{(g)1/2} = B_s^{(g)1/2} and A^{(g)1/2} = diag(A_s^{(g)1/2}, s = 1, ..., S^{(g)}), so that

V^{(g)Indep} = A^{(g)1/2} \, I \, A^{(g)1/2}.

⇒ Replace I by R^{(g)}(\alpha) ...
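The construction above can be sketched numerically. Assumptions in this sketch: two strata with m_s = 1 (so each within-stratum block is the multinomial form B_s = diag(\mu_s) - \mu_s \mu_s^T), and a simple exchangeable R(\alpha) over the whole cluster, purely for illustration:

```python
# Build the block-diagonal independence covariance from stratum blocks B_s,
# take symmetric square roots to form A^{1/2}, and swap I for a working
# correlation R(alpha).
import numpy as np


def block_diagonal(blocks):
    """Assemble a block-diagonal matrix from square blocks."""
    sizes = [b.shape[0] for b in blocks]
    out = np.zeros((sum(sizes), sum(sizes)))
    pos = 0
    for b, k in zip(blocks, sizes):
        out[pos:pos + k, pos:pos + k] = b
        pos += k
    return out


def sqrtm_sym(B):
    """Symmetric PSD square root via the eigendecomposition."""
    vals, vecs = np.linalg.eigh(B)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


# Two strata with m_s = 1: B_s = diag(mu_s) - mu_s mu_s'.
mus = [np.array([0.2, 0.3, 0.5]), np.array([0.6, 0.4])]
blocks = [np.diag(mu) - np.outer(mu, mu) for mu in mus]
V_indep = block_diagonal(blocks)

A_half = block_diagonal([sqrtm_sym(b) for b in blocks])

# Replace I by an exchangeable working correlation R(alpha).
alpha = 0.1
R = np.full((5, 5), alpha) + (1 - alpha) * np.eye(5)
V_work = A_half @ R @ A_half
```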
Correlation structures: a few remarks
- If the correlation is due to a cluster-specific random effect, then the working-independence GEE is actually the maximum likelihood score.
- If a correlation structure R^{(g)}(\alpha) other than independence is used, estimating the parameter \alpha is problematic.
- We therefore suggest using working independence with:
  - naive variance estimates (i.e., conditional maximum likelihood) if the correlation is due only to cluster-level random effects;
  - robust sandwich variance estimates if the correlation scheme is likely more complex.
Generalized estimating equations
Put Y^{(g)T} = (Y_1^{(g)T}, ..., Y_{S^{(g)}}^{(g)T}), \mu^{(g)}(\beta)^T = (\mu_1^{(g)T}, ..., \mu_{S^{(g)}}^{(g)T}) and D^{(g)} = \partial \mu^{(g)}(\beta) / \partial \beta^T.

GEE for conditional logistic regression:

U(\beta) = \sum_{g=1}^G D^{(g)T} \{V^{(g)}\}^{-1} \{Y^{(g)} - \mu^{(g)}(\beta)\} = 0.

Classical results (asymptotic normality and consistent variance estimation with the robust sandwich variance estimator) are still valid.
Model selection: the QIC criterion

Covariate selection
1. Backward selection based on robust sandwich standard errors ⇒ a valid approach, but an approach that compares all possible sub-models would be preferable.
2. Fit all sub-models and pick the model with the best AIC (the preferred approach in biology/ecology) ⇒ not valid if working independence is not the true model.
3. As in 2, but replace AIC by a valid criterion ⇒ Pan's QIC criterion (Biometrics, 2001)?
The QIC criterion
Let \Omega_I = \sum_{g=1}^G D^{(g)T} \{V^{(g)Indep}\}^{-1} D^{(g)}, let \hat{\beta}(R) be the solution of the GEE under a working correlation R, and let \hat{V}(R) be the corresponding robust sandwich variance estimate.

QIC (Pan, Biometrics, 2001)
In the case of conditional logistic regression, the quasi-likelihood under independence criterion (QIC) is defined as

QIC = -2 Q\{\hat{\beta}(R)\} + 2 \, trace\{\Omega_I \hat{V}(R)\},

where Q\{\hat{\beta}(R)\} is the log-likelihood (under R = I) evaluated at \beta = \hat{\beta}(R). We choose the model with the smallest QIC.
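The definition above is mechanical once the three ingredients are in hand. A generic helper (hypothetical names, toy matrices) also illustrates a known special case: when the robust variance equals \Omega_I^{-1}, the penalty reduces to 2p and QIC coincides with AIC:

```python
# QIC = -2 Q{beta(R)} + 2 trace(Omega_I * Vhat(R)), per Pan (2001).
import numpy as np


def qic(loglik_indep, omega_indep, v_robust):
    """Quasi-likelihood under independence criterion."""
    return -2.0 * loglik_indep + 2.0 * np.trace(omega_indep @ v_robust)


# Toy illustration: if Vhat = Omega_I^{-1}, then trace(Omega_I Vhat) = p
# and the criterion is -2*loglik + 2p, i.e. AIC.
p = 3
omega = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.5, 0.1],
                  [0.0, 0.1, 1.0]])
v_hat = np.linalg.inv(omega)
val = qic(-123.4, omega, v_hat)
```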
Example on elk travel in Yellowstone

Where is Yellowstone?
[Map: Yellowstone National Park, straddling Montana and Wyoming.]
Purpose of the analysis
Objective: determine whether the trophic cascade observed in the park could be caused by the influence of wolves on elk movement patterns.
What we are trying to show
Prediction: an increased risk of encountering wolves decreases the probability that elk visit trembling aspen stands.
Data collection
[Figure: successive relocations of one elk at 5-hour intervals (00h00, 5h00, 10h00, 15h00), defining the observed steps ("Pas").]
Are the steps placed randomly in the landscape?
The strata/matched sets
[Figure: step selection functions; see Fortin et al. (2005), Ecology 86(5): 1320-1330.]
Results, GEE with backward elimination and QIC

Variable              beta-hat
Drtmin                 0.744
Drtmin^2              -0.056
Aspenend               0.338
Forestend             -0.289
Forestprop            -0.770
Sslope                -2.189
Aspenend x Wavg3      -0.885
Forestend x Wavg3      0.313
Wavg3                  0.240

[Figure: elk step selection function during winter in Yellowstone — relative probability of selecting aspen, forest, and open habitat as a function of an index of wolf presence.]
Current/future research
- Simulations to check when QIC performs better than backward elimination.
- The link between prospective and retrospective correlations.
- Estimation of correlation-matrix parameters.