Lifetime Data Anal (2014) 20:106–131 DOI 10.1007/s10985-013-9255-7
On computing standard errors for marginal structural Cox models R. Ayesha Ali · M. Adnan Ali · Zhe Wei
Received: 22 January 2011 / Accepted: 27 March 2013 / Published online: 18 April 2013 © Springer Science+Business Media New York 2013
Abstract In recent decades, marginal structural models have gained popularity for proper adjustment of time-dependent confounders in longitudinal studies through time-dependent weighting. When the marginal model is a Cox model, fitting with current standard statistical software packages was thought to be problematic because those packages were not developed to compute standard errors in the presence of time-dependent weights. We address this practical modelling issue by extending the standard calculations for Cox models with case weights to time-dependent weights, and show that the coxph procedure in R can readily compute asymptotic robust standard errors. Through a simulation study, we show that the robust standard errors are rather conservative, though the corresponding confidence intervals have good coverage. A second contribution of this paper is to introduce a Cox score bootstrap procedure for computing the standard errors. We show that this method is efficient and tends to outperform the non-parametric bootstrap in small samples.

Keywords Cox models · Inverse-probability-of-treatment weights · Marginal structural models · Time-dependent weights · Score bootstrap
R. A. Ali (B)
Department of Mathematics and Statistics, University of Guelph, Guelph, ON, Canada
e-mail: [email protected]

M. A. Ali
Social Dynamics, Toronto, ON, Canada

Z. Wei
Canada Post, Ottawa, ON, Canada
1 Introduction

In many longitudinal settings with time-varying exposures, time-dependent covariates may simultaneously be intermediate variables in the causal pathway from past treatment to outcome, and confounders of the effect of future treatment on outcome. Such variables are called time-dependent confounders of the causal effect of the exposure on outcome (Robins 1992). The presence of these time-dependent confounders complicates estimation of the treatment effect: Robins (1992) showed that model estimates may be biased regardless of whether the time-dependent confounders are included in or excluded from the set of explanatory variables in the marginal model. However, the use of inverse-probability-of-treatment weights (IPTWs) in a marginal structural Cox model (MSCM) can properly account for these time-dependent covariates (Robins et al. 2000; Hernan et al. 2000). We refer the reader to Hernan et al. (2000) and Xiao et al. (2010) for good reviews of MSCMs. In this paper, we only highlight details needed to motivate and understand the computation of standard errors for these models.

When the marginal model is a Cox model, standard routines in statistical software packages such as SAS, S-Plus, and R can easily be used to compute parameter estimates, but the standard errors are typically not computed directly using Cox model routines because these routines were not developed to account for time-dependent weights (Robins 1998, p. 9). Robins et al. (2005, p. 2207) proposed modifying the partial likelihood function of the Cox model to accommodate time-dependent weights, but the mathematical details of the corresponding fitting algorithm were not presented. However, a pooled logistic regression model can approximate a Cox model if the probability of a subject's death or censorship in any single time period is small (D'Agostino et al. 1990). Hence, in practice, researchers opt instead to fit generalized estimating equation models with time-dependent weights and an independence working correlation matrix. The robust standard error option of the above-mentioned software programs can produce conservative 95 % Wald confidence intervals for the causal parameter of interest (Hernan et al. 2000; Robins 1998). Alternatively, one could use a non-parametric bootstrap procedure to estimate the standard errors (Cook et al. 2002; Hogan and Lee 2004; Petersen et al. 2007).

Standard statistical software packages (Insightful 2001; R Development Core Team 2010) use the formulation of Andersen and Gill (1982) to fit Cox models, which is based on a stochastic counting process. In particular, the pseudo-maximum likelihood estimation procedure derived by Binder (1992) is implemented in most statistical software to allow for the adjustment of sampling bias in complex study designs using (time-independent) case weights. Through a comprehensive simulation study, Xiao et al. (2010) compared estimation based on inputting time-dependent weights into Binder's algorithm for case weights to an unweighted model (in which time-dependent confounding is ignored) and to the pooled logistic regression approximation. They found that effect estimates from the weighted Cox model yielded lower empirical standard errors (ESEs), and in some cases lower bias, than the pooled logistic regression approximation. However, both the weighted Cox model and the pooled logistic regression model had smaller bias but systematically much higher variance than the unweighted Cox model. As such, they advocated further research into variance reduction techniques for MSCMs.
Note that these results were based on comparing the bias, ESEs, and root mean square errors (RMSEs) of parameter estimates based on the point estimates from 2,000 simulated samples.

In this paper we focus on computing standard errors from a single data set. We adapt the pseudo-maximum likelihood procedure of Binder (1992) to estimate parameters from a Cox model with time-dependent weights. Section 2 provides a brief review of marginal structural models. Section 3 extends the parameter estimation algorithm for (time-independent) case weights to the time-dependent setting. We implemented this procedure in R and found that the coxph routine can compute the correct robust standard errors, though they tend to be too conservative. Hence, as is often done in practice, it may be preferable to resort to bootstrapping to obtain standard errors. Section 4 briefly reviews the non-parametric bootstrap for survival data, and introduces a score bootstrap procedure for Cox models. Section 5 describes our simulation study and compares the results of the robust method to those of the bootstrap methods. Finally, Sect. 6 discusses and summarizes the results of this paper.

2 Background

We are interested in estimating the causal effect of a time-varying treatment (or exposure) A on the outcome Y in the presence of baseline variables B and a time-dependent confounder L. Let T denote the observed failure time or censoring time of a subject and, for ease of exposition, assume all variables except T are binary. We also assume that there are no unmeasured time-dependent confounders of the effect of A on Y.

2.1 Marginal structural Cox model

A classic example arises in the context of HIV-positive subjects, where A is a treatment aimed at raising subjects' CD4 counts, L is a second treatment used, if needed, to treat opportunistic infections, possibly arising as side effects of treatment A, and Y is the time until a subject's CD4 count goes above a given threshold. Clearly, L lies in the causal pathway from treatment status A to CD4 count. Further, due to subjects' underlying health status, subjects with generally poorer health are more likely both to require treatment L and to have lower CD4 counts. As such, L is a time-dependent confounder for the direct effect of A on CD4 count.

For simplicity, we omit the subscript i that marks subjects. We use overbars to represent the history of a covariate during follow-up; for example, $\bar{A}_t = \{A_u, 0 \le u < t\}$ is the treatment history from time 0 to time (t − 1). The conditional hazard at time t, given $\bar{A}_t$ and B, and baseline hazard h_0(t), is

$$h(t \mid \bar{a}_t, B) = h_0(t) \exp(\beta_1 f(\bar{a}_t) + \beta_2 B), \qquad (1)$$

where β_1 and β_2 may be vectors depending on the dimensions of f(ā_t) and B, and f(·) is a function that summarizes a subject's treatment history. In the presence of the time-dependent confounder L_t, the estimator β̂_1 is a biased estimate of the causal effect of A_t on CD4 count, regardless of whether or not L_t is included as a covariate in Eq. (1). In particular, if L_t is included in the marginal model, then β̂_1 may be biased
because L_t is an intermediate variable. On the other hand, if L_t is excluded from the marginal (unweighted) model, then β̂_1 may be biased because L_t is a confounder.

The above biases can be eliminated by adjusting for the time-dependent confounder through IPTWs, i.e., by fitting a Cox regression model weighted by the IPTWs (Hernan et al. 2000). The (stabilized) IPTW for a single subject at time t is given by

$$sw_t = \prod_{k=0}^{t} \frac{pr(A_k \mid \bar{A}_k = \bar{a}_k, B = b)}{pr(A_k \mid \bar{A}_k = \bar{a}_k, B = b, \bar{L}_k = \bar{l}_k)}. \qquad (2)$$
In the special case that the treatment effect is unconfounded (i.e., L_t is not a predictor of A_t), the numerator would equal the denominator in Eq. (2), the stabilized IPTW would equal one, and the weighted model would theoretically be equivalent to the standard (unweighted) Cox model. The estimator obtained from using the above weights is consistent (Robins 1998; Hernan et al. 2000). Hernan et al. (2000) showed that time-dependent weights could also be used to appropriately adjust for censoring due to loss to follow-up. In practice, the weights are often not known, and must be estimated from the data. If A_t is binary, then the parameter estimate from the marginal model, given the weights in Eq. (2), can be interpreted as the risk ratio associated with being on treatment versus never treated. Parameter estimates associated with other variables in the marginal model do not have a causal interpretation.

Xiao et al. (2010) proposed using Binder's algorithm for case weights, but with weights given by

$$sw_t^* = \frac{sw_t\, N(t)}{\sum_{i \in R(t)} sw_{t,i}},$$

where R(t) is the risk set at time t, sw_{t,i} is the stabilized weight sw_t for subject i, and N(t) is the size of the risk set at time t. Their results suggested that the variance of the MSCM estimator can be somewhat reduced by using the weights sw_t^* (based on standard deviations from Monte Carlo simulations), though more research needs to be done in this direction.

3 Methodology

We now review the estimation procedure for Cox models in order to set up notation, and then extend the algorithm to Cox models with time-dependent case weights.

3.1 Notation

Suppose a (finite) population is comprised of N units and let t_i be the observed failure time of subject i, i = 1, ..., N. Assume that subjects are ordered such that t_1 < t_2 < ... < t_N, and let δ_i be the censoring (or failure) indicator such that δ_i = 1 if subject i is an observed failure, and δ_i = 0 if subject i is censored.
For each subject, Y_i(t) is a survival indicator such that Y_i(t) = 1 if t ≤ t_i, and Y_i(t) = 0 otherwise. Let X_ik(t) be the kth covariate of subject i, where i = 1, ..., N and k = 1, ..., p; X_ik(t) could be either time-fixed or time-dependent. Then X_i(t) = [X_i1(t), X_i2(t), ..., X_ip(t)]' is the covariate vector for the ith subject at time t. Each subject in the population is then represented by (δ_i, Y_i(t), X_i(t)), for 0 ≤ t ≤ t_i, i = 1, ..., N.

If the data are distributed with hazard function h(t | x(t)) = h_0(t) exp(β'x(t)), then the partial likelihood function is

$$L = \prod_{i=1}^{N} \left[ \frac{\exp(\beta' x_i(t_i))}{\sum_{k=1}^{N} Y_k(t_i)\exp(\beta' x_k(t_i))} \right]^{\delta_i},$$

which is maximized by solving the associated estimating equation given by

$$U(\hat\beta) = \sum_{i=1}^{N} \delta_i \left\{ X_i(t_i) - \frac{S^{(1)}(\hat\beta, t_i)}{S^{(0)}(\hat\beta, t_i)} \right\} = 0, \qquad (3)$$

where

$$S^{(r)}(\hat\beta, t) = \sum_{i=1}^{N} Y_i(t)\, X_i^r(t) \exp(\hat\beta' X_i(t)), \quad \text{for } r = 0, 1. \qquad (4)$$
Suppose a sample of size n is drawn from a finite population of N units with unequal selection probability π_i for each subject i. Then, in practice, we cannot compute the partial likelihood because we have only observed data from a sample of n < N subjects. Lin and Wei (1989) proposed a robust covariance estimator for the Cox regression model when the model is misspecified. Binder (1992) extended these results to survey sampling settings with unequal, but known, selection probabilities per subject. Further, to find the design-based variance of (β̂ − β), Binder (1983) used Taylor series expansions to obtain an appropriate estimate of the associated score that is consistent, but slightly biased. We adapt this procedure to the time-dependent case weight setting, and summarize the relevant details below.

3.2 Cox model with time-dependent case weights

Assume that w_i(t), i = 1, ..., n, represent known time-dependent weights such that the contribution to the partial likelihood of subject i at risk at time t_j is weighted by w_i(t_j), with j ≠ i possible. Robins et al. (2005, p. 2207) showed that the IPTWs sw_i(t), i = 1, ..., n, are such time-dependent weights. However, we further assume that the weights are constructed such that the sum of the weights at any given time point is one, i.e., $\sum_{i=1}^{n} w_i(t) = 1$. Xiao et al. (2010) also proposed normalizing the weights sw_i(t) at each time step, but further added a factor of N(t) to the numerator in sw_t^*.
Analogous to Binder (1992), we replace Eqs. (3) and (4) with weighted sums, weighted by w_i(t_i), i = 1, 2, ..., n, as follows:

$$\tilde{U}(\hat\beta) = \sum_{i=1}^{n} \delta_i\, w_i(t_i) \left\{ X_i(t_i) - \frac{\tilde{S}^{(1)}(\hat\beta, t_i)}{\tilde{S}^{(0)}(\hat\beta, t_i)} \right\} = 0, \qquad (5)$$

where

$$\tilde{S}^{(r)}(\hat\beta, t) = \sum_{i=1}^{n} w_i(t)\, Y_i(t)\, X_i^r(t) \exp(\hat\beta' X_i(t)), \quad \text{for } r = 0, 1. \qquad (6)$$
Because we replace S^{(r)} in the partial likelihood function by the weighted sums S̃^{(r)}, r = 0, 1, the resulting parameter estimates are pseudo-maximum likelihood estimates. The negative partial derivative of Ũ(β̂) with respect to β̂ is

$$\tilde{J} = \sum_{i=1}^{n} \delta_i\, w_i(t_i) \left\{ \frac{\tilde{S}^{(2)}(\hat\beta, t_i)\, \tilde{S}^{(0)}(\hat\beta, t_i) - [\tilde{S}^{(1)}(\hat\beta, t_i)]^{\otimes 2}}{[\tilde{S}^{(0)}(\hat\beta, t_i)]^{2}} \right\}, \qquad (7)$$

where

$$\tilde{S}^{(2)}(\hat\beta, t) = \sum_{i=1}^{n} w_i(t)\, Y_i(t)\, [X_i(t)]^{\otimes 2} \exp(\hat\beta' X_i(t)),$$

and, for any column vector α, α^{⊗2} = αα'.

To obtain the finite population pseudo-maximum likelihood estimator β̂, the Newton–Raphson algorithm can be used to solve the non-linear system of equations Ũ(β̂) = 0, with update at the tth iteration given by

$$\hat\beta^{(t)} = \hat\beta^{(t-1)} + [\tilde{J}^{(t-1)}]^{-1}\, \tilde{U}(\hat\beta^{(t-1)}). \qquad (8)$$
Further, the design-based variance estimator of (β̂ − β) then takes the form J̃^{-1} V_U J̃^{-1}, where V_U = var(Ũ(β̂)). In the next section, we show how to estimate V_U.

3.3 Variance estimation

For all asymptotic assumptions to hold, we assume that we have a sequence of populations and sample designs such that both N and n are large. Note that estimating Eq. (5) can be written as

$$\tilde{U}(\hat\beta) = \sum_{i=1}^{n} w_i(t_i)\, u_i(\hat\beta) = 0, \qquad (9)$$
where u_i(β̂) is a function of weighted sums that depends on all the t_i's. Analogous to Lin and Wei (1989) and Binder (1992), we derive an alternate expression that is asymptotically equivalent to Ũ(β̂), but such that the u_i's do not depend on all the t_i's. In particular, using Taylor series expansions we derive the following expression for u_i(β̂) in Eq. (9):

$$u_i(\hat\beta) = \delta_i \left\{ X_i(t_i) - \frac{S^{(1)}(\hat\beta, t_i)}{S^{(0)}(\hat\beta, t_i)} \right\} - \frac{1}{w_i(t_i)} \int_0^\infty \frac{w_i(t)\, Y_i(t)\, X_i(t) \exp(\hat\beta' X_i(t))}{\tilde{S}^{(0)}(\hat\beta, t)}\, dG(t) + \frac{1}{w_i(t_i)} \int_0^\infty \frac{w_i(t)\, Y_i(t)\, \tilde{S}^{(1)}(\hat\beta, t) \exp(\hat\beta' X_i(t))}{[\tilde{S}^{(0)}(\hat\beta, t)]^2}\, dG(t), \qquad (10)$$

where

$$\tilde{G}(t) = \sum_{i=1}^{n} w_i(t)\, G_i(t), \qquad G_i(t) = \begin{cases} 1 & t \ge t_i,\ \delta_i = 1, \\ 0 & \text{otherwise}. \end{cases} \qquad (11)$$
G̃(t) is a weighted sum that estimates $G(t) = \sum_{i=1}^{N} G_i(t)$, and G_i(t) effectively counts the number of subjects at risk of failure at time t among those who are observed to fail at some time. We defer the derivation of Eq. (10) to Appendix 1.

Let W = diag(w_1(t_1), ..., w_n(t_n)) be an n × n weight matrix. Using the design-based method, we can estimate V_U and re-write it in terms of W as follows:

$$V_U = \sum_{i=1}^{n} \left[ w_i(t_i)\, u_i(\hat\beta) \right]^{\otimes 2} = \tilde{U}(\hat\beta)'\, W W\, \tilde{U}(\hat\beta), \qquad (12)$$

where, in the matrix product, Ũ(β̂) denotes the n × p matrix whose ith row is u_i(β̂)'. Note that as n → N and N → ∞, it can be verified from Eq. (12) that V_U → 0. Therefore, β̂ is generally a consistent, though not necessarily unbiased, estimator of β. The product W Ũ(β̂) is commonly known as the score residuals. To obtain an estimate of the variance of Ũ(β̂) in practice, estimates of S^{(0)}, S^{(1)} and of G can be substituted in the first term and in the last two terms, respectively, of Eq. (10) as follows:
$$u_i(\hat\beta) = \delta_i \left\{ X_i(t_i) - \frac{\tilde{S}^{(1)}(\hat\beta, t_i)}{\tilde{S}^{(0)}(\hat\beta, t_i)} \right\} - \frac{1}{w_i(t_i)} \sum_{j=1}^{n} \delta_j\, w_j(t_j)\, \frac{w_i(t_j)\, Y_i(t_j)\, X_i(t_j) \exp(\hat\beta' X_i(t_j))}{\tilde{S}^{(0)}(\hat\beta, t_j)} + \frac{1}{w_i(t_i)} \sum_{j=1}^{n} \delta_j\, w_j(t_j)\, \frac{w_i(t_j)\, Y_i(t_j)\, \tilde{S}^{(1)}(\hat\beta, t_j) \exp(\hat\beta' X_i(t_j))}{[\tilde{S}^{(0)}(\hat\beta, t_j)]^2}. \qquad (13)$$
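In practice, Eq. (13) corresponds to the score residuals underlying the robust (sandwich) variance. A minimal illustrative coxph call follows, with hypothetical column names start, stop, event, x1, x2, sw and id; this is a sketch of how such a fit might be specified, not the authors' code:

```r
library(survival)
# Counting-process data: one row per subject-interval, with the
# time-dependent weight sw attached to each interval.
fit <- coxph(Surv(start, stop, event) ~ x1 + x2 + cluster(id),
             data = mydata, weights = sw)
summary(fit)  # the "robust se" column gives the sandwich standard errors
```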
We implemented the above algorithm in R using the Breslow approximation for handling ties (Breslow 1974), and determined that the coxph routine can, in fact, correctly compute the (robust) asymptotic standard errors in the time-dependent weight setting. In other words, the coxph routine in R provides an asymptotically consistent, but slightly biased, estimate of the causal effect of treatment on outcome in the presence of weighting. See Appendix 2 for a worked example validating this claim. Unfortunately, as is common with clustered data, the robust method does not perform well, particularly in small samples; see Sect. 5.1 (Kline and Santos 2011). Hence, it may be beneficial to rely on bootstrap procedures to obtain standard errors for parameter estimates. In the next section we propose a new score bootstrap procedure that is computationally much more efficient than the non-parametric bootstrap for time-interval data.

4 Bootstrapping longitudinal data

Conceptually, the non-parametric bootstrap may be the simplest bootstrap procedure to implement: it simply requires generating B bootstrap samples of n subjects, re-estimating β in each bootstrap sample, and then computing the empirical standard error or some other quantity of interest (e.g., Wald statistic, empirical confidence interval) using all B samples. Cheng and Huang (2010) proved consistency of the bootstrap for general semiparametric M-estimators for the class of exchangeable bootstrap weights (of which the non-parametric bootstrap is a special case). Note that in practice, the non-parametric bootstrap can often produce bootstrap samples that have many ties or in which the Hessian is poorly behaved. Consequently, there are situations in which the fitting procedure does not converge in some bootstrap samples, making the non-parametric bootstrap difficult to use in practice (Kline and Santos 2011). Another practical concern for time-interval data is that a lot of computation time may be required just to build the data set containing the bootstrapped sample, because more than one row in the data frame is typically used for each subject. Further, the time-dependent weights need to be re-estimated in each bootstrap sample before the pseudo-likelihood is re-optimized. To avoid these practical problems, we now present a score bootstrap procedure for Cox models which tends to have comparable or better coverage than the non-parametric bootstrap and can be orders of magnitude faster.

4.1 Cox score bootstrap

Kline and Santos (2011) proposed the score bootstrap and established the consistency of this procedure for Wald and Lagrange multiplier type tests and tests of moment restrictions for a wide class of M-estimators under clustering and potential misspecification.
Algorithm 1 Cox score bootstrap for the standard error of β̂

Require: n = number of subjects; δ, Y, X per Sect. 3.1; B = number of bootstraps; w per Sect. 3.2; f_τ = density function, per Eq. (14).
Return: standard error of β̂

1:  function CoxScoreBootstrap(n, B, δ, Y, X, w, f_τ)
2:    compute Ũ(β̂), J̃, and β̂ from δ, Y, X, w and Eqs. (5), (7), (8), resp.
3:    for b = 1 to B − 1 do                         ▷ generate bootstrap sample b
4:      for i = 1 to n do                           ▷ apply bootstrap weights to score
5:        sample τ_i ∼ f_τ                          ▷ sample a bootstrap weight
6:        Ũ_i(β̂)^{(b)} ← τ_i Ũ_i(β̂)               ▷ perturbed score contribution
7:      end for
8:      β̂_(b) ← β̂ − J̃^{-1} Ũ(β̂)^{(b)}            ▷ single bootstrap iteration
9:    end for
10:   β̄_sb ← (1/B) Σ_{b=1}^{B} β̂_(b)              ▷ store empirical mean
11:   return [ (1/(B−1)) Σ_{b=1}^{B} (β̂_(b) − β̄_sb)² ]^{1/2}   ▷ return empirical std. dev.
12: end function
The score bootstrap is closely related to the wild bootstrap (Wu 1986). While the wild bootstrap requires sampling (with replacement) B sets of n weights to perturb the residuals, the score bootstrap requires sampling (with replacement) B sets of n weights to perturb the subject score contributions. In both procedures, the weights τ_i are distributed with

$$E[\tau_i] = 0 \quad \text{and} \quad E[\tau_i^2] = 1, \quad \text{for } i = 1, \ldots, n. \qquad (14)$$
Note that the wild bootstrap provides a means of producing a set of bootstrap residuals that mimic the heteroscedasticity inherent in the true errors. However, the residuals affect the limiting distribution of the OLS estimator (for linear models) through the score. Hence, the wild bootstrap can be interpreted as a perturbation of the score contributions, evaluated at the original sample estimate β̂, that leaves the associated Hessian unchanged. While this interpretation of the wild bootstrap is not easily generalizable to nonlinear models, the interpretation in terms of perturbed score contributions is. The consistency of the score bootstrap relies heavily on its connection with the wild bootstrap.

Here, we propose a Cox score bootstrap for the semi-parametric setting of Cox models. Each subject's bootstrap weight is applied to all of that subject's score contributions; hence the τ_i are essentially random case weights. We summarize the Cox score bootstrap for the standard error of β̂ as Algorithm 1. In this paper, we only consider perturbing the score contributions using Rademacher weights for τ_i, which take on values 1 and −1, each with probability one half. Note that the score contributions in step 3 correspond to those used in the Newton–Raphson algorithm for point estimation, and only uncensored subjects contribute to them. In fact,
each bootstrap estimate β̂_(b) can be thought of as a single Newton–Raphson step from β̂ using the original Hessian, but the perturbed score. Just as with the wild bootstrap, the score bootstrap does not require re-sampling cases, thereby ensuring each bootstrap sample retains the same dependence structure as the original data. Further, because every subject is included in every bootstrap sample, there is no need to re-estimate the time-dependent weights. The Cox score bootstrap is typically more stable than the non-parametric bootstrap and orders of magnitude faster. In the next section we compare the behaviour of the robust standard errors derived in Sect. 3.2 to that of the Cox score bootstrap and the non-parametric bootstrap through a simulation study.
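A minimal R sketch of Algorithm 1, built on the weighted score residuals that coxph already returns, is given below. The function name is ours, and the assumption that fit was fit without the robust option so that fit$var holds the inverse information J̃^{-1} is an assumption of the sketch, not a statement of the authors' code.

```r
library(survival)

# Cox score bootstrap (Algorithm 1) with Rademacher weights.
cox.score.bootstrap <- function(fit, id, B = 999) {
  U <- as.matrix(residuals(fit, type = "score", weighted = TRUE,
                           collapse = id))   # per-subject score residuals
  Jinv <- fit$var                            # (assumed) inverse information
  beta <- coef(fit)
  boot <- matrix(NA_real_, B, length(beta))
  for (b in seq_len(B)) {
    tau <- sample(c(-1, 1), nrow(U), replace = TRUE)  # Rademacher weights
    Ub  <- colSums(tau * U)                           # perturbed total score
    boot[b, ] <- beta - drop(Jinv %*% Ub)             # one Newton step from beta-hat
  }
  apply(boot, 2, sd)                         # bootstrap standard errors
}
```

Because no data are resampled and no model is refit, each iteration costs only a weighted column sum and one matrix–vector product.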
5 Simulation study

We generated data according to Young et al. (2008), in which there were n subjects, each with a maximum of M follow-up visits. Let Y_j be an indicator of failure by time j, A_j a binary treatment variable, and L_j a binary time-varying confounder, each over the interval (j, j+1). Let T represent failure time, and T_0 the counterfactual survival time under the never-treated regime. The algorithm for generating simulation data is given in Algorithm 2; see Young et al. (2008, 2009) or Xiao et al. (2010) for details. Briefly, for each subject, T_0 is sampled from an exponential distribution and, at each time step t, the subject's values for L_t and A_t are sampled from Bernoulli distributions with probabilities p_L = Pr(L_t = 1 | L_{t−1}, A_{t−1}, Y_t = 0) and p_A = Pr(A_t = 1 | L_t, L_{t−1}, A_{t−1}, Y_t = 0), respectively, where

$$\text{logit}(p_L) = \gamma_0 + \gamma_1 I[T_0 < c] + \gamma_2 A_{t-1} + \gamma_3 L_{t-1}, \quad \text{and} \qquad (15)$$

$$\text{logit}(p_A) = \alpha_0 + \alpha_1 L_t + \alpha_2 L_{t-1} + \alpha_3 A_{t-1}. \qquad (16)$$
As in Young et al. (2009), we set (γ_0, γ_1, γ_2, γ_3) = (log(3/7), 2, log(0.5), log(1.5)), (α_0, α_1, α_2, α_3) = (log(2/7), 0.5, 0.5, log(4)), and c = 30. The cumulative hazard is updated and compared to T_0 to determine whether the subject failed in (t, t+1]. When fitting models to the data, the time-dependent weights were the stabilized weights of Young et al. (2009), as per Eq. (2). In particular, the weights were estimated based on the fitted values obtained from separate logistic regression models for the numerator and denominator of the weights; a sketch of this step is given below. For all runs we generated R = 1,000 samples, and for each sample we generated B = 999 bootstrap samples. We compared the standard errors and 95 % confidence intervals produced by the robust method, the Cox score bootstrap and the non-parametric bootstrap. For both bootstrap procedures, we computed ESEs and empirical 95 % confidence intervals from the B bootstrap samples. Based on the R samples, we also computed the average bias, ESE and RMSE of the estimate of β output by coxph. In our simulation study, we performed 24 runs comprising all combinations of the parameter values shown in Table 1. To improve the computational speed of the non-parametric bootstrap, we implemented a parallelized version of the procedure using a master–slave configuration.
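The weight-estimation step can be sketched in R as follows; the data frame df with columns id, A, Alag, L, Llag is hypothetical, and rows are assumed ordered by subject and visit:

```r
# Numerator and denominator models for the stabilized weights in Eq. (2).
num <- glm(A ~ Alag, family = binomial, data = df)
den <- glm(A ~ Alag + L + Llag, family = binomial, data = df)
pn <- ifelse(df$A == 1, fitted(num), 1 - fitted(num))  # pr of the observed A_t
pd <- ifelse(df$A == 1, fitted(den), 1 - fitted(den))
df$sw <- ave(pn / pd, df$id, FUN = cumprod)  # cumulative product over k = 0..t
```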
Algorithm 2 Simulation study data generation

Require: n = number of subjects; M = maximum follow-up time; β = hazard ratio (treated to never-treated); λ = baseline hazard; c = confounding-inducing parameter.
Return: ID = subject ID; T = failure time for uncensored subjects; t = interval [t − 1, t); A_{t−1}, A_t = binary treatment value at t − 1, t; Y_t = failure indicator; L_{t−1}, L_t = binary confounder value at t − 1, t.

1:  function SimulateData(n, M, β, λ, c)
2:    for all ID ∈ [1, n] do                                    ▷ for all subjects
3:      L_{−1} ← A_{−1} ← Y_0 ← 0                               ▷ initialization
4:      sample T_0 ∼ Exp(λ)                                     ▷ counterfactual never-treated failure time
5:      for t = 0 to M do                                       ▷ for each observation
6:        p_L = Pr(L_t = 1 | L_{t−1}, A_{t−1}, Y_t = 0)         ▷ see Eq. (15)
7:        sample L_t ∼ Bernoulli(p_L)                           ▷ L_t is confounding status at time t
8:        p_A = Pr(A_t = 1 | L_t, L_{t−1}, A_{t−1}, Y_t = 0)    ▷ see Eq. (16)
9:        sample A_t ∼ Bernoulli(p_A)                           ▷ A_t is treatment status at time t
10:       Λ_{t+1} ← ∫_0^{t+1} exp(β A_j) dj                     ▷ cumulative hazard to (t + 1)
11:       if T_0 > Λ_{t+1} then
12:         Y_{t+1} ← 0                                         ▷ subject did not fail in (t, t + 1]
13:       else
14:         Y_{t+1} ← 1                                         ▷ subject did fail in (t, t + 1]
15:         T ← t + (T_0 − Λ_t) exp(−β A_t)                     ▷ record actual failure time
16:         break                                               ▷ exit inner for loop
17:       end if
18:     end for
19:   end for
20:   return ID, t, Y_t, T, A_{t−1}, A_t, L_{t−1}, L_t          ▷ data from all subjects
21: end function
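For concreteness, a compact R sketch of Algorithm 2 for a single subject is shown below; the function and variable names are ours (c.conf stands in for the confounding parameter c, which clashes with the R function c), and the default γ and α values are those given after Eq. (16):

```r
simulate.subject <- function(M, beta, lambda, c.conf,
                             gamma = c(log(3/7), 2, log(0.5), log(1.5)),
                             alpha = c(log(2/7), 0.5, 0.5, log(4))) {
  T0 <- rexp(1, rate = lambda)       # counterfactual never-treated failure time
  A <- L <- 0; Lam <- 0; rows <- NULL
  for (t in 0:M) {
    pL <- plogis(gamma[1] + gamma[2] * (T0 < c.conf) +
                 gamma[3] * A + gamma[4] * L)            # Eq. (15)
    L.new <- rbinom(1, 1, pL)
    pA <- plogis(alpha[1] + alpha[2] * L.new +
                 alpha[3] * L + alpha[4] * A)            # Eq. (16)
    A.new <- rbinom(1, 1, pA)
    Lam <- Lam + exp(beta * A.new)     # cumulative hazard over (t, t+1]
    rows <- rbind(rows, data.frame(t = t, A = A.new, L = L.new, Y = T0 <= Lam))
    if (T0 <= Lam) break               # subject failed in (t, t+1]
    A <- A.new; L <- L.new
  }
  rows
}
```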
Table 1 Parameter values used to simulate data

Parameter   Parameter settings
n           250, 2,500
M           5, 10
β           −0.3, 0.0, 0.3
λ           0.01, 0.10
All code was written in R, and all simulations were performed on 64-core machines equipped with 2.2 GHz AMD Opteron chips in a CentOS 5.x environment.
5.1 Results

Table 2 presents the average bias and RMSE of the parameter estimates for all 24 runs. Note that the statistics provided for n = 2,500 and M = 10 were consistent with those found by Xiao et al. (2010) and Young et al. (2009). Also, the estimation method developed in Sect. 3.2 produced relatively unbiased estimates for large n. However, for small n, a longer follow-up time may be needed to obtain estimates with low bias.
Table 2 Bias and RMSE of the estimate β̂ using coxph with stabilized time-dependent weights, averaged over 1,000 simulations

                      n = 250                n = 2,500
λ      M    β         Bias       RMSE        Bias       RMSE
0.01   5    −0.3      −0.0449    0.7546      −0.0068    0.2073
            0.0       −0.0006    0.6554      0.0027     0.1914
            0.3       0.0123     0.6309      −0.0026    0.1834
       10   −0.3      −0.0295    0.5258      0.0012     0.1544
            0.0       −0.0165    0.4865      −0.0073    0.1496
            0.3       0.0273     0.4692      0.0002     0.1361
0.10   5    −0.3      −0.0037    0.2237      0.0007     0.0720
            0.0       −0.0037    0.2157      0.0017     0.0658
            0.3       −0.0049    0.2052      −0.0018    0.0623
       10   −0.3      −0.0002    0.1901      0.0006     0.0573
            0.0       0.0009     0.1742      0.0015     0.0552
            0.3       0.0008     0.1692      −0.0010    0.0532
As expected, the bias and RMSE were larger for smaller data sets or for rarer diseases.

Table 3 presents the standard errors obtained from all of the methods. Note that the 'Empirical' standard errors were obtained from the R = 1,000 simulations, whereas the other standard errors were those obtained by the respective methods, averaged over the R = 1,000 simulations. Since all biases were low in general, the RMSEs reported here were very similar to the corresponding empirical standard errors. The robust standard errors were highly inflated relative to the ESEs. Interestingly, the average of the robust standard errors did not decrease markedly with an increase from n = 250 to n = 2,500 subjects, though the empirical and bootstrapped standard errors did. For n = 2,500, both bootstrap methods produced standard errors that were in line with the empirical ones, but the trend was different for n = 250. The non-parametric bootstrap standard errors were inflated in small samples when the disease rate was low (λ = 0.01), particularly when M, the maximum follow-up time, was low.

Table 4 presents the empirical coverage probabilities observed for the different methods. Overall, the confidence intervals produced by our robust method had good coverage probabilities, particularly for rare diseases (λ = 0.01). The coverage probabilities for the Cox score bootstrap were generally good, though slightly lower than for the robust method. This poorer performance seemed to stem from the Cox score bootstrap producing slightly narrower confidence intervals relative to the robust method.

Table 5 summarizes the distributions of confidence interval length obtained by the different methods. For small samples (n = 250), the non-parametric bootstrap tended to have poor empirical coverage probabilities despite the fact that some of the confidence intervals were very wide. Further inspection of these intervals showed that
Table 3 Comparison of standard errors of β̂ using coxph with stabilized time-dependent weights, based on 1,000 simulations

                            Estimation method
n      λ     M   β         Empirical   Average robust   Cox score bootstrap   Non-parametric bootstrap
250    0.01  5   −0.3      0.7536      0.9442           0.7157                3.8877
                 0.0       0.6558      1.2428           0.6509                2.7286
                 0.3       0.6311      1.6731           0.5978                2.0470
             10  −0.3      0.5253      0.8226           0.5085                0.8598
                 0.0       0.4865      1.1071           0.4722                0.6361
                 0.3       0.4687      1.5542           0.4424                0.5630
       0.10  5   −0.3      0.2238      0.7567           0.2318                0.2359
                 0.0       0.2157      1.0197           0.2167                0.2194
                 0.3       0.2052      1.3719           0.2041                0.2067
             10  −0.3      0.1902      0.7542           0.1837                0.1852
                 0.0       0.1743      1.0161           0.1740                0.1756
                 0.3       0.1693      1.3704           0.1674                0.1686
2,500  0.01  5   −0.3      0.2073      0.7516           0.2109                0.2132
                 0.0       0.1915      1.0214           0.1939                0.1951
                 0.3       0.1834      1.3691           0.1805                0.1814
             10  −0.3      0.1545      0.7505           0.1584                0.1586
                 0.0       0.1495      1.0039           0.1468                0.1467
                 0.3       0.1361      1.3627           0.1379                0.1377
       0.10  5   −0.3      0.0720      0.7433           0.0726                0.0728
                 0.0       0.0658      1.0038           0.0676                0.0677
                 0.3       0.0623      1.3501           0.0639                0.0638
             10  −0.3      0.0573      0.7425           0.0574                0.0574
                 0.0       0.0552      1.0031           0.0545                0.0546
                 0.3       0.0532      1.3504           0.0525                0.0524
the empirical distribution of the β̂_(b)'s obtained from the B = 999 bootstrap estimates tended to be highly skewed, resulting in confidence intervals that were often close to β but did not cover it (details not shown). However, this instability was not apparent in larger samples. It is worth noting here that convergence of the parameter estimates in the bootstrap samples was tracked, and we found that all parameter estimates converged. It may be that some of the parameter estimates in the bootstrap samples converged before the likelihood did, particularly for n = 250. However, such instability is not apparent in the Cox score bootstrap because all subjects are included in every bootstrap sample, thereby retaining all the structure that is in the original data.
Table 4 Coverage probabilities of 95 % confidence intervals for β using coxph with stabilized time-dependent weights, averaged over 1,000 simulations

                            Estimation method
n      λ     M   β         Average robust   Cox score bootstrap   Non-parametric bootstrap
250    0.01  5   −0.3      0.958            0.941                 0.945
                 0.0       0.970            0.963                 0.953
                 0.3       0.961            0.956                 0.941
             10  −0.3      0.956            0.950                 0.943
                 0.0       0.944            0.944                 0.943
                 0.3       0.943            0.941                 0.934
       0.10  5   −0.3      0.954            0.951                 0.954
                 0.0       0.945            0.951                 0.945
                 0.3       0.946            0.946                 0.942
             10  −0.3      0.944            0.943                 0.942
                 0.0       0.953            0.949                 0.952
                 0.3       0.940            0.943                 0.942
2,500  0.01  5   −0.3      0.958            0.954                 0.954
                 0.0       0.953            0.954                 0.955
                 0.3       0.953            0.951                 0.947
             10  −0.3      0.951            0.952                 0.949
                 0.0       0.940            0.941                 0.935
                 0.3       0.965            0.958                 0.955
       0.10  5   −0.3      0.949            0.947                 0.947
                 0.0       0.955            0.954                 0.958
                 0.3       0.962            0.960                 0.961
             10  −0.3      0.954            0.958                 0.955
                 0.0       0.947            0.948                 0.948
                 0.3       0.945            0.944                 0.947
The average processing time needed to obtain standard errors with each bootstrap procedure is shown in Table 6. Clearly, the Cox score bootstrap can be orders of magnitude faster than the non-parametric bootstrap, even after parallelizing the latter across 64 cores. In particular, the Cox score bootstrap took less than one minute to run, in serial, regardless of the sample size. The computation times given in Table 6 are somewhat inflated because more than just the β̂ values were recorded from each run. Further, for the non-parametric bootstrap, models were fit using both weighting schemes sw_t and sw_t^* within each bootstrap iteration. However, the effects of these extra computations are minimal, since most of the time is spent constructing a data set containing the bootstrapped sample.
Table 5 Distributions of 95 % confidence interval lengths (min, median, max) for β using coxph with stabilized time-dependent weights, averaged over 1,000 simulations

                            Estimation method
n      λ     M   β         Average robust          Cox score bootstrap     Non-parametric bootstrap
250    0.01  5   −0.3      (1.741, 2.688, 5.078)   (1.388, 1.957, 3.310)   (1.868, 3.841, 41.520)
                 0.0       (1.680, 2.637, 4.446)   (1.529, 2.282, 4.478)   (1.864, 3.081, 41.310)
                 0.3       (1.742, 2.446, 4.631)   (1.453, 2.266, 4.054)   (1.584, 2.770, 40.690)
             10  −0.3      (1.733, 2.430, 4.365)   (1.379, 1.949, 3.170)   (1.440, 2.207, 21.790)
                 0.0       (1.356, 1.817, 3.377)   (1.282, 1.812, 2.935)   (1.345, 2.006, 21.080)
                 0.3       (1.306, 1.697, 3.140)   (1.249, 1.694, 3.116)   (1.326, 1.843, 20.150)
       0.10  5   −0.3      (0.740, 0.905, 1.119)   (0.749, 0.909, 1.122)   (0.746, 0.929, 1.160)
                 0.0       (0.736, 0.845, 0.981)   (0.718, 0.850, 1.012)   (0.734, 0.863, 1.050)
                 0.3       (0.689, 0.796, 0.982)   (0.676, 0.801, 0.960)   (0.671, 0.814, 1.014)
             10  −0.3      (0.628, 0.706, 0.851)   (0.627, 0.722, 0.880)   (0.604, 0.727, 0.870)
                 0.0       (0.592, 0.679, 0.800)   (0.587, 0.682, 0.800)   (0.582, 0.690, 0.858)
                 0.3       (0.574, 0.651, 0.809)   (0.551, 0.655, 0.771)   (0.570, 0.661, 0.819)
2,500  0.01  5   −0.3      (0.697, 0.824, 1.021)   (0.645, 0.829, 1.034)   (0.690, 0.839, 1.076)
                 0.0       (0.663, 0.759, 0.882)   (0.645, 0.760, 0.897)   (0.655, 0.768, 0.930)
                 0.3       (0.606, 0.715, 0.864)   (0.615, 0.708, 0.848)   (0.601, 0.713, 0.900)
             10  −0.3      (0.547, 0.618, 0.724)   (0.513, 0.620, 0.730)   (0.534, 0.623, 0.755)
                 0.0       (0.507, 0.575, 0.658)   (0.496, 0.577, 0.693)   (0.493, 0.577, 0.694)
                 0.3       (0.490, 0.540, 0.608)   (0.468, 0.541, 0.642)   (0.471, 0.541, 0.625)
       0.10  5   −0.3      (0.271, 0.285, 0.303)   (0.258, 0.286, 0.322)   (0.256, 0.286, 0.320)
                 0.0       (0.252, 0.265, 0.277)   (0.239, 0.266, 0.291)   (0.235, 0.267, 0.295)
                 0.3       (0.236, 0.250, 0.266)   (0.226, 0.252, 0.281)   (0.224, 0.252, 0.282)
             10  −0.3      (0.214, 0.225, 0.239)   (0.206, 0.225, 0.249)   (0.205, 0.226, 0.249)
                 0.0       (0.204, 0.214, 0.227)   (0.195, 0.214, 0.239)   (0.190, 0.215, 0.242)
                 0.3       (0.197, 0.206, 0.216)   (0.186, 0.207, 0.227)   (0.186, 0.206, 0.237)
Table 6 Average processing time (min) needed to obtain standard errors for β̂ using coxph with stabilized time-dependent weights, averaged over 1,000 simulations. The non-parametric bootstrap was parallelized across 64 cores; the score bootstrap was performed in serial

                            Estimation method
n      λ     M   β         Cox score bootstrap   Non-parametric bootstrap
250    0.01  5   −0.3      0.065                 1.831
                 0.0       0.067                 2.100
                 0.3       0.072                 2.122
             10  −0.3      0.074                 3.155
                 0.0       0.074                 3.519
                 0.3       0.074                 3.116
       0.10  5   −0.3      0.076                 2.318
                 0.0       0.071                 2.182
                 0.3       0.074                 2.008
             10  −0.3      0.078                 2.766
                 0.0       0.083                 2.699
                 0.3       0.086                 3.351
2,500  0.01  5   −0.3      0.453                 17.291
                 0.0       0.461                 18.918
                 0.3       0.440                 16.912
             10  −0.3      0.543                 36.101
                 0.0       0.523                 33.167
                 0.3       0.527                 30.798
       0.10  5   −0.3      0.482                 14.561
                 0.0       0.523                 20.873
                 0.3       0.550                 15.117
             10  −0.3      0.643                 25.851
                 0.0       0.575                 21.732
                 0.3       0.570                 24.613
Although not presented here, we examined the behaviour of the robust and bootstrap methods for computing standard errors using the weights sw_t^*, and found the results very similar to those using the stabilized weights of Robins et al. (2000).

6 Discussion

This paper was motivated by the need to fit time-dependent Cox models with time-dependent weights, as per MSCMs. Robins et al. (2000) showed that, in the absence of censoring, an unbiased and consistent estimator of the unknown parameter β can be obtained by fitting the ordinary time-dependent Cox model in Eq. (1) such that the contribution of a subject to a calculation performed on subjects at risk at time t is weighted by sw_t, as defined in Eq. (2), with t_i > t added to the conditioning event. However, no standard error calculations in the current statistical software packages were designed to allow for time-varying weights.

This paper theoretically develops an algorithm to compute robust standard errors for parameter estimates from a MSCM, and we found that the current implementation of coxph in the R software can readily compute these robust standard errors. As with the approach of Binder (1992), the resulting effect estimates are consistent, but slightly biased. However, our simulation results suggest that this bias is minimal in large data sets, or even in small data sets with long follow-up. Note that the robust variance estimator can be shown to be an approximation of the jackknife estimate of variance, and is a natural extension of the analogous quantity in the case weight setting. Further, Therneau and Grambsch (2001, p. 159) explain that the actual jackknife estimate tends to overestimate the variance for small n, whereas the estimate based on the score residuals performs quite well. In the case weight or unweighted settings, our variance estimator reduces to the same robust variance estimator given by coxph in R.

In our simulations, the robust standard errors tended to be highly inflated relative to the ESEs. We introduced the Cox score bootstrap as a means of efficiently obtaining the correct standard errors, and compared the robust method to the Cox score bootstrap and to the more conventional non-parametric bootstrap. The Cox score bootstrap consistently produced standard errors in line with the ESEs, whereas the non-parametric bootstrap showed some instability in smaller samples, especially in the low disease rate setting.

The Cox score bootstrap is also far more computationally efficient than the non-parametric bootstrap. We implemented a version of the non-parametric bootstrap parallelized across 64 cores. One might think that further parallelization would add benefit; however, per Amdahl's Law (Amdahl 1967), as the number of CPUs increases the communication costs eventually consume a significant fraction of the runtime. Further, the Cox score bootstrap was at times orders of magnitude faster than the non-parametric bootstrap, even though it was implemented serially.

Although the robust standard errors tended to be rather conservative, we found that the associated confidence intervals for β were better behaved and had better coverage probabilities in general, compared to both bootstrap methods. The 95 % confidence intervals produced by the Cox score bootstrap tended to be only marginally narrower than those of the robust method, but at the cost of poorer coverage. This phenomenon has been noted before by Burr (1994), who compared several bootstrap methods for confidence intervals in Cox models and found that none of the methods improved upon the robust (asymptotic) method, though the bootstrap methods could closely compete with it.

Our current implementation of the Cox score bootstrap has only considered Rademacher weights, though other bootstrap weights may warrant further investigation. Although not needed in the simulation study presented herein, it is also possible to compute an estimate of the covariance matrix of the parameter estimates in the marginal model by applying the bootstrap weights to the score residuals, and then using the perturbed score residuals to obtain V_U in Eq. (12).
In fact, the Cox score bootstrap could be interpreted as a wild bootstrap for the score residuals.
In conclusion, we have shown that our robust method produces confidence intervals with good coverage probabilities, but standard errors that are rather conservative. The Cox score bootstrap closely competes with the robust method with respect to coverage, but produces much more accurate standard errors. The Cox score bootstrap can also be orders of magnitude faster than the non-parametric bootstrap. It would be interesting to see how the Cox score bootstrap could be used for hypothesis testing, though we leave this problem for future work.

Acknowledgments We thank Tony Desmond, Gerarda Darlington, Babette Brumback and two anonymous reviewers for helpful comments. We also thank Erica Moodie for providing code to generate data and Thomas Gerds for his input on computing issues. Simulations were performed on the Shared Hierarchical Academic Research Computing Network (SHARCNET: www.sharcnet.ca) through Compute/Calcul Canada. This work was supported by NSERC.
Appendix 1

We now derive expression (10) for u_i(β̂) in (9). First, we re-express (5) as follows:

$$\tilde{U}(\hat\beta) = \sum_{i=1}^{n} w_i(t_i)\, \delta_i\, X_i(t_i) - \int_0^\infty \frac{\tilde{S}^{(1)}(\hat\beta, t)}{\tilde{S}^{(0)}(\hat\beta, t)}\, d\tilde{G}(t) = 0,$$

where G̃(t) is as defined in (11). Suppressing the arguments of functions, and taking a first-order Taylor series expansion of (5) around S̃^{(0)} = S^{(0)}, S̃^{(1)} = S^{(1)}, and G̃ = G, we have

$$\tilde{U}(\hat\beta) = U(\hat\beta) + \left(\tilde{S}^{(0)} - S^{(0)}\right) \frac{\partial \tilde{U}}{\partial \tilde{S}^{(0)}}\bigg|_{S^{(0)}, S^{(1)}, G} + \left(\tilde{S}^{(1)} - S^{(1)}\right) \frac{\partial \tilde{U}}{\partial \tilde{S}^{(1)}}\bigg|_{S^{(0)}, S^{(1)}, G} + \left(\tilde{G} - G\right) \frac{\partial \tilde{U}}{\partial \tilde{G}}\bigg|_{S^{(0)}, S^{(1)}, G}, \qquad (17)$$

where

$$\frac{\partial \tilde{U}}{\partial \tilde{S}^{(0)}}\bigg|_{S^{(0)}, S^{(1)}, G} = \int_0^\infty \frac{S^{(1)}}{[S^{(0)}]^2}\, dG(t), \qquad \frac{\partial \tilde{U}}{\partial \tilde{S}^{(1)}}\bigg|_{S^{(0)}, S^{(1)}, G} = -\int_0^\infty \frac{1}{S^{(0)}}\, dG(t), \quad \text{and} \quad \frac{\partial \tilde{U}}{\partial \tilde{G}}\bigg|_{S^{(0)}, S^{(1)}, G} = -\frac{S^{(1)}}{S^{(0)}}.$$

The remainder terms of the Taylor series expansion are negligible because S̃^{(0)}, S̃^{(1)} and G̃ are consistent estimates of S^{(0)}, S^{(1)} and G, respectively. Plugging in the above
partial derivatives, and noting that $\sum_{i=1}^{n} w_i(t_i)\, \delta_i\, (S^{(1)}/S^{(0)}) = \int_0^\infty (S^{(1)}/S^{(0)})\, d\tilde{G}(t)$, the right-hand side of Eq. (17) can be reduced to

$$\sum_{i=1}^{n} w_i(t_i)\, \delta_i\, X_i(t_i) - \int_0^\infty \frac{S^{(1)}}{S^{(0)}}\, d\tilde{G}(t) + \int_0^\infty \frac{\tilde{S}^{(0)}\, S^{(1)}}{[S^{(0)}]^2}\, dG(t) - \int_0^\infty \frac{\tilde{S}^{(1)}}{S^{(0)}}\, dG(t). \qquad (18)$$

After substituting Eq. (6) for S̃^{(0)} and S̃^{(1)} in the last two terms of Eq. (18), approximating the second term with $\int_0^\infty (S^{(1)}/S^{(0)})\, d\tilde{G}(t)$, and interchanging the order of integration and summation, we get

$$\sum_{i=1}^{n} w_i(t_i)\, u_i(\hat\beta) = \sum_{i=1}^{n} w_i(t_i)\, \delta_i \left\{ X_i(t_i) - \frac{S^{(1)}(\hat\beta, t_i)}{S^{(0)}(\hat\beta, t_i)} \right\} - \sum_{i=1}^{n} \int_0^\infty \frac{w_i(t)\, Y_i(t)\, X_i(t) \exp(\hat\beta' X_i(t))}{\tilde{S}^{(0)}(\hat\beta, t)}\, dG(t) + \sum_{i=1}^{n} \int_0^\infty \frac{w_i(t)\, Y_i(t)\, \tilde{S}^{(1)}(\hat\beta, t) \exp(\hat\beta' X_i(t))}{[\tilde{S}^{(0)}(\hat\beta, t)]^2}\, dG(t).$$

Hence, (18) is asymptotically equivalent to Ũ(β̂) in (9), where u_i(β̂) is as given in (10). Further, since

$$\sum_{i=1}^{n} w_i(t_i)\, u_i(\hat\beta) = \sum_{i=1}^{n} w_i(t_i)\, \delta_i \left\{ X_i(t_i) - \frac{S^{(1)}(\hat\beta, t_i)}{S^{(0)}(\hat\beta, t_i)} \right\} - \int_0^\infty \left\{ \frac{\tilde{S}^{(1)}(\hat\beta, t)}{S^{(0)}(\hat\beta, t)} - \frac{S^{(1)}(\hat\beta, t)\, \tilde{S}^{(0)}(\hat\beta, t)}{[S^{(0)}(\hat\beta, t)]^2} \right\} dG(t),$$
then we have

$$E[u_i(\hat\beta)] \to \frac{1}{N} \sum_{i=1}^{N} \delta_i \left\{ X_i(t_i) - \frac{S^{(1)}(\hat\beta, t_i)}{S^{(0)}(\hat\beta, t_i)} \right\}, \quad \text{as } n \to N \text{ and } N \to \infty,$$
which by (3) equals zero. In other words, Ũ(β̂) is a consistent, though not necessarily unbiased, estimator of 0. It can easily be seen that if the weights sw_t^* were used, the resulting coefficient estimates would still be consistent and slightly biased.

Appendix 2

In this appendix, we show how one can do parameter estimation using the Breslow approximation, and then provide a toy example that demonstrates that coxph can accommodate time-dependent weights when computing asymptotic standard errors.

First, we re-write the partial log-likelihood, score vector and Fisher information matrix such that we can easily compute them from data. There is a term in the (partial) likelihood function for every event. When multiple subjects have an event at the same time, i.e., event times are tied, the Breslow approximation does not assume that the exact time of any death is unique. Hence the contribution to the likelihood is simply the ratio of each subject's score to the sum of scores for all subjects at risk just before the event time (i.e., any subject for which Y(t_i) = 1 for event time t_i). The log-likelihood is computed as follows (compare the partial likelihood in Sect. 3.1):

$$LL = \sum_{i=1}^{n} W_i A_i,$$

where W_i = δ_i w_i(t_i) and A_i = B_i − ln(Σ_{j=1}^{n} C_j(t_i)), with B_i = β'X_i(t_i) and C_j(t) = w_j(t) Y_j(t) exp(β'X_j(t)). Define

$$\bar{X}(t) = \frac{\sum_{j=1}^{n} X_j(t)\, C_j(t)}{\sum_{j=1}^{n} C_j(t)}.$$

It is easy to verify that X̄(t) = S̃^{(1)}(β̂, t)/S̃^{(0)}(β̂, t). Further, we can re-write the score vector in Eq. (5) and Fisher information matrix in Eq. (7) as follows:

$$U = \sum_{i=1}^{n} W_i \{ X_i(t_i) - \bar{X}(t_i) \}, \quad \text{and}$$
$$J = \sum_{i=1}^{n} W_i \left\{ \frac{\sum_{j=1}^{n} [X_j(t_i)]^{\otimes 2}\, C_j(t_i)}{\sum_{j=1}^{n} C_j(t_i)} - [\bar{X}(t_i)]^{\otimes 2} \right\}.$$

In fact, we can simplify the information matrix even further as follows:

$$J_{kl} = \sum_{i=1}^{n} W_i \left\{ \frac{\sum_{j=1}^{n} X_{kj}(t_i)\, X_{lj}(t_i)\, C_j(t_i)}{\sum_{j=1}^{n} C_j(t_i)} - \bar{X}_k(t_i)\, \bar{X}_l(t_i) \right\}, \quad \text{and} \quad J_{kk} = \sum_{i=1}^{n} W_i \left\{ \frac{\sum_{j=1}^{n} X_{kj}^2(t_i)\, C_j(t_i)}{\sum_{j=1}^{n} C_j(t_i)} - \bar{X}_k^2(t_i) \right\},$$
for k, l = 1, ..., p, where X_{kj}(t_i) is subject j's value of the kth covariate. Similarly, X̄_k(t_i) is the kth component of X̄(t_i).

For variance estimation of the model coefficients, we re-write Eq. (9) as:

$$\tilde{U}(\hat\beta) = \sum_{i=1}^{n} W_i \{X_i(t_i) - \bar{X}(t_i)\} - \sum_{i=1}^{n} \sum_{j=1}^{n} W_j\, \frac{X_i(t_j)\, C_i(t_j)}{\sum_{k=1}^{n} C_k(t_j)} + \sum_{i=1}^{n} \sum_{j=1}^{n} W_j\, \frac{\bar{X}(t_j)\, C_i(t_j)}{\sum_{k=1}^{n} C_k(t_j)}$$
$$= \sum_{i=1}^{n} \left[ W_i \{X_i(t_i) - \bar{X}(t_i)\} - \sum_{j=1}^{n} W_j\, \{X_i(t_j) - \bar{X}(t_j)\}\, \frac{C_i(t_j)}{\sum_{k=1}^{n} C_k(t_j)} \right].$$

Let Ũ be the n × p matrix whose ith row is the ith subject's contribution to Ũ(β̂), i = 1, ..., n. Then we have

$$\tilde{U}_i = W_i \{X_i(t_i) - \bar{X}(t_i)\} - \sum_{j=1}^{n} W_j\, \{X_i(t_j) - \bar{X}(t_j)\}\, \frac{C_i(t_j)}{\sum_{k=1}^{n} C_k(t_j)}.$$
The final sandwich estimator for producing robust variance estimates is given by V = J^{-1} V_Ũ J^{-1} = (J^{-1} Ũ')(Ũ J^{-1}). Using the equations detailed in this section, in the examples that follow we will need to compute W_i, B_i, C_j(t_i) and Σ_{j=1}^{n} C_j(t_i), as well as X̄(t_i). We will use these quantities in estimating parameters from the data set presented in the next section.

Worked implementation of fitting a MSCM to data

The first six columns of Table 7 present a data set that contains eight subjects observed over one to six time intervals, comprising 23 observations. There are two covariates: x1 is a binary baseline variable, while x2 is binary but time-dependent. The column 'wt' shows the time-dependent weight associated with each subject at each visit. For convenience, subjects are ordered based on their respective failure times. The remaining five columns detail much of the preliminary calculations needed for parameter estimation, used implicitly in future calculations.
Table 7 Example of survival data with time-dependent weights. Quantities for parameter estimation are shown in the last five columns (B_i, Σ_j C_j, X̄_1 and X̄_2 only at the four event rows); here r_1 = exp(β_1) and r_2 = exp(β_2)

ID  Time   Event  x1  x2  wt    B_i       C_j     Σ_j C_j   X̄_1          X̄_2
1   (0,1]  1      1   0   3     β_1       3r1     d1        (d1−11)/d1   11r1r2/d1
2   (0,1]  0      1   1   5               5r1r2
3   (0,1]  0      1   1   6               6r1r2
3   (1,2]  1      1   1   8     β_1+β_2   8r1r2   d3        (d3−12)/d3   8r1r2/d3
4   (0,1]  0      0   0   2               2
4   (1,2]  0      0   1   2               2r2
4   (2,3]  0      0   1   4               4r2
5   (0,1]  0      0   0   2               2
5   (1,2]  0      0   0   2               2
5   (2,3]  0      0   1   2               2r2
5   (3,4]  1      0   1   4     β_2       4r2     d5        3r1/d5       12r2/d5
6   (0,1]  0      0   0   4               4
6   (1,2]  0      0   0   5               5
6   (2,4]  0      0   1   8               8r2
6   (4,5]  1      0   1   8     β_2       8r2     d6        4r1/d6       8r2/d6
7   (0,1]  0      1   0   2               2r1
7   (1,3]  0      1   0   2               2r1
7   (3,4]  0      1   0   3               3r1
7   (4,5]  0      1   0   4               4r1
8   (0,1]  0      0   0   3               3
8   (1,3]  0      0   0   3               3
8   (3,4]  0      0   0   6               6
8   (4,6]  0      0   0   6               6
Since there are four failures in the data, there are four terms in the log-likelihood. Let d1 = 11r1r2 + 5r1 + 11, d3 = 8r1r2 + 2r1 + 12, d5 = 3r1 + 12r2 + 6, and d6 = 4r1 + 8r2 + 6. The corresponding log-likelihood, score vector and Fisher information matrix are as follows:

$$LL = 3(\beta_1 - \ln d_1) + 8(\beta_1 + \beta_2 - \ln d_3) + 4(\beta_2 - \ln d_5) + 8(\beta_2 - \ln d_6) = 11\beta_1 + 20\beta_2 - 3\ln d_1 - 8\ln d_3 - 4\ln d_5 - 8\ln d_6$$

$$U_1 = 3\left[1 - \bar{X}_1(1)\right] + 8\left[1 - \bar{X}_1(2)\right] + 4\left[0 - \bar{X}_1(4)\right] + 8\left[0 - \bar{X}_1(5)\right]$$

$$U_2 = 3\left[0 - \bar{X}_2(1)\right] + 8\left[1 - \bar{X}_2(2)\right] + 4\left[1 - \bar{X}_2(4)\right] + 8\left[1 - \bar{X}_2(5)\right]$$

$$J_{11} = 3\left[\bar{X}_1(1) - \bar{X}_1^2(1)\right] + 8\left[\bar{X}_1(2) - \bar{X}_1^2(2)\right] + 4\left[\bar{X}_1(4) - \bar{X}_1^2(4)\right] + 8\left[\bar{X}_1(5) - \bar{X}_1^2(5)\right]$$
Table 8 Intermediate calculations for updating parameter estimates

β     i   t_i   W_i   d_i          X̄_1        X̄_2
0     1   1     3     27           2/3        11/27
      3   2     8     22           10/22      20/22
      5   4     4     21           1/7        12/21
      6   5     8     18           4/18       8/18
β̂     1   1     3     180.56757    0.939081   0.890089
      3   2     8     132.42668    0.909384   0.882663
      5   4     4     110.40566    0.048076   0.897579
      6   5     8     79.14234     0.089423   0.834764
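As a quick check that the algorithm has converged, plugging the β = β̂ rows of Table 8 into the expression for U_1 gives 3(1 − 0.939081) + 8(1 − 0.909384) − 4(0.048076) − 8(0.089423) ≈ 0, as required of the score at the maximum.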
Table 9 Update statistics

        β = 0       β = β̂
LL      −69.91691   −59.96284
U_1     3.01443     0.0
U_2     9.30014     0.0
J_11    4.52260     1.66533
J_22    5.66265     2.59323
J_12    1.27422     0.03275
$$J_{22} = 3\left[\bar{X}_2(1) - \bar{X}_2^2(1)\right] + 8\left[\bar{X}_2(2) - \bar{X}_2^2(2)\right] + 4\left[\bar{X}_2(4) - \bar{X}_2^2(4)\right] + 8\left[\bar{X}_2(5) - \bar{X}_2^2(5)\right]$$

$$J_{12} = 3\,\bar{X}_2(1)\left[1 - \bar{X}_1(1)\right] + 8\,\bar{X}_2(2)\left[1 - \bar{X}_1(2)\right] - 4\,\bar{X}_1(4)\,\bar{X}_2(4) - 8\,\bar{X}_1(5)\,\bar{X}_2(5)$$

Setting U = 0 and solving for β, we find that β̂_1 = 0.5705749 and β̂_2 = 2.1112007. Before computing LL, U and J, we perform preliminary calculations for the four observed failure times in Table 8. The values of these statistics at the initial and final parameter estimates are provided in Table 9. Table 10 contains the score residuals for each subject. Finally, we can compute the variance–covariance matrix for the parameters using

$$V = J^{-1}\, \tilde{U}'\tilde{U}\, J^{-1} = \begin{bmatrix} 0.082976 & -0.0106 \\ -0.0106 & 1.320177 \end{bmatrix},$$

giving the variances of β̂_1 and β̂_2 as 0.082976 and 1.320177 (standard errors 0.288 and 1.149), respectively.

If the data were analyzed in R and stored in a coxph object called fit, then the quantities evaluated in this section could be compared to the corresponding coxph output as follows:
Table 10 Score residuals

id    Ũ_1         Ũ_2
1     0.177385    −2.591770
2     −0.073941   −0.133406
3     −0.003668   −0.109349
4     0.141077    0.136221
5     −0.051226   0.423330
6     0.232819    0.299038
7     0.200866    1.010800
8     0.276302    0.905734
X̄ = coxph.detail(fit)$means
LL = fit$loglik
U = sum(coxph.detail(fit)$score)
J^{-1} = fit$var
(β̂_1, β̂_2) = summary(fit)$coeff
Ũ = residuals(fit, collapse = mydata$id, weighted = T, type = "score")
D = Ũ J^{-1} = residuals(fit, collapse = mydata$id, weighted = T, type = "dfbeta")
D = J −1 U˜ = residuals(fit, collapse = mydata$id, weighted = T, type = “dfbeta”) The MLE calculated here matches that of coxph, as do the standard errors. Hence, this appendix demonstrates that the R code for computing the standard errors in coxph can accommodate time-dependent weights.
References

Andersen PK, Gill RD (1982) Cox's regression model for counting processes: a large sample study. Ann Stat 10(4):1100–1120
Binder DA (1983) On the variances of asymptotically normal estimators from complex surveys. Int Stat Rev 51(3):279–292
Binder DA (1992) Fitting Cox's proportional hazards models from survey data. Biometrika 79(1):139–147
Breslow N (1974) Covariance analysis of censored survival data. Biometrics 30(1):89–99
Burr D (1994) A comparison of certain bootstrap confidence intervals in the Cox model. J Am Stat Assoc 89(428):1290–1302
Cheng G, Huang J (2010) Bootstrap consistency for general semiparametric M-estimation. Ann Stat 38(5):2884–2915. doi:10.1214/10-AOS809
Cook NR, Cole SR, Hennekens CH (2002) Use of a marginal structural model to determine the effect of aspirin on cardiovascular mortality in the Physicians' Health Study. Am J Epidemiol 155(11):1045–1053. doi:10.1093/aje/155.11.1045
D'Agostino RB, Lee M, Belanger AJ, Cupples A (1990) Relation of pooled logistic regression to time dependent Cox regression analysis: the Framingham Heart Study. Stat Med 9(12):1501–1515
Hernan MA, Brumback B, Robins JM (2000) Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 11(5):561–570
Hogan J, Lee J (2004) Marginal structural quantile models for longitudinal observational studies with time-varying treatment. Stat Sin 14:927–944
Insightful (2001) S-PLUS 8: guide to statistics, vol 2. Insightful Corporation, Seattle
Kline P, Santos A (2011) A score based approach to wild bootstrap inference. Tech. Rep. NBER-TWP-16127, The National Bureau of Economic Research
Lin DY, Wei LJ (1989) The robust inference for the Cox proportional hazards model. J Am Stat Assoc 84(408):1074–1078
Petersen M, Deeks S, Martin J, van der Laan M (2007) History-adjusted marginal structural models for estimating time-varying effect modification. Am J Epidemiol 166(9):985–993
R Development Core Team (2010) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org
Robins J (1992) Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika 79(2):321–334. doi:10.1093/biomet/79.2.321
Robins J (1998) Marginal structural models. In: 1997 Proceedings of the section on Bayesian statistical science. American Statistical Association, Alexandria, pp 1–10
Robins JM, Hernan MA, Brumback B (2000) Marginal structural models and causal inference in epidemiology. Epidemiology 11(5):550–560
Robins J, Hernan M, Siebert U (2005) Effects of multiple interventions. In: Ezzati M, Lopez A, Rodgers A, Murray C (eds) Comparative quantification of health risks: global and regional burden of diseases attributable to selected major risks, chap 28. World Health Organization, Geneva, p 2207
Therneau TM, Grambsch PM (2001) Modeling survival data: extending the Cox model. Springer, New York
Wu CFJ (1986) Jackknife, bootstrap and other resampling methods in regression analysis (with discussion). Ann Stat 14(4):1261–1350
Xiao Y, Abrahamowicz M, Moodie E (2010) Accuracy of conventional and marginal structural Cox model estimators: a simulation study. Int J Biostat 6(2):1–28
Young J, Hernan M, Picciotto S, Robins J (2008) Simulation from structural survival models under complex time-varying data structures. In: JSM proceedings, section on statistics in epidemiology. American Statistical Association, Denver
Young J, Hernan M, Picciotto S, Robins J (2009) Relation between three classes of structural models for the effect of a time-varying exposure on survival. Lifetime Data Anal 16(1):71–84