Simulating survival data with predefined censoring rates for proportional hazards models

Fei Wan

Department of Biostatistics, University of Arkansas for Medical Sciences, Little Rock, AR, U.S.A.
[email protected]
Abstract The proportional hazard model is one of the most important statistical models used in medical research involving time-to-event data. Simulation studies are routinely used to evaluate the performance and properties of the model and other alternative statistical models for time-to-event outcomes under a variety of situations. Complex simulations that examine multiple situations with different censoring rates demand approaches that can accommodate this variety. In this paper, we propose a general framework for simulating right-censored survival data for proportional hazards models by simultaneously incorporating a baseline hazard function from a known survival distribution, a known censoring time distribution, and a set of baseline covariates. Specifically, we present scenarios in which time to event is generated from exponential or Weibull distributions, and censoring time has a uniform or Weibull distribution. The proposed framework incorporates any combination of covariate distributions. We describe the steps involved in nested numerical integration and using a root-finding algorithm to choose the censoring parameter that achieves predefined censoring rates in simulated survival data. We conducted simulation studies to assess the performance of the proposed framework. We demonstrated the application of the new framework in a comprehensively designed simulation study. We investigated the effect of censoring rate on potential bias in estimating the conditional treatment effect using the proportional hazard model in the presence of unmeasured confounding variables.
January 26, 2018
1 Introduction
Clinical researchers often work with survival data that measure the time until an event occurs. Examples of these events include death, tumor recurrence, stroke onset, hospital discharge, and transition above or below a clinical threshold such as HIV viral load. The distinguishing feature of survival data is that often, the precise time to event is observed for only part of the sample. For some study participants, the exact time when the event of interest occurred is available, whereas for others, all we know is that the event did not occur in a certain period of time. For example, study participants may still be alive at the end of the study period, they may have experienced a competing event that made further follow-up impossible, or they may have been lost to follow-up during the study. These incomplete observations in survival data are often referred to as censored observations. The most common censoring mechanism in medical research is right censoring, in which the event of interest has not occurred by the end of the study and the time to event is known only to be greater than the censoring time. The distribution of the time to an event can be described by either a density function of a parametric distribution or a hazard function. Because of censoring, modeling the time to an event through the hazard function is more convenient [1]. The Cox proportional hazards model is by far the most popular method for modeling censored survival data. The Cox model is parametrized in terms of the hazard function and quantifies the relationship between risk factors and participants' event times. The properties of proportional hazards models and other novel survival methods are routinely evaluated with simulation studies using data from a variety of situations, particularly data subject to different censoring rates. For example, Hsieh et al. [2] and Vaeth et al.
[3] investigated sample size and power for testing the null hypothesis of no association in a proportional hazards model under different censoring proportions. They concluded that censored observations do not contribute to the power of test statistics. Jiang et al. [4] compared the performance of covariate-adjusted nonparametric logrank and Wilcoxon tests with that of the Cox model in terms of power and type I error rates for treatment group comparisons, both under the proportional hazards assumption and when the assumption is violated. The three competing methods were also evaluated via simulation under different censoring rates. Censoring rate had no significant impact on the type I error rates of these methods, but higher power was associated
with lower censoring rates. Kong et al. [5] and Hu et al. [6] assessed the consistency of different estimators of the regression parameters in a Cox proportional hazards model when some covariates are measured with error. Censoring rate had minimal effect on biases [5, 6]. Lin et al. [7] analyzed the bias of estimating the conditional treatment effect using the Cox proportional hazards model when some confounding variables are not included in the model. The study found that as censoring becomes heavier, the potential bias actually decreases. Lin et al. [9] showed that biases in estimating treatment effects using a Cox model depend on censoring rates even when some independent covariates are omitted from the model. Based on the simulations from these previous studies, two independent survival distributions are required when simulating randomly right-censored survival data for proportional hazards models: the first for the survival time T and the second for the censoring mechanism C. Previous studies present a very flexible framework for generating time-to-event data from known distributions for proportional hazards models by inverting the cumulative baseline hazard function [10, 11]. The hazard function is first specified with a baseline hazard function h_0(t) and a relative risk component exp(β^T X). Because the survival function S(t) = exp(−H_0(t) exp(β^T X)) maps a time in [0, ∞) to a probability between 0 and 1, we can generate random numbers from a standard uniform distribution on [0, 1] and then invert S(t) to transform the uniformly distributed random numbers into event times. The baseline hazard function h_0(t) can come from exponential, Weibull, or more complex distributions.
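The inversion step described above can be sketched in code. The function below is an illustration (names and parameter values are hypothetical, not the paper's R implementation) assuming a Weibull(α, ν) cumulative baseline hazard H_0(t) = (t/ν)^α, so that solving S(T) = U for T gives the draw directly:

```python
import math
import random

def simulate_event_time(beta_x, alpha=1.5, nu=2.0, rng=random):
    """Draw one event time by inverting S(t) = exp(-(t/nu)^alpha * exp(beta_x)).

    beta_x is the linear predictor beta'X for one subject; alpha and nu are
    the Weibull shape and scale of the baseline hazard (illustrative values)."""
    u = rng.random()  # U ~ Uniform(0, 1); setting S(T) = U defines T
    return nu * (-math.log(u) * math.exp(-beta_x)) ** (1.0 / alpha)

random.seed(1)
times = [simulate_event_time(beta_x=0.5) for _ in range(50000)]
```

Under this parametrization the theoretical median event time is ν(ln 2 · e^{−βx})^{1/α}, which a large simulated sample should reproduce closely.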
The covariates X in a Cox model are often generated as binary variables [12], normally distributed continuous variables [2, 6], or more realistically, a mixture of discrete and continuous covariates [4]. In previous studies, the common approach to simulating right-censored data was to generate noninformative censoring times randomly from a particular distribution such as a Uniform(0, θ) [7, 13, 14], Exp(θ) [4, 15], or Weibull(α, θ) distribution [16, 17]. The value of the censoring parameter θ is often selected to achieve the desired nominal censoring proportion in the simulated data. Recent studies on simulating survival data have focused on algorithms for simulating the event time [10, 11, 18]. Few studies elucidate in detail whether the value of the censoring parameter is determined by numerical methods, a closed form formula, or manual tuning. There is a general paucity of literature investigating methodologies for generating censored data with desirable censoring proportions. In this paper, we propose a general framework for simulating right-censored survival times with predefined censoring rates by integrating a given censoring time distribution into the survival data generation process. In section (2), we derive expressions for the individual censoring probability and the censoring proportion in a study population, assuming the baseline hazard function for the time to event is from a flexible Weibull distribution. We discuss scenarios in which censoring times are generated from uniform or Weibull distributions and covariates are binary, normally distributed continuous variables, or any arbitrary combination of discrete and continuous variables. We present numerical
techniques combining a root-finding algorithm with nested numerical integration to solve for censoring parameters that yield a predefined censoring proportion. In section (3.2), we conduct a comprehensively designed simulation study, using the procedures described in section (2), to assess how potential factors such as censoring rate, size of treatment effect, and the magnitudes of the effects of unmeasured confounders on treatment and outcome impact the bias of estimating the treatment effect using the Cox proportional hazards model in the presence of unmeasured confounding. We discuss implications and applications of our framework in section (4).
2 Assumptions and Methods

2.1 A general framework
In this section, we introduce a general framework for simulating survival data with predefined censoring proportions for proportional hazards models. We let T denote the event time. We assume the hazard function for the ith subject is given by the following multiplicative model:

h(t|X_i) = h_0(t) exp(β^T X_i)    (1)

where h_0(t) is the baseline hazard function, X = <X_1, X_2, ..., X_m> is an m-dimensional covariate vector, and β = <β_1, β_2, ..., β_m> is the corresponding m-dimensional coefficient vector. The component exp(β^T X) characterizes how covariates may influence the hazard function. The distribution function of the event time defined by model (1) is:

F(t|X_i) = 1 − exp(−H_0(t) exp(β^T X_i))    (2)
where the cumulative baseline hazard function is H_0(t) = ∫_0^t h_0(s) ds. Next, we let C denote the censoring time with density function g(c|θ), where θ is a censoring parameter. The censoring time C is assumed to be independent of the event time T. Common choices for the censoring distribution include the uniform distribution Uniform(0, θ) and the Weibull distribution Weibull(k, θ). We let Y = min(T, C) be the observed follow-up time and δ = I(T ≥ C) be the censoring indicator, with δ = 1 if the ith subject is censored and 0 otherwise. We propose a general framework to simulate censored survival data with a predefined censoring proportion in three steps: (1) derive the censoring probability for each subject conditional on subject-specific covariates; (2) derive a censoring rate function for the study population by marginalizing the covariates out of the individual censoring probability function; and (3) solve the censoring rate function for the censoring parameter for a given censoring proportion.
2.1.1 Derivation of the individual censoring probability P(δ = 1|X_i, θ)
The probability of being censored for the ith subject with given covariates X_i in the study population is derived as:

P(δ = 1|X_i, θ) = P(C ≤ T < ∞, 0 ≤ C < ∞)
               = ∫_0^∞ g(c|θ) ∫_c^∞ f(t|X_i) dt dc
               = ∫_0^∞ g(c|θ) exp(−H_0(c) exp(β^T X_i)) dc    (3)
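The integral in equation (3) can be evaluated numerically. The sketch below (an illustration, not the paper's R code) uses composite Simpson's rule with a Weibull(α, ν) baseline hazard, H_0(c) = (c/ν)^α, and Uniform(0, θ) censoring; for α = 1 the integral has the closed form (1 − e^{−rθ})/(rθ) with r = e^{βx}/ν, which the quadrature should reproduce:

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b]; n must be even."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

def censor_prob(beta_x, theta, alpha=1.0, nu=2.0):
    """Equation (3): P(delta = 1 | X, theta) with H_0(c) = (c/nu)^alpha
    and Uniform(0, theta) censoring, g(c|theta) = 1/theta on (0, theta)."""
    integrand = lambda c: math.exp(-((c / nu) ** alpha) * math.exp(beta_x)) / theta
    return simpson(integrand, 0.0, theta)

p = censor_prob(beta_x=0.0, theta=3.0)  # exponential case with rate 1/2
```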
The individual censoring probability P(·) defined by equation (3) is a function of the covariates X and the censoring parameter θ. To derive the censoring rate in the entire study population, the subject-specific covariates X need to be integrated out.

2.1.2 Derivation of the censoring rate P(δ = 1|θ)
Taking the expectation of the individual-level censoring probability P(δ = 1|X_i, θ) with respect to the covariates X_i, the censoring rate in the study population is:

P(δ = 1|θ) = E_{X_i}[P(δ = 1|X_i, θ)] = ∫_D P(δ = 1|x, θ) f_X(x) dx    (4)
where f_X(·) is the joint density function of the m-dimensional covariates X and D is the domain of X. The censoring rate in the study population is a function of only the censoring parameter θ. However, instead of working with the density function f_X(·) of the high-dimensional covariates X, it is simpler to work with the single variable λ_i = exp(β^T X_i). The density function f_{λ_i}(u) of λ_i can be derived analytically or estimated numerically with nonparametric smoothing methods, depending on the distribution functions specified for the covariates. We present three scenarios in which the density function of λ_i can be derived or estimated. We first present a general situation in which an arbitrary number of covariates can have arbitrary distributions. We then discuss two special scenarios in which closed form density functions can be derived analytically for λ_i: when the covariates X are from i.i.d. standard normal distributions, and when they are from Bernoulli distributions with different event probabilities.

(I) Covariates from different distributions. In more realistic scenarios, the covariates X_j can be generated from different distributions such as normal, Bernoulli, or Poisson. However, it is difficult to derive a closed form expression for the probability density function of the weighted sum of these variables, denoted by Σ_{j=1}^{n} (β_j/α) X_j, and thus for the density function f_{λ_i}(u) of λ_i. Therefore, we use a kernel smoothing method to estimate the density function f_{λ_i}(u) nonparametrically. A kernel is a symmetric, non-negative, real-valued function whose integral over its domain equals 1. A kernel density estimator can be constructed in the following three steps:
(1) We first need to select a kernel function K(·). A common choice is the Gaussian kernel, which is implemented as the default method in the density function in R:

K(u) = (1/√(2π)) exp(−u²/2)

(2) We then construct the scaled kernel function at each data point u_i, where i = 1, 2, 3, ..., M:

h^{−1} K((u − u_i)/h)

Recall that u_i = exp(−β_0/α − Σ_{j=1}^{n} (β_j/α) X_{i,j}). Here h is a smoothing parameter called the "bandwidth." A large value of h results in larger bias and smaller variance; a smaller h results in smaller bias and larger variance. A recommended rule of thumb for bandwidth selection [21] is

ĥ = 1.06 min(σ̂, IQR/1.34) n^{−1/5}

where IQR is the interquartile range and σ̂ is the sample standard deviation.
(3) To estimate the density at a new point u, we sum all the individual scaled kernel functions and divide by the total sample size M. This step guarantees that the kernel density estimate integrates to 1 over its domain:

f̂_{λ_i}(u) = (1/(Mh)) Σ_{i=1}^{M} K((u − u_i)/h)    (5)

where u_i = exp(−β_0/α − Σ_{j=1}^{n} (β_j/α) X_{i,j}).
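The three steps above can be sketched as a small self-contained estimator (a simplified Python stand-in for R's density function; function names and the test sample are illustrative):

```python
import math
import random
import statistics

def kde_gaussian(data):
    """Gaussian kernel density estimator with the rule-of-thumb bandwidth
    h = 1.06 * min(sd, IQR/1.34) * n^(-1/5) from step (2)."""
    n = len(data)
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles; q3 - q1 is the IQR
    h = 1.06 * min(sd, (q3 - q1) / 1.34) * n ** (-0.2)

    def f_hat(u):
        # step (3): average the scaled kernels centered at each data point
        k = lambda z: math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
        return sum(k((u - ui) / h) for ui in data) / (n * h)

    return f_hat

random.seed(2)
sample = [random.gauss(0.0, 1.0) for _ in range(5000)]
f_hat = kde_gaussian(sample)
```

Applied to a sample of λ_i values, f_hat approximates f_{λ_i}(u); as a sanity check, on standard normal draws it should return roughly 1/√(2π) ≈ 0.399 at u = 0.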
The kernel estimator f̂_{λ_i}(u) is a biased estimator of the density f_{λ_i}(u); a smaller bandwidth h implies a smaller bias but a larger variance.

(II) Normally distributed covariates. In simulation studies, the covariates X are often assumed to follow the standard normal distribution N(0, 1). In that case, β_0/α + Σ_{j=1}^{n} (β_j/α) X_j ∼ N(β_0/α, Σ_{j=1}^{n} β_j²/α²), and the scale parameter λ_i = exp(−β_0/α − Σ_{j=1}^{n} (β_j/α) X_j) for the ith subject follows the distribution:

λ_i ∼ lnN(−β_0/α, Σ_{j=1}^{n} β_j²/α²)    (6)
where lnN(µ, σ²) is a log-normal distribution, i.e., ln(λ_i) ∼ N(−β_0/α, Σ_{j=1}^{n} β_j²/α²). The density function of λ_i is given by:
f_{λ_i}(u) = (1/(u √(2π Σ_{j=1}^{n} β_j²/α²))) exp(−(ln(u) + β_0/α)² / (2 Σ_{j=1}^{n} β_j²/α²))

(III) Binary covariates. We assume all n covariates X_j follow Bernoulli distributions B(p_j). Since each X_j takes the value 0 or 1, the scale parameter λ_i = exp(−β_0/α − Σ_{j=1}^{n} (β_j/α) X_j), as the exponential of a weighted sum of n independent Bernoulli random variables plus a constant term, can take one of 2^n possible values. Using notation from Khalid et al. [20], the probability mass function of λ_i can be expressed as:

f_{λ_i}(u_l) = Π_{j=1}^{n} p_j^{l^{(j)}_{(n,2)}} (1 − p_j)^{1 − l^{(j)}_{(n,2)}}    (7)

where u_l = exp(−β_0/α − Σ_{j=1}^{n} β_j l^{(j)}_{(n,2)}/α) for l ∈ {1, 2, ..., 2^n}, and l^{(j)}_{(n,2)} is the jth element of the n × 1 vector giving the lth realization of the binary variables <X_1, X_2, ..., X_n>. For example, when l = 1, 1_{(n,2)} = <0, 0, ..., 0>, so 1^{(1)}_{(n,2)} = 0 and 1^{(2)}_{(n,2)} = 0; when l = 2, 2_{(n,2)} = <1, 0, ..., 0>, so 2^{(1)}_{(n,2)} = 1 and 2^{(2)}_{(n,2)} = 0 (details in Appendix A.1).
The closed form density functions can be more time-efficient alternatives for complex simulation studies involving a large number of scenarios. When we let D′ denote the domain of λ_i, equation (4) can be re-expressed as

P(δ = 1|θ) = E_{λ_i}[P(δ = 1|λ_i, θ)] = ∫_{D′} P(δ = 1|u, θ) f_{λ_i}(u) du    (8)
Equation (8) may not have a closed form expression. If the conditional censoring probability P(δ = 1|λ_i, θ) has a closed form expression, equation (8) is a single integral and can be evaluated numerically. An efficient and accurate numerical integration method is Gaussian quadrature, which approximates an integral by a weighted sum of the function ϕ(·) evaluated at a set of m points u_i called "nodes":

∫_a^b ϕ(u) du ≈ Σ_{i=1}^{m} w_i ϕ(u_i)

where the w_i are weights. If the interval [a, b] is finite, it must first be mapped onto [−1, 1] before applying Gaussian quadrature:

∫_a^b ϕ(u) du = ((b − a)/2) ∫_{−1}^{1} ϕ(((b − a)/2) u + ((b + a)/2)) du
             ≈ ((b − a)/2) Σ_{i=1}^{m} w_i ϕ(((b − a)/2) u_i + ((b + a)/2))

If the integral is improper (i.e., a = 0, b = +∞), we can either find a transformation for a change of interval or apply the Gauss–Laguerre formula:

∫_0^{+∞} ϕ(u) du = ∫_0^{+∞} e^{−u} e^{u} ϕ(u) du ≈ Σ_{i=1}^{m} w_i e^{u_i} ϕ(u_i)
If P(δ = 1|λ_i, θ) has no closed form expression, equation (8) is a double integral with no explicit expression. Nonetheless, it can still be evaluated numerically. The simplest strategy for numerically evaluating a double integral is to treat it as a sequence of one-dimensional integrals. First, we evaluate the individual censoring probability P(δ = 1|λ_i, θ) over the interval [0, ∞) using Gaussian quadrature for a given λ_i and censoring parameter θ. Next, we take the expectation of P(δ = 1|λ_i, θ) with respect to λ_i to compute P(δ = 1|θ), applying a second numerical integration with Gaussian quadrature over the domain D′ of λ_i.
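This nested strategy can be illustrated as follows (a sketch that substitutes Simpson's rule for Gaussian quadrature and a plain average over λ_i draws for the outer integral; parameter values are hypothetical). The inner function evaluates P(δ = 1|λ_i, θ) for the Weibull(k, θ) censoring case, and the outer function averages it:

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule; a simple stand-in for Gaussian quadrature."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

def inner_censor_prob(lam, theta, alpha, k):
    """Inner integral: P(delta = 1 | lambda_i, theta) with Weibull(k, theta)
    censoring and Weibull(alpha, lambda_i) event times."""
    g = lambda c: (k / theta ** k) * c ** (k - 1) * math.exp(-(c / theta) ** k)
    integrand = lambda c: g(c) * math.exp(-(c / lam) ** alpha)
    upper = 10.0 * max(theta, lam)  # truncate where both survival tails vanish
    return simpson(integrand, 1e-9, upper)

def censoring_rate(lams, theta, alpha, k):
    """Outer step: expectation over lambda_i, here a plain average over draws
    of lambda_i standing in for integration against f_lambda(u)."""
    return sum(inner_censor_prob(l, theta, alpha, k) for l in lams) / len(lams)
```

When k = α the inner integral collapses to the closed form 1/(1 + (θ/λ_i)^α), which provides a convenient correctness check.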
2.1.3 Numerical solution of the censoring parameter θ for a predefined censoring rate
Let p denote a given censoring proportion for the data to be simulated. To derive a value of θ in the censoring distribution g(c|θ) that yields censoring proportion p, we first set up the function γ(θ|p):

γ(θ|p) = P(δ = 1|θ) − p = ∫_{D′} P(δ = 1|u, θ) f_{λ_i}(u) du − p    (9)
For any combination of individual censoring probability function P(δ = 1|λ_i, θ) and density function f_{λ_i}(u) determined by a set of covariates X, a known censoring distribution g(c|θ), and a baseline hazard function from a known survival distribution f(t|X_i), we can solve γ(θ|p) = 0 for the censoring parameter θ. However, the censoring parameter θ cannot be solved for explicitly in most scenarios. The Brent–Dekker algorithm, implemented in the uniroot function in R, is a popular hybrid root-finding method that combines the bisection method, the secant method, and inverse quadratic interpolation.
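The root-finding step can be sketched as follows (simple bisection standing in for Brent–Dekker, with the closed form 1/(1 + (θ/λ_i)^α) for C ∼ Weibull(α, θ); the λ_i sample and target rate are illustrative):

```python
import math
import random

def censoring_rate(theta, lams, alpha):
    """P(delta = 1 | theta) averaged over draws of lambda_i, using the closed
    form 1/(1 + (theta/lambda_i)^alpha) for Weibull(alpha, theta) censoring."""
    return sum(1.0 / (1.0 + (theta / l) ** alpha) for l in lams) / len(lams)

def solve_theta(p, lams, alpha, lo=1e-6, hi=1e6, tol=1e-10):
    """Bisection for gamma(theta | p) = 0; the rate decreases in theta,
    so [lo, hi] brackets the root for any p in (0, 1)."""
    g = lambda th: censoring_rate(th, lams, alpha) - p
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid  # root is in [lo, mid]
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

random.seed(3)
lams = [math.exp(-0.5 * random.gauss(0.0, 1.0)) for _ in range(2000)]  # log-normal lambda_i
theta = solve_theta(p=0.3, lams=lams, alpha=1.2)
```

Plugging the solved θ back into the rate function should recover the target proportion essentially exactly.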
2.2 Application: Baseline hazard function from a Weibull distribution
We assume the baseline hazard h_0(t) = αν^{−α} t^{α−1} from Weibull(α, ν), with the following density function:

f(t) = (α/ν^α) t^{α−1} exp(−(t/ν)^α)

The proportional hazards model incorporating the covariates X can be specified as:

h(t) = αν^{−α} t^{α−1} exp(Xβ) = α (ν exp(−Xβ/α))^{−α} t^{α−1}

Thus, the density function of the survival time T for the ith subject given covariate vector X_i follows:

f(t|X_i) = (α/{ν exp(−X_i β/α)}^α) t^{α−1} exp(−(t/{ν exp(−X_i β/α)})^α)

If we let β* = <β_0, β> be an (m + 1)-dimensional vector whose first element, β_0 = log(ν^{−α}), is the coefficient of the intercept term, and let X* = <1, X>, the conditional density function f(t|X) can be re-expressed equivalently as

f(t|λ_i) = (α/λ_i^α) t^{α−1} exp(−(t/λ_i)^α)    (10)

where the scale parameter λ_i = exp(−β*^T X*_i/α). Thus, T ∼ Weibull(α, λ_i). Instead of working with the covariate vector X directly, it is more convenient to work with λ_i. Approaches for deriving the density function of λ_i are outlined in section (2.1.2).

2.2.1 Individual censoring probability conditional on λ_i
With the baseline hazard from a Weibull distribution, the probability of being censored for the ith subject defined by equation (3) is given by

P(δ = 1|λ_i, α, θ) = ∫_0^∞ g(c|θ) exp(−(c/λ_i)^α) dc

which is a function of λ_i and θ. The censoring time C is commonly assumed to follow a Weibull or uniform distribution. Some examples of individual censoring probabilities are:

(I) C ∼ Weibull(α, θ). The distribution of the censoring time C has the same shape parameter α as the distribution of the survival time T. Using previous results [19], the individual censoring probability for the ith subject can be expressed as:

P(δ = 1|λ_i, α, θ) = ∫_0^∞ (α/θ^α) c^{α−1} exp(−(c/θ)^α) exp(−(c/λ_i)^α) dc = 1/(1 + (θ/λ_i)^α)    (11)
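The closed form in (11) can be checked numerically (an illustrative sketch using Simpson's rule; the chosen values of α, λ_i, and θ are arbitrary):

```python
import math

def simpson(f, a, b, n=4000):
    """Composite Simpson's rule on [a, b]."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

def censor_prob_numeric(lam, theta, alpha):
    """Left-hand side of (11) evaluated by quadrature, with the improper
    integral truncated where both Weibull survival tails are negligible."""
    f = lambda c: (alpha / theta ** alpha) * c ** (alpha - 1) \
        * math.exp(-(c / theta) ** alpha) * math.exp(-(c / lam) ** alpha)
    return simpson(f, 1e-9, 10.0 * max(theta, lam))

def censor_prob_closed(lam, theta, alpha):
    """Right-hand side of (11)."""
    return 1.0 / (1.0 + (theta / lam) ** alpha)
```

For example, with α = 2, λ_i = 1.5, and θ = 2, the closed form gives 1/(1 + (4/3)²) = 9/25 = 0.36, and the quadrature agrees to several decimal places.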
(II) C ∼ Weibull(k, θ). To accommodate more flexible censoring distributions, we can relax the constraint on the shape parameter of the censoring distribution by letting it differ from the shape parameter of the time-to-event distribution. The probability of being censored for the ith subject is:

P(δ = 1|λ_i, α, θ) = ∫_0^∞ (k/θ^k) c^{k−1} exp(−(c/θ)^k) exp(−(c/λ_i)^α) dc    (12)

which does not have a closed form expression.
(III) C ∼ Uniform(0, θ). The uniform distribution is often used to describe the censoring time. In this case, censoring is assumed to occur with equal likelihood over a time interval. The density function of the censoring time C is:

g(c|θ) = 1/θ, 0 < c ≤ θ

For each simulated data set, the censoring proportion was estimated as P_j = Σ_{i=1}^{10000} I(T_i > C_i)/10000 for j = 1, 2, ..., 1000. These estimated censoring proportions were averaged over the 1000 data sets as Σ_{j=1}^{1000} P_j/1000 to estimate the true censoring proportion. Simulation results are in Figure (5). The actual censoring proportions computed from the simulations were very close to the nominal censoring proportions for all 9 scenarios: shape parameter α = {0.5, 1, 2} and censoring distribution Weibull(α, θ), Weibull(k = 0.8, θ), or Uniform(0, θ). Next, we compared the performance of the algorithms using the closed form formula or the nonparametric kernel density method for the density function of λ_i, computing the censoring parameter for a 50% censoring rate. For the 9 scenarios outlined above, we simulated 500 data sets (n = 10000) with all normally distributed or all Bernoulli variables. Both the closed form formula and the kernel density estimation approach were used to evaluate the simulated data. The system times (in seconds) used to complete the computation for each data set, T_i, were recorded and averaged as T̄ = Σ_{i=1}^{500} T_i/500. The results are in Table 1. For all scenarios, results from the algorithm incorporating the kernel density estimation approach were very similar to results from the algorithm incorporating the closed form density function, but the former approach took much longer to complete. Thus, for complex simulation studies [23], the algorithm with the closed form density function is the more time-efficient option.
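The validation loop described above can be sketched end to end (a simplified, pure-Python stand-in for the paper's R implementation; the seed, sample size, and parameter values are illustrative):

```python
import math
import random

def solve_theta(p, lams, alpha, lo=1e-8, hi=1e8):
    """Find theta so that the mean of 1/(1 + (theta/lambda_i)^alpha),
    the closed form for C ~ Weibull(alpha, theta) censoring, equals p."""
    rate = lambda th: sum(1.0 / (1.0 + (th / l) ** alpha) for l in lams) / len(lams)
    for _ in range(100):
        mid = math.sqrt(lo * hi)  # geometric bisection suits the wide bracket
        if rate(mid) < p:
            hi = mid  # the rate decreases in theta, so the root lies below mid
        else:
            lo = mid
    return math.sqrt(lo * hi)

def rweibull(shape, scale, rng):
    """One Weibull(shape, scale) draw by inversion."""
    return scale * (-math.log(rng.random())) ** (1.0 / shape)

random.seed(4)
alpha, target = 1.5, 0.4
lams = [math.exp(-0.3 * random.gauss(0.0, 1.0)) for _ in range(5000)]
theta = solve_theta(target, lams, alpha)

censored = 0
for lam in lams:
    t = rweibull(alpha, lam, random)   # event time T_i ~ Weibull(alpha, lambda_i)
    c = rweibull(alpha, theta, random)  # censoring time C_i ~ Weibull(alpha, theta)
    censored += t > c
empirical = censored / len(lams)
```

The empirical censoring proportion should sit within Monte Carlo error of the nominal target.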
3.2 Application: Bias analysis when estimating treatment effects using the Cox model with an unmeasured confounding variable
Lin et al. [24] assessed the sensitivity of point estimates of treatment effects in an observational study to unmeasured confounding effects. For survival outcomes, the authors examined the effect of censoring on bias and found that bias decreased with a larger censoring proportion. We designed a comprehensive simulation in an observational study setting to evaluate the impact of several factors on possible bias in estimating conditional treatment effects using the Cox proportional hazards model: the type of hazard function (increasing or decreasing), the size of the treatment effect, the magnitude of unmeasured confounding, the strength of association between unmeasured confounders and the exposure variable, and the censoring proportion. We let X = {X_1, X_2} be the observed confounders and U be an unmeasured confounder. The hazard function of the survival time T given the binary treatment variable D, observed confounders X, and unmeasured confounder U was:

h(t|D, X_1, X_2, U) = h_0(t) exp(γD + β_1 X_1 + β_2 X_2 + β_3 U)

where h_0(t) = αν^{−α} t^{α−1} was from Weibull(α, ν). The parameter of interest was γ, which measured the conditional treatment effect of the exposure variable D; β_3 measured the magnitude of the unmeasured confounding effect of U on the survival
outcome. For simplicity, we assumed {X_1, X_2} ⊥⊥ U. We let C be the censoring time and Y = min(T, C) be the observed follow-up time; by this definition, we assumed T ⊥⊥ C. In addition, δ = I(T > C) was the censoring indicator: δ = 0 when a subject experienced an event, and δ = 1 when the subject was censored. The binary treatment variable D was generated by:

P(D = 1|X_1, X_2, U) = expit(α_0 + α_1 X_1 + α_2 X_2 + α_3 U)

where α_3 measured the effect of U on the exposure variable D. We followed the steps below to generate the data:

(1) We generated 2 observed covariates, X_1 ∼ N(0, 1) and X_2 ∼ Bernoulli(0.5), and an unmeasured confounder U ∼ N(0, 1).

(2) In the treatment model, α_0 = −1.2. The coefficients were α_1 = 0.2, α_2 = −0.2, and |α_3| = {0.2, 0.6}, representing small (odds ratio [OR]: 1.22) and large (OR: 1.82) effects of the unmeasured confounder U. For the outcome model (9), we let β_0 = −0.23. The treatment effect γ was set to {0.2, 0.6}, representing small (hazard ratio [HR]: 1.22) and large (HR: 1.82) effects of the exposure variable D on the outcome given the observed confounders X and unmeasured confounder U. We defined β_1 = −0.2, β_2 = 0.2, and |β_3| = {0.2, 0.6}, representing small (HR: 1.22) and large (HR: 1.82) effects of the unmeasured confounding variable U on the outcome.

(3) To investigate the effect of the direction of association on potential bias, we set 16 possible scenarios of α_3 and β_3 combinations; for example, α_3 = 0.2 and β_3 = 0.2; α_3 = 0.2 and β_3 = −0.2; α_3 = −0.2 and β_3 = 0.2; or α_3 = −0.2 and β_3 = −0.2.

(4) For each possible combination of regression coefficients, we generated the time to event T according to equations (9) and (10). The baseline hazard function was simulated from Weibull(α, ν = 0.013) with shape parameter α = {0.8, 1.2}. Censoring time C was generated independently from Uniform(0, θ), and θ was determined by the algorithm outlined in section (2) to yield censoring proportions at nominal levels {0.1, 0.2, ..., 0.8, 0.9}.
(5) We examined a total of 576 scenarios: 2 levels of treatment effect γ, 16 combinations of the effects of the unmeasured confounder U on the outcome (β_3) and on the exposure (α_3), 2 shape parameters for increasing and decreasing hazard functions, and 9 censoring proportions (p). For each scenario, a sample of 10000 observations was generated and a Cox model was fitted with the exposure variable D and the observed confounders X_1 and X_2. Bias was the difference between the conditional treatment effect γ and its estimate. The process was repeated 1000 times.

With shape parameter α = 1.2, Figure (5) reveals that biases of the estimates were associated with the coefficients (β_3, α_3) of the unmeasured confounder U in both the treatment and outcome models, as well as the size of the treatment effect
γ, and the censoring proportion ρ. The results for α = 0.8 were similar to Figure 5 (data not shown), indicating that the shape parameter α of the baseline hazard function was not related to bias, because the baseline distribution used is inconsequential to the Cox proportional hazards model. Biases were large when the magnitudes of both β_3 and α_3 were large. The direction of bias was influenced by the signs of β_3 and α_3. When β_3 and α_3 were both negative or both positive, the estimates tended to be larger than the true value of γ (Figure 5 (A) and (D)). When β_3 and α_3 had different signs, the estimates tended to be smaller than the true treatment effect (Figure 5 (B) and (C)). Large treatment effect sizes were associated with large bias only when β_3 was larger, but the differences disappeared when the censoring rate increased. We observed that the effect of the censoring rate ρ on bias depended on the magnitudes and signs of α_3 and β_3, and even on the size of the treatment effect. When β_3 was small, the censoring rate had no obvious effect on bias. When β_3 was large and had the same sign as α_3, bias increased with increasing censoring rate (Figure 5 (A) and (D)). However, when the signs of β_3 and α_3 differed, bias either increased, remained unchanged, or decreased with increasing censoring rate, depending on the size of the treatment effect γ (Figure 5 (B) and (C)).
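The data-generating process of this section can be sketched as follows (a minimal illustration; the censoring parameter θ below is a hypothetical placeholder rather than a value solved by the algorithm of section (2), and no Cox model is fitted here):

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_subject(alpha=1.2, nu=0.013, gamma=0.6,
                     a=(-1.2, 0.2, -0.2, 0.6), b=(-0.2, 0.2, 0.6),
                     theta=5.0, rng=random):
    """One subject: U confounds both the treatment D and the Weibull event time T.

    Returns (Y, delta, D) with Y = min(T, C) and delta = I(T > C);
    theta is a placeholder, not a solved censoring parameter."""
    x1, u = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x2 = float(rng.random() < 0.5)
    d = float(rng.random() < expit(a[0] + a[1] * x1 + a[2] * x2 + a[3] * u))
    lp = gamma * d + b[0] * x1 + b[1] * x2 + b[2] * u  # log relative risk
    # invert S(t) = exp(-(t/nu)^alpha * exp(lp)) to draw the event time
    t = nu * (-math.log(rng.random()) * math.exp(-lp)) ** (1.0 / alpha)
    c = theta * rng.random()  # C ~ Uniform(0, theta)
    return min(t, c), int(t > c), int(d)

random.seed(5)
sample = [simulate_subject() for _ in range(1000)]
```

In the full study, such samples would be fed to a Cox model containing only D, X_1, and X_2 to measure the bias in the estimate of γ.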
4 Discussion
As survival analysis methods become widely used by medical researchers to model time-to-event outcomes, simulation studies are increasingly applied to investigate the properties of these survival methods under a variety of conditions. Censoring rate is an important factor that is routinely investigated. Different methodologies have been proposed to simulate the time to an event of interest. In particular, numerical methods are used to generate survival times when baseline hazard functions are complex and no closed form expression for the time to event is available [22]. However, the methodologies that generate censored survival data with desired censoring rates are rarely reported. This paper describes a general framework for simulating censored survival data for a given censoring proportion. The approach relies on numerical integration and a root-finding algorithm to determine an appropriate value for the censoring parameter. The framework is particularly useful when many scenarios are under investigation in a comprehensively designed simulation study, making manual tuning of censoring parameters for each scenario impractical. The new framework incorporates a variety of combinations of baseline hazard functions from flexible Weibull distributions, censoring distributions from uniform and Weibull distributions, and different types of covariates (i.e., normal, binary, and mixed). The Weibull distribution is the most widely used distribution in survival analysis [4, 15–17] and has considerable flexibility: the shape of its density function can be right skewed, symmetric, or left skewed, and its hazard function can be monotonically increasing, constant, or monotonically decreasing. Censoring times are also commonly generated from
uniform distribution because of its simplicity [7, 8, 13, 14]. Thus, we focus on a Weibull baseline hazard and Weibull or uniform censoring distributions in this study, but the framework is potentially applicable to other, more complex hazard functions. We validated our general framework with a variety of simulation studies. The consistency between the censoring rates in the simulated data generated by the proposed approach and the predefined nominal censoring rates indicated good performance of the proposed procedure in practice. We demonstrated the usefulness of our approach for solving real-world problems using simulated censored survival data. Using these data, we assessed the bias of estimating treatment effects with the Cox proportional hazards model in the presence of unmeasured confounding; the censoring parameter was computed for 9 different rates across more than 500 simulation scenarios. To facilitate the use of the methods described in this article, example R code for some complex scenarios is provided in the Appendix. Given that the properties of novel and conventional survival methods are routinely evaluated under different censoring rates, our new framework will be an important tool to help medical researchers generate censored survival data with desired censoring proportions.
Acknowledgment
We thank Dr. Chris Tachibana for proofreading the manuscript.
References
1. Lin DY. Survival analysis. In: Advanced Medical Statistics. World Scientific Publishing Company, 2003.
2. Hsieh FY, Lavori PW. Sample-size calculations for the Cox proportional hazards regression model with nonbinary covariates. Controlled Clinical Trials 2000; 21:552-560.
3. Vaeth M, Skovlund E. A simple approach to power and sample size calculations in logistic regression and Cox regression models. Statistics in Medicine 2004; 23:1781-1792.
4. Jiang H, Symanowski J, Paul S, Qu Y, Zagar A, Hong S. The type I error and power of non-parametric logrank and Wilcoxon tests with adjustment for covariates: a simulation study. Statistics in Medicine 2008; 27:5850-5860.
5. Kong FH, Gu M. Consistent estimation in Cox proportional hazards model with covariate measurement errors. Statistica Sinica 1999; 9:953-969.
6. Hu C, Lin DY. Cox regression with covariate measurement error. Scandinavian Journal of Statistics 2002; 29:637-655.
7. Lin DY, Psaty BM, Kronmal RA. Assessing the sensitivity of regression results to unmeasured confounders in observational studies. Biometrics 1998; 54:948-963.
8. Gonen M, Heller G. Concordance probability and discriminatory power in proportional hazards regression. Biometrika 2005; 92:965-970.
9. Lin NX, Logan S, Henley WE. Bias and sensitivity analysis when estimating treatment effects from the Cox model with omitted covariates. Biometrics 2013; 69:850-860.
10. Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine 2005; 24:1713-1723.
11. Crowther MJ, Lambert PC. Simulating biologically plausible complex survival data. Statistics in Medicine 2013; 32:4118-4134.
12. Schemper M. Cox analysis of survival data with non-proportional hazard functions. Journal of the Royal Statistical Society, Series D 1992; 41:455-465.
13. Cai J, Zeng D. Sample size/power calculation for case-cohort studies. Biometrics 2004; 60:1015-1024.
14. Cai J, Zeng D. Additive mixed effect model for clustered failure time data. Biometrics 2011; 67:1340-1351.
15. Zeng D, Cai J. Simultaneous modelling of survival and longitudinal data with an application to repeated quality of life measures. Lifetime Data Analysis 2005; 11:151-174.
16. Moriña D, Navarro A. The R package survsim for the simulation of simple and complex survival data. Journal of Statistical Software 2014; 59:1-20.
17. Lange C, Blacker D, Laird NM. Family-based association tests for survival and times-to-onset analysis. Statistics in Medicine 2004; 23:179-189.
18. Hendry DJ. Data generation for the Cox proportional hazards model with time-dependent covariates: a method for medical researchers. Statistics in Medicine 2014; 33:436-454.
19. Wan F, Small D, Bekelman JE, Mitra N. Bias in estimating the causal hazard ratio when using two-stage instrumental variable methods. Statistics in Medicine 2015; 34:2235-2265.
20. Khalid L, Anpalagan A. Reliability-based decision fusion scheme for cooperative spectrum sensing. IET Communications 2014; 8:2423-2432.
21. Venables WN, Ripley BD. Modern Applied Statistics with S, 4th edition. Springer: New York, 2002.
22. Crowther MJ, Lambert PC. Simulating biologically plausible complex survival data. Statistics in Medicine 2013; 32:4118-4134.
23. Wan F, Mitra N. An evaluation of bias in propensity score-adjusted non-linear regression models. Statistical Methods in Medical Research 2016; pii: 0962280216643739.
24. Lin DY, Psaty BM, Kronmal RA. Assessing the sensitivity of regression results to unmeasured confounders in observational studies. Biometrics 1998; 54:948-963.
Figure 1: Distribution of λ (x-axis: λ; y-axis: probability). The red dashed line is the Gaussian kernel density estimate of λ = exp(βX) based on 10,000 simulated samples.
Figure 2: Comparison of the nominal censoring proportions with the censoring proportions from simulations using the numerically computed θ (x-axis: nominal censoring proportion; y-axis: censoring proportion from simulation). (A) Baseline hazard function h0(t) = 0.25t^(−0.5) from Weibull(α = 0.5, ν = 4); (B) baseline hazard function h0(t) = 0.25 from Weibull(α = 1, ν = 4); (C) baseline hazard function h0(t) = 0.125t from Weibull(α = 2, ν = 4). Scenarios: (1) censoring distribution ∼ Weibull(α, θ), covariates ∼ N(0, 1); (2) censoring distribution ∼ Weibull(α, θ), covariates from arbitrary distributions; (3) censoring distribution ∼ Weibull(α, θ), binary covariates; (4) censoring distribution ∼ Weibull(0.8, θ), covariates ∼ N(0, 1); (5) censoring distribution ∼ Weibull(0.8, θ), covariates from arbitrary distributions; (6) censoring distribution ∼ Weibull(0.8, θ), binary covariates; (7) censoring distribution ∼ Uniform(0, θ), covariates ∼ N(0, 1); (8) censoring distribution ∼ Uniform(0, θ), covariates from arbitrary distributions; (9) censoring distribution ∼ Uniform(0, θ), binary covariates.
Figure 3: Bias in estimating the conditional treatment effect in the presence of an unmeasured confounder; shape parameter for time to event α = 1.2. X-axis: nominal censoring proportion; y-axis: bias ∆ = γ̂ − γ. Red dots represent the small treatment effect (γ = 0.2) and blue triangles the large treatment effect (γ = 0.6). β3 is the effect of the unmeasured confounder U on the outcome; α3 is the effect of U on the exposure variable. Panel (A) shows α3 = 0.2 and 0.6; panel (B) shows α3 = −0.2 and −0.6; within each panel, columns correspond to β3 = −0.2, 0.2, and 0.6.
Appendix

A.1 Derivation of the probability mass function of λi when the n covariates Xi ∼ B(pi)
Because the n covariates Xi take values of either 0 or 1, there are 2^n possible realizations of the covariate vector X. For example, when n = 3, the 8 possible realizations of <X1, X2, X3> are {<0, 0, 0>, <1, 0, 0>, <1, 1, 0>, <1, 1, 1>, <1, 0, 1>, <0, 1, 0>, <0, 1, 1>, <0, 0, 1>}. We let l(n,2) denote the lth of the 2^n possible realizations of the n binary variables Xi, where l = 1, 2, ..., 2^n. For example, when n = 3, 1(3,2) = <0, 0, 0>, 2(3,2) = <1, 0, 0>, ..., 8(3,2) = <0, 0, 1>. For simplicity of demonstration, we examine only 3 variables from now on. Further assume X1 ∼ B(p1), X2 ∼ B(p2), X3 ∼ B(p3), and that the associated regression parameters are {β1/α, β2/α, β3/α}. The scale parameter λi then takes 8 possible values, corresponding to the order of the 8 realizations above:
{e^(−β0/α), e^(−β0/α−β1/α), e^(−β0/α−β1/α−β2/α), e^(−β0/α−β1/α−β2/α−β3/α), e^(−β0/α−β1/α−β3/α), e^(−β0/α−β2/α), e^(−β0/α−β2/α−β3/α), e^(−β0/α−β3/α)},

and the corresponding probabilities are

{(1 − p1)(1 − p2)(1 − p3), p1(1 − p2)(1 − p3), p1p2(1 − p3), p1p2p3, p1(1 − p2)p3, (1 − p1)p2(1 − p3), (1 − p1)p2p3, (1 − p1)(1 − p2)p3}.
Thus, for n = 3, we can express the probability mass function of λi as

fλi(u) = ∏_{j=1}^{3} pj^(l(3,2)^(j)) (1 − pj)^(1 − l(3,2)^(j)),   u = exp(−β0/α − Σ_{j=1}^{3} βj l(3,2)^(j)/α),

for l ∈ {1, 2, ..., 8}, where l(3,2)^(j) denotes the jth element of the lth realization of {X1, X2, X3}. For example, when l = 3, l(3,2) = <1, 1, 0>; thus, for j = 1, l(3,2)^(1) = 1; for j = 2, l(3,2)^(2) = 1; for j = 3, l(3,2)^(3) = 0. We can write the probability mass function of λi more generally as

fλi(u) = ∏_{j=1}^{n} pj^(l(n,2)^(j)) (1 − pj)^(1 − l(n,2)^(j)),   u = exp(−β0/α − Σ_{j=1}^{n} βj l(n,2)^(j)/α),

for l ∈ {1, 2, ..., 2^n}, where l(n,2)^(j) is the jth element of the lth realization of the binary variables <X1, X2, ..., Xn>.
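The enumeration above can be checked directly. The sketch below (in Python rather than the paper's R, with helper names of our choosing) enumerates the 2^n realizations of the binary covariates, builds the probability mass function of λi, and verifies that the 2^n probabilities sum to one. The example parameter values are arbitrary illustrations.

```python
import itertools
import math

def lambda_pmf(betas, probs, alpha, beta0):
    """Probability mass function of lambda_i = exp(-beta0/alpha
    - sum_j beta_j * x_j / alpha) over the 2^n realizations of n
    independent Bernoulli covariates X_j ~ B(p_j)."""
    pmf = {}
    for x in itertools.product([0, 1], repeat=len(betas)):
        u = math.exp(-(beta0 + sum(b * xj for b, xj in zip(betas, x))) / alpha)
        pr = math.prod(p if xj else 1.0 - p for p, xj in zip(probs, x))
        pmf[u] = pmf.get(u, 0.0) + pr  # pool any coinciding values of lambda
    return pmf

pmf = lambda_pmf(betas=[0.5, -0.3, 0.9], probs=[0.4, 0.5, 0.6],
                 alpha=1.2, beta0=0.2)
print(len(pmf))           # 8 distinct values of lambda for n = 3
print(sum(pmf.values()))  # probabilities sum to 1
```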
A.2 Derivation of the individual censoring probability when the censoring distribution is Uniform(0, θ)
Assume the time to event of the ith individual is T ∼ Weibull(α, λ), the censoring time is L ∼ Uniform(0, θ), and T ⊥ L. Let δ = I(T ≥ L) be the censoring indicator, so that δi = 1 when the ith subject is censored. Then

P(δ = 1 | λ, α, θ) = P(L ≤ T ≤ ∞, 0 ≤ L ≤ θ)
  = ∫_0^θ (1/θ) ∫_l^∞ f(t | λ, α) dt dl
  = ∫_0^θ (1/θ) exp(−(l/λ)^α) dl
  = (λ/θ) ∫_0^(θ/λ) exp(−z^α) dz,   z = l/λ
  = (λ/(αθ)) ∫_0^((θ/λ)^α) s^(1/α − 1) exp(−s) ds,   s = z^α
  = (λ/(αθ)) Γ(1/α, (θ/λ)^α),

where Γ(·, ·) is the lower incomplete gamma function.
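The closed form is easy to verify numerically. The Python sketch below (function name is ours; the paper's own code is in R) computes P(δ = 1) via SciPy's regularized lower incomplete gamma function and cross-checks it against direct numerical integration of ∫_0^θ (1/θ) exp(−(l/λ)^α) dl, for arbitrary illustrative parameter values.

```python
import math
from scipy.integrate import quad
from scipy.special import gammainc  # regularized lower incomplete gamma P(a, x)

def censor_prob_uniform(lam, alpha, theta):
    """P(delta = 1) for T ~ Weibull with survival S(t) = exp(-(t/lam)^alpha)
    and L ~ Uniform(0, theta), i.e.
    (lam / (alpha * theta)) * gamma_lower(1/alpha, (theta/lam)^alpha)."""
    a = 1.0 / alpha
    # scipy's gammainc is regularized, so multiply back by Gamma(a)
    return lam / (alpha * theta) * gammainc(a, (theta / lam) ** alpha) * math.gamma(a)

# Cross-check against direct numerical integration of the censoring probability.
lam, alpha, theta = 2.0, 1.5, 3.0
direct = quad(lambda l: math.exp(-((l / lam) ** alpha)) / theta, 0.0, theta)[0]
print(abs(censor_prob_uniform(lam, alpha, theta) - direct))  # near 0
```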
A.3 Example R code for computing the censoring parameter θ

Assume there are 4 independent covariates X = {X1, X2, X3, X4}. The baseline hazard function for the survival time T is h0(t) = 2.5t^0.2 from Weibull(1.2, 0.5).

(I) The censoring distribution is Uniform(0, θ) and the covariates follow Bernoulli distributions with different event probabilities.

(I.1) The 4 variables X = {X1, X2, X3, X4} ∼ B(pi) for i = 1, 2, 3, 4.
γ(θ) = Σ_{l=1}^{2^4} P(δ = 1 | u_l, α, θ) fλi(u_l) − p
     = Σ_{l=1}^{2^4} (λ_l/(αθ)) Γ(1/α, (θ/λ_l)^α) ∏_{j=1}^{4} pj^(l(4,2)^(j)) (1 − pj)^(1 − l(4,2)^(j)) − p,

where u_l = λ_l = exp(−β0/α − Σ_{j=1}^{4} βj l(4,2)^(j)/α). We use the following R code to numerically solve γ(θ) = 0 for θ.
p.1