Shared Frailty Model for Left-Truncated Multivariate Survival Data

Henrik Jensen,1,2,∗ Ron Brookmeyer,3 Peter Aaby1 and Per Kragh Andersen2

1 Bandim Health Project, International Health Unit, Department of Epidemiology Research, Statens Serum Institut, DK-2300 Copenhagen S, Denmark
2 Department of Biostatistics, University of Copenhagen, DK-2200 Copenhagen N, Denmark
3 Department of Biostatistics, The Johns Hopkins University, Baltimore, Maryland 21205, USA

April 30, 2004
Summary. Left-truncated multivariate survival data are studied assuming a shared frailty model. A conditional model is studied where the interpretation of regression parameters is conditional on the frailty value. The survival of individuals in a group of related individuals depends on the frailty value and individuals with left-truncated life times have survived until their entry age and will carry information about their frailty value. This needs to be considered in the analysis. We propose a simple approximation to the likelihood function and compare it with an earlier proposed approximation. The conditional model is more relevant than a marginal model, for example the copula models, if the grouping variable may be a potential confounder. A real-life example from Guinea-Bissau, West Africa is used as illustration.

Key words: Multivariate Survival Data; Left Truncation; Multiplicative Hazard Model; Shared Gamma Frailty; Conditional Model; Piecewise Exponential Model; Childhood Survival.

∗ email: [email protected]
1. Introduction
Survival data are said to be left-truncated when not all individuals in the data are observed from the time origin of interest. When survival is analysed as a function of age, the time origin is birth and some individuals may only be observed from a certain age. From an analytical point of view the individuals are not at risk of dying before this age. This has to be taken into account in a survival analysis and is standard in the analysis of univariate survival data.

Collection of survival times from children in low-income countries is often of such a character that different levels exist such as sampling cluster, village, family, and mother of related children. Inclusion of such interdependence in the analysis is needed either as a scientific interest per se or to test whether the interdependence has any consequences for the result of the analysis, e.g. the magnitude of risk factors. A vast literature exists on regression analysis of this type of multivariate survival data (see, for example, Hougaard, 2000 and the references therein) but the purpose of the present paper is not to review that literature. Rather, we shall focus on the shared frailty model in a situation with left-truncated survival times. This does not seem to have been studied in detail elsewhere, though Clayton (1978), Nielsen et al. (1992), and Andersen et al. (1997) do touch upon the issue.

We will discuss analysis of multivariate survival data using a situation with siblings having correlated survival times. Assume we have observations from $j = 1, \ldots, m_i$ children coming from $i = 1, \ldots, n$ independent families. The regression analysis of multivariate survival data using a shared frailty model and a multiplicative hazard model is formulated as (see, for example, Hougaard, 2000)
\[
\lambda_{ij}(t; \phi \mid z_i) = z_i\,\alpha_{ij}(t; \phi) = z_i\,\alpha_{h0}(t; \gamma)\exp\big(\beta^T W_{ij}(t)\big), \qquad t > 0, \tag{1}
\]
where $t$ is age. The term $z_i > 0$ is the family-specific frailty assumed to have density $f(\,\cdot\,; \theta)$, parameterised by the unknown parameter vector $\theta$, which is a parameter of interest. By letting individuals in the same group share the same frailty variable, a positive statistical dependence is introduced. Conditional on the frailty, the survival times of individuals within a group are assumed to be independent. The function $\alpha_{h0}(\cdot)$ is the stratum-specific baseline hazard in stratum $h$. If the vector $\gamma$ is infinite dimensional and $\alpha_{h0}(a; \gamma)$ is assumed to be completely unspecified, we are in a Cox regression type of model (semi-parametric model), while a finite-dimensional vector $\gamma$ gives us a fully parametric regression model. Finally, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a vector of unknown regression parameters describing the effect of the observed, possibly time-varying, covariates $W_j(a) = (W_{j1}(a), \ldots, W_{jp}(a))^T$. The total vector of parameters of interest is denoted $\phi = (\beta, \gamma, \theta)^T$.

The frailty distribution $f(\,\cdot\,; \theta)$ is the distribution of frailty values at time zero, in this case at birth. In survival data with left truncation the frailty distribution among survivors in a family is different, as survivors will tend to have lower frailty values. To determine the frailty distribution under left truncation in a shared frailty model is the purpose of the present paper. The gamma distribution has been studied extensively as the choice of frailty distribution, giving what is called the shared gamma frailty model. Later in this paper the gamma distribution will be the choice, and other distributions are not discussed in great detail.

The shared frailty model as formulated here is a conditional model and the interpretation of regression parameters is conditional on the frailty value, i.e. within families. This conditional model is related to the (highly) stratified Cox regression model $\lambda_{ij}(t; \beta) = \alpha_{i0}(t)\exp(\beta^T W_{ij}(t))$, where the interpretation of the regression parameters is within stratum, i.e. within families.
A disadvantage of the stratified model is that we can have extensive loss of information (data) if groups are small or contain no deaths; in particular, groups of size one are not included. The interpretation of regression parameters in these models is in contrast to marginal regression models for multivariate survival data such as the Generalised Estimating Equation (GEE) approach (Liang and Zeger, 1986) and copula models (see for example Glidden, 2000) where the interpretation of regression parameters is on the population level.

The shared frailty model (1) has a corresponding copula formulation using the marginal distributions. Let $S(x_1, \ldots, x_m)$ be the joint survival function for the related survival times $X_1, \ldots, X_m$, having marginal survival functions denoted by $S_1, \ldots, S_m$. The shared gamma frailty model using copulas is
\[
S(x_1, \ldots, x_m) = C_\theta\big(S_1(x_1), \ldots, S_m(x_m)\big) = \big(S_1(x_1)^{1-\theta} + \cdots + S_m(x_m)^{1-\theta} - (m-1)\big)^{\frac{1}{1-\theta}},
\]
also known as the Clayton-Oakes model. Estimation can then be done using a two-stage estimation procedure, see Glidden (2000) and Andersen (2004). Briefly, the first stage is to estimate e.g. a marginal Poisson or Cox regression using GEE. The second stage consists of inserting the parameter estimates from the first stage into the joint likelihood function and estimating the parameter $\theta$ in this pseudo likelihood function. The handling of left truncation for the first stage uses the corresponding univariate methods. For the second stage the joint survival function is divided by the joint survival function at the left truncation ages, as
\[
S(x_1, \ldots, x_m \mid X_1 > V_1, \ldots, X_m > V_m) = \frac{C_\theta\big(S_1(x_1), \ldots, S_m(x_m)\big)}{C_\theta\big(S_1(V_1), \ldots, S_m(V_m)\big)}.
\]
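The two copula expressions above can be evaluated directly. The following is a minimal Python sketch, not the authors' implementation; the function names are hypothetical, and it assumes the parameterisation shown above with $\theta > 1$ and marginal survival probabilities strictly between 0 and 1 (or equal to 1).

```python
def clayton_oakes(marginals, theta):
    """Joint survival C_theta(S_1, ..., S_m) in the parameterisation used above.

    Assumption of this sketch: theta > 1 and every marginal lies in (0, 1].
    """
    if theta <= 1:
        raise ValueError("this parameterisation assumes theta > 1")
    m = len(marginals)
    total = sum(s ** (1.0 - theta) for s in marginals) - (m - 1)
    return total ** (1.0 / (1.0 - theta))


def truncated_joint_survival(surv_at_x, surv_at_v, theta):
    """Joint survival at ages x given survival to the entry ages V.

    surv_at_x[j] = S_j(x_j) and surv_at_v[j] = S_j(V_j).
    """
    return clayton_oakes(surv_at_x, theta) / clayton_oakes(surv_at_v, theta)
```

With no left truncation, i.e. $S_j(V_j) = 1$ for all $j$, the denominator equals one and the conditional survival function reduces to the unconditional one.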
The estimation of the association parameter $\theta$ is then performed in this conditional joint survival function.

If the scientific interest is solely on the effect of different covariates on survival on a population level and the association within groups is regarded as a nuisance, the marginal models are in general to be preferred to the frailty models. There seems to be a broad consensus on this viewpoint, see for example Pickles (1998). However, in epidemiology one is often interested in conditioning upon potential confounders, i.e. covariates assumed to be associated with both the outcome and the exposure. If the grouping variable (for example family) is thought to be a potential confounder, it seems more natural to make use of a conditional approach like the shared frailty model (1).

The present paper is organised as follows. Section 2 introduces a motivating real-life example from Guinea-Bissau, West Africa, where left truncation of dependent life times is present. In Section 3 we present notation and discuss a formalised sampling scheme for enrollment of children in the Guinean cohort. Section 4 presents an exact marginal likelihood function using the shared frailty model with left-truncated survival times. In Section 5 two approximations to the exact likelihood are discussed. Section 6 presents a simulation study comparing the two approximations. Analysis of the Guinean data is presented in Section 7 and a general discussion concludes the paper in Section 8.

2. Example
Since 1978, the Bandim Health Project (BHP) has undertaken longitudinal community studies in Guinea-Bissau, West Africa. An essential part of the project is a demographic surveillance system (DSS) which provides a basis for studies conducted at the BHP. As part of the project children below five years of age in 20 villages in five rural regions in Guinea-Bissau are followed. A mobile team visits each village twice a year, with an intended interval of six months between visits. Every house in the villages is visited to check for new pregnancies, births, and deaths, to follow up on previously registered pregnancies, and to check for in-migrants among children under five years of age. A child can enter the DSS with a date of entry before date of birth (as a pregnancy) or with a date of entry after date of birth (missed pregnancy or in-migrant). Once a child is registered it is visited routinely every six months until five years of age to monitor the health of the child. Children belong to different compounds within a village. Briefly, a compound is a residential and social unit of households and houses, typically built together, and all members will recognise one person as the head of the compound.

From January 1984 the BHP was the first to introduce vaccination against diphtheria-tetanus-pertussis (DTP) for children older than three months of age. We wanted to investigate the association between the introduction of DTP vaccine and survival in the period 1984 to 1987, see Aaby et al. (2004) for further details. The effect of the DTP vaccine on mortality is investigated in the follow-up interval from the initial examination to the subsequent examination, six months (183 days) of follow-up, out-migration, or death, whichever came first. At the initial visit children were vaccinated but some children did not get a vaccination because they were either absent, too ill to be vaccinated, or there were simply visits where no vaccine was given. Children were followed over an age interval from 2 to 14 months of age and were classified according to vaccination status at the initial examination, the vaccination status thus being a time-fixed covariate. With age as the time scale in the survival analysis, children have left-truncated survival times.

We wanted to control for compound in the analysis not only because of the possibility of dependence between survival times of children within a compound but also because compound is a potential confounder. Different compounds will probably have different vaccination coverage as well as different mortality levels. Hence, a conditional approach to the analysis of these multivariate survival data seems relevant.

3. Notation and Sampling Scheme
We will calculate the likelihood function for the data collected from a start date $\tau_0$ to an end date $\tau$ with respect to the order in which the data were collected. We will do this conditionally on events happening in calendar time: the events of enrollment of children from the family. Children can be enrolled on the same day but at different ages, or at the same age but on different days. When we wish to analyse the hazard as a function of age, we align children according to birth and then look forward in the age scale. This means that we reshuffle our observations and, as a consequence, the original sequencing of the events recorded in our data will be changed (see also Arjas, 1998). In writing up the likelihood function, we need to take into consideration the order in the calendar scale, as the probability of enrollment will likely depend on earlier enrollments from the same family.

During the study period, $[\tau_0, \tau]$, we assume that we have observations from $j = 1, \ldots, m_i$ children coming from $i = 1, \ldots, n$ independent families. Let $X_{ij} > 0$ be a random variable describing the survival time (age) of child $(i, j)$, and let $U_{ij} > 0$ be the right censoring age for child $(i, j)$. Let $\tilde X_{ij} = \min(X_{ij}, U_{ij})$ be the observed survival times. Together with $\tilde X_{ij}$ we also observe an indicator function $D_{ij}$ indicating whether $\tilde X_{ij}$ is a censoring or a death,
\[
D_{ij} =
\begin{cases}
0 & \text{if } X_{ij} > U_{ij}, \text{ i.e. a censoring,} \\
1 & \text{if } X_{ij} \le U_{ij}, \text{ i.e. a death.}
\end{cases}
\]
Furthermore, let $V_{ij} \ge 0$ be the left truncation age for enrollment into the study, i.e. the age at which the child first came under observation.

In what follows we will consider only one family, as families are assumed to be independent, and index $i$ is dropped in the formulas. In the calendar time scale assume that for a family we have $s$ ordered and distinct entry times $\tau_0 \le \nu_1 < \cdots < \nu_s \le \tau$. At each entry time $\nu_l$ we enroll $m_l \ge 0$ new children into our cohort and $m = \sum_{l=1}^{s} m_l$ is the total number of children enrolled from the family during the study period $[\tau_0, \tau]$. We assume that children are enrolled according to the following expanded prevalent sampling criterion (see Wang, Brookmeyer and Jewell, 1993):

Dynamic Prevalent Family Sampling. Conditional on a fixed $\nu_l$, a child from the family is randomly selected from the children in the family who were born before time $\nu_l$, have not died prior to $\nu_l$, and have not been selected before time $\nu_l$. Furthermore, an unborn child registered as a pregnancy at time $\nu_l$ is randomly selected.
The entry times $\nu_1, \ldots, \nu_s$ are assumed deterministic. Let $(T_{B(j)}, T_{IN(j)}, T_{OUT(j)}, D_j, W_j(t))$ be the observed data for each child, $j = 1, \ldots, m$, where $T_{B(j)}$ is the date of birth, $T_{IN(j)}$ the date of entry into study, and $T_{OUT(j)}$ is the date of exit from study. If a child is registered as a pregnancy and born alive we let $T_{IN(j)} = T_{B(j)}$, for $T_{B(j)} < t$, otherwise $T_{IN(j)} = \infty$. We have $T_{B(j)} \le T_{IN(j)} < T_{OUT(j)}$, $T_{IN(j)} \in \{\nu_1, \ldots, \nu_s, T_{B(j)}\}$, and $T_{OUT(j)} \le \tau$. The variable $D_j$ indicates whether the child died or not at $T_{OUT(j)}$, and $W_j(t)$ is explanatory covariate information possibly depending on time $t$. We have the relationships
\[
\tilde X_j = T_{OUT(j)} - T_{B(j)} \qquad \text{and} \qquad V_j = T_{IN(j)} - T_{B(j)}.
\]
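As an illustration of these two relationships, the following Python sketch maps the calendar dates of one child to the age scale. The function name and the dates in the example are hypothetical and chosen only for illustration; ages are measured in days.

```python
from datetime import date


def to_age_scale(t_birth, t_in, t_out):
    """Map calendar dates to (V_j, X~_j), both in days of age.

    t_birth, t_in and t_out play the roles of T_B(j), T_IN(j) and T_OUT(j).
    """
    if not (t_birth <= t_in < t_out):
        raise ValueError("expected T_B <= T_IN < T_OUT")
    entry_age = (t_in - t_birth).days   # V_j, the left truncation age
    exit_age = (t_out - t_birth).days   # X~_j, the observed survival time
    return entry_age, exit_age


# Hypothetical child: born 1 Jan 1984, first seen 1 Mar 1984, last seen 31 Aug 1984.
print(to_age_scale(date(1984, 1, 1), date(1984, 3, 1), date(1984, 8, 31)))  # (60, 243)
```

A child registered as a pregnancy and born alive has `t_in == t_birth`, which gives an entry age of zero, i.e. no left truncation.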
Let the random variable $M_l$ be the number of children enrolled at time $\nu_l$, i.e. $m_l$ is the realised value of the random variable $M_l$. Denote by $A_l$, $l = 1, \ldots, s$, the sampling event at time $\nu_l$ of $m_l$ children according to our sampling criterion above, i.e. at time $\nu_l$ we enroll $m_l$ new children who have survived until $\nu_l$, meaning $X_j > \nu_l - T_{B(j)}$, which is the same as enrolling $m_l$ children with ages $V_{l1}, \ldots, V_{lm_l}$. More precisely,
\[
A_l = \{\text{enrollment of } m_l \text{ new children at time } \nu_l, \text{ with ages } V_{l1}, \ldots, V_{lm_l}\} = \{M_l = m_l,\, X_{l1} > V_{l1}, \ldots, X_{lm_l} > V_{lm_l}\}.
\]
Note that the event $A_l$ is dependent on the earlier events $A_k$, $k < l$, as the chance of enrollment may depend on how many children were included at times before $\nu_l$, which again depends on how many children from the family were alive at time $\nu_l$. However, the survival of children from different $A_l$'s is assumed conditionally independent given $z$.

Let $N_j(t) = I(T_{OUT(j)} \le t, D_j = 1)$ be the observed right-censored and left-truncated counting process in calendar time, counting one if the child is observed to have died at or before time $t$. Let
\[
\mu_j(t; \phi \mid z, T_{B(j)}), \qquad t > T_{B(j)}, \tag{2}
\]
be the hazard of dying in calendar time given $z$ and $T_{B(j)}$ for child $j$, parameterised by the unknown parameter vector of interest $\phi$. We assume that we have a simple relationship between the age-dependent hazard (1) and the calendar time hazard (2) by
\[
\mu_j(t; \phi \mid z, T_{B(j)}) = z\,\alpha_j(t - T_{B(j)}; \phi), \qquad t > T_{B(j)}, \tag{3}
\]
i.e. the hazard of dying in calendar time only depends on the age of the child at time $t$. The assumption is rather strict, as the mortality could well change over the years, but this could be handled using time-varying covariates. Let
\[
C_j(t) = I(T_{OUT(j)} \le t, D_j = 0) = I(\tilde X_j \le t - T_{B(j)}, D_j = 0)
\]
be the right censoring counting process in calendar time, counting one if the child is observed to have been censored at or before time $t$. We also define a hazard of being censored given $z$ in the calendar time scale as
\[
\psi_j(t; \phi, \zeta \mid z, T_{B(j)}), \qquad t > T_{B(j)},
\]
parameterised by the parameter vector of interest $\phi$ and a nuisance parameter vector $\zeta$. Furthermore, let
\[
\tilde Y_j(t) = I(T_{IN(j)} < t \le T_{OUT(j)}) = Y_j(t - T_{B(j)}) \tag{4}
\]
be the indicator function in calendar time indicating whether child $j$ is included in the study and at risk of dying at time $t-$. Let $H(t) = \{j : T_{IN(j)} \le t\}$ be the index set of children enrolled at or before time $t$, so $H(t) = H(\nu_l)$ for $t \in [\nu_l, \nu_{l+1})$.
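The at-risk indicator (4) and the index set $H(t)$ are simple to compute from the entry and exit times. A minimal Python sketch follows; the function names and the dictionary layout are assumptions made for illustration, with times on any common numeric calendar scale.

```python
def at_risk(t, t_in, t_out):
    """Calendar-time at-risk indicator: 1 if T_IN < t <= T_OUT, else 0."""
    return 1 if t_in < t <= t_out else 0


def enrolled(t, entry_times):
    """Index set H(t) of children with T_IN <= t.

    entry_times maps a child identifier to its entry time T_IN(j).
    """
    return {j for j, t_in in entry_times.items() if t_in <= t}
```

A child with $T_{IN(j)} = \infty$, for instance a pregnancy that did not result in a live birth, never enters $H(t)$ and is never at risk.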
4. Exact (Partial) Likelihood for Shared Frailty
We will calculate the family's contribution to the conditional likelihood function given the frailty $z$ and the sampling event $\cup_{l=1}^{s} A_l$, denoted by $L_{[\nu_1,\tau]}(\omega \mid z, \cup_{l=1}^{s} A_l)$ where $\omega = (\phi, \zeta)^T$. By repeated use of the definition of conditional probability we get
\[
L_{[\nu_1,\tau]}\big(\omega \mid z, \cup_{l=1}^{s} A_l\big) = L_{[\nu_s,\tau]}\big(\omega \mid z, \cup_{l=1}^{s} A_l, \mathcal{F}_{\nu_s-}\big) \prod_{l=1}^{s-1} L_{[\nu_l,\nu_{l+1})}\big(\omega \mid z, \cup_{k \le l} A_k, \mathcal{F}_{\nu_l-}\big),
\]
where $\mathcal{F}_{\nu_l-}$ is the survival and censoring history of the $H(\nu_{l-1})$ children at time $\nu_l-$, and we have used the assumption that given $z$, the children in the family are independent, here explicitly that children entering before time $\nu_l$ are conditionally independent of $A_k$ for $k > l$ given $z$. The likelihood terms in the product involve more and more children as time progresses. By using the conditional multinomial probability argument (see, for example, Andersen et al. (1993), page 99) for $l = 1, \ldots, s-1$ we get the product integral
\[
\begin{aligned}
L_{[\nu_l,\nu_{l+1})}\big(\omega \mid z, \cup_{k \le l} A_k, \mathcal{F}_{\nu_l-}\big)
= \prod_{t \in [\nu_l,\nu_{l+1})} \Bigg\{ & \prod_{j \in H(\nu_l)} \big(\mu_j(t; \phi \mid z, T_{B(j)})\,\tilde Y_j(t)\,dt\big)^{\Delta N_j(t)} \big(\psi_j(t; \phi, \zeta \mid z, T_{B(j)})\,\tilde Y_j(t)\,dt\big)^{\Delta C_j(t)} \\
& \times \Big(1 - \sum_{j \in H(\nu_l)} \big(\mu_j(t; \phi \mid z, T_{B(j)}) + \psi_j(t; \phi, \zeta \mid z, T_{B(j)})\big)\,\tilde Y_j(t)\,dt\Big)^{1 - \Delta N_\cdot(t) - \Delta C_\cdot(t)} \Bigg\}
\end{aligned}
\]
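and a similar term for $l = s$, where a dot in place of the index $j$ denotes summation over $j$. When a pregnancy results in a miscarriage or a stillbirth we have that $\tilde Y_j(t) = 0$ for all $t$. Evaluating the product integral, multiplying the contributions from each interval, and rearranging terms we get the full conditional likelihood function
\[
\begin{aligned}
L_{[\nu_1,\tau]}\big(\omega \mid z, \cup_{l=1}^{s} A_l\big)
&= \prod_{j=1}^{m} \mu_j(T_{OUT(j)}; \phi \mid z, T_{B(j)})^{D_j} \exp\Big\{-\sum_{j=1}^{m} \int_{\tau_0}^{\tau} \mu_j(u; \phi \mid z, T_{B(j)})\,\tilde Y_j(u)\,du\Big\} \\
&\quad \times \prod_{j=1}^{m} \psi_j(T_{OUT(j)}; \phi, \zeta \mid z, T_{B(j)})^{1-D_j} \exp\Big\{-\sum_{j=1}^{m} \int_{\tau_0}^{\tau} \psi_j(u; \phi, \zeta \mid z, T_{B(j)})\,\tilde Y_j(u)\,du\Big\} \\
&= L_1\big(\phi \mid z, \cup_{l=1}^{s} A_l\big) \times L_2\big(\phi, \zeta \mid z, \cup_{l=1}^{s} A_l\big).
\end{aligned} \tag{5}
\]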
The function $L_1(\phi \mid z, \cup_{l=1}^{s} A_l)$ is the contribution from the mortality hazards while the function $L_2(\phi, \zeta \mid z, \cup_{l=1}^{s} A_l)$ is from the censoring hazards. As we cannot observe the frailty $z$, we will have to integrate $z$ out of the full conditional likelihood (5) to get a marginal conditional likelihood function, i.e. we need to evaluate the integral
\[
\int_z L_1\big(\phi \mid z, \cup_{l=1}^{s} A_l\big)\, L_2\big(\phi, \zeta \mid z, \cup_{l=1}^{s} A_l\big)\, f\big(z; \theta \mid \cup_{l=1}^{s} A_l\big)\, dz. \tag{6}
\]
We need to consider the dependence of the censoring on the frailty variable $z$ because otherwise we cannot tell what the effect will be of integrating $z$ out of the likelihood. Furthermore, we need the conditional frailty distribution of $z$ given the sampling event $\cup_{l=1}^{s} A_l$. For the censoring hazard we need to make the assumption that conditional on $z$, censoring is non-informative of $z$, i.e. $\psi_j(u; \phi, \zeta \mid z)$ does not depend on $z$, along with the usual assumption that the right censoring is non-informative of $\phi$. We also need the assumption that conditional on $z$, censoring and left truncation are independent, i.e. the death hazard for a child is the same whether we observe the child or not and regardless of when it was included (see for example Andersen et al., 1993). Under these assumptions the factors involving the censoring hazards in the full conditional likelihood (5) depend on neither $\phi$ nor $z$, hence not on $\theta$, and we will consider the first factor $L_1(\phi \mid z, \cup_{l=1}^{s} A_l)$ as a partial conditional likelihood function. Thus the marginal (partial) conditional likelihood function (6) becomes
\[
L\big(\phi \mid \cup_{l=1}^{s} A_l\big) = \int_z L_1\big(\phi \mid z, \cup_{l=1}^{s} A_l\big)\, f\big(z; \theta \mid \cup_{l=1}^{s} A_l\big)\, dz. \tag{7}
\]
Before considering the conditional frailty distribution of $z$ given the sampling event $\cup_{l=1}^{s} A_l$ we carry out the transformation from the calendar to the age scale under the above assumptions on censoring. From (3) and (4) we get
\[
L_1\big(\phi \mid z, \cup_{l=1}^{s} A_l\big) = \prod_{j=1}^{m} \big(z\,\alpha_j(\tilde X_j; \phi)\big)^{D_j} \exp\Big\{-z \sum_{j=1}^{m} \int_0^{\infty} \alpha_j(s; \phi)\, Y_j(s)\, ds\Big\}. \tag{8}
\]
By Bayes' rule we have
\[
f\big(z; \theta \mid \cup_{l=1}^{s} A_l\big) = \frac{P\big(\cup_{l=1}^{s} A_l \mid z\big)\, f(z; \theta)}{\int_z P\big(\cup_{l=1}^{s} A_l \mid z\big)\, f(z; \theta)\, dz}. \tag{9}
\]
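The first factor of the numerator is
\[
\begin{aligned}
P\big(\cup_{l=1}^{s} A_l \mid z\big)
&= \prod_{l=1}^{s} P\big(A_l \mid z, \cup_{k<l} A_k\big) \\
&= \prod_{l=1}^{s} P\big(M_l = m_l,\, X_{l1} > V_{l1}, \ldots, X_{lm_l} > V_{lm_l} \mid z, \cup_{k<l} A_k\big) \\
&= \prod_{l=1}^{s} P\big(X_{l1} > V_{l1}, \ldots, X_{lm_l} > V_{lm_l} \mid z, \cup_{k<l} A_k, M_l = m_l\big)\, P\big(M_l = m_l \mid z, \cup_{k<l} A_k\big) \\
&= \prod_{l=1}^{s} P\big(X_{l1} > V_{l1}, \ldots, X_{lm_l} > V_{lm_l} \mid z, M_l = m_l\big)\, P\big(M_l = m_l \mid z, \cup_{k<l} A_k\big),
\end{aligned}
\]
where the factor $P(X_{l1} > V_{l1}, \ldots, X_{lm_l} > V_{lm_l} \mid z, M_l = m_l)$ is a product of $m_l$ factors as the children are conditionally independent given $z$.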
The $P(M_l = m_l \mid z, \cup_{k<l} A_k)$ [...] $> 0$, $c = 1, \ldots, q$ and $h = 1, \ldots, k$. The parameter is $kq$-dimensional and we write $\gamma^T = (\gamma_1^T, \ldots, \gamma_k^T)$ where $\gamma_h^T = (\gamma_{h1}, \ldots, \gamma_{hq})$, $h = 1, \ldots, k$. We will parameterise by $\lambda = \log(\gamma)$. By direct integration over $z$ in the complete data likelihood function (11) we
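find that the contribution from one family to the marginal (partial) log-likelihood function becomes
\[
\begin{aligned}
l(\beta, \lambda, \theta) ={}& \sum_{r=0}^{D_\cdot - 1} \log\Big(\frac{1}{\theta} + r\Big) + \frac{1}{\theta} \log\Big(\frac{1}{\theta} + \Lambda^{V}(\beta, \lambda)\Big) - \Big(\frac{1}{\theta} + D_\cdot\Big) \log\Big(\frac{1}{\theta} + \Lambda^{\tilde X}(\beta, \lambda)\Big) \\
& + \sum_{j=1}^{m} \sum_{c=1}^{q} D_j \lambda_{hc} I_c(\tilde X_j) + \sum_{j=1}^{m} D_j \beta^T W_j(\tilde X_j),
\end{aligned} \tag{12}
\]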
where $I_c(a) = I(a_{c-1} < a \le a_c)$ is the indicator function for $a$ belonging to the interval $(a_{c-1}, a_c]$. We will refer to (12) as the updated frailty log-likelihood function or as the updated frailty model, as we have updated the frailty distribution given entry times. Without frailties we are in a standard Poisson regression model with log-likelihood function
\[
l(\beta, \lambda) = \sum_{j=1}^{m} \sum_{c=1}^{q} D_j \lambda_{hc} I_c(\tilde X_j) + \sum_{j=1}^{m} D_j \beta^T W_j(\tilde X_j) - \Lambda^{\tilde X}_{V}(\beta, \lambda). \tag{13}
\]
Estimation can be done by maximising the marginal log-likelihood function using the first and second derivatives of the log-likelihood in a Newton-Raphson iterative procedure. A likelihood ratio test for $\theta = 0$ can be performed by using (12) and (13). As $\theta = 0$ is on the boundary of the parameter space, use of a one-sided p-value from the corresponding $\chi^2$ distribution is recommended (Liang and Self, 1996).

Returning to the proposal by Nielsen et al. (1992) of using the appropriate risk function $Y_j(a) = I(V_j < a \le \tilde X_j)$ to account for left truncation, the log-likelihood function for one family becomes
\[
\begin{aligned}
l(\beta, \lambda, \theta) ={}& \sum_{r=0}^{D_\cdot - 1} \log\Big(\frac{1}{\theta} + r\Big) + \frac{1}{\theta} \log\Big(\frac{1}{\theta}\Big) - \Big(\frac{1}{\theta} + D_\cdot\Big) \log\Big(\frac{1}{\theta} + \Lambda^{\tilde X}_{V}(\beta, \lambda)\Big) \\
& + \sum_{j=1}^{m} \sum_{c=1}^{q} D_j \lambda_{hc} I_c(\tilde X_j) + \sum_{j=1}^{m} D_j \beta^T W_j(\tilde X_j),
\end{aligned} \tag{14}
\]
which we will refer to as the naive log-likelihood function or as the naive model. Note that without left truncation the naive and updated frailty models are identical.
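To make the difference between (12), (13) and (14) concrete, the following Python sketch evaluates the three log-likelihood contributions for a single family. It is an illustration under simplifying assumptions that are not part of the paper's general formulation: one stratum, a constant baseline hazard $\gamma = e^{\lambda}$ (so $q = 1$), and a single time-fixed scalar covariate. All function names are hypothetical. The last function halves the $\chi^2_1$ tail probability, which is one common way of obtaining the one-sided p-value mentioned above.

```python
import math


def family_terms(rows, beta, lam):
    """rows holds one tuple (V_j, X_j, D_j, W_j) per child.

    Assumes a constant baseline hazard exp(lam) and a time-fixed covariate W_j.
    """
    rates = [math.exp(lam + beta * w) for _, _, _, w in rows]
    cum_v = sum(r * v for r, (v, _, _, _) in zip(rates, rows))  # integrated hazard up to entry
    cum_x = sum(r * x for r, (_, x, _, _) in zip(rates, rows))  # integrated hazard up to exit
    deaths = sum(d for _, _, d, _ in rows)                      # number of deaths in the family
    log_haz = sum(d * (lam + beta * w) for _, _, d, w in rows)  # log-hazards at the death ages
    return cum_v, cum_x, deaths, log_haz


def loglik_updated(rows, beta, lam, theta):
    """Updated frailty log-likelihood, equation (12)."""
    cum_v, cum_x, deaths, log_haz = family_terms(rows, beta, lam)
    inv = 1.0 / theta
    gamma_part = sum(math.log(inv + r) for r in range(deaths))
    return (gamma_part + inv * math.log(inv + cum_v)
            - (inv + deaths) * math.log(inv + cum_x) + log_haz)


def loglik_naive(rows, beta, lam, theta):
    """Naive log-likelihood, equation (14)."""
    cum_v, cum_x, deaths, log_haz = family_terms(rows, beta, lam)
    inv = 1.0 / theta
    gamma_part = sum(math.log(inv + r) for r in range(deaths))
    return (gamma_part + inv * math.log(inv)
            - (inv + deaths) * math.log(inv + cum_x - cum_v) + log_haz)


def loglik_poisson(rows, beta, lam):
    """Poisson log-likelihood without frailty, equation (13)."""
    cum_v, cum_x, _, log_haz = family_terms(rows, beta, lam)
    return log_haz - (cum_x - cum_v)


def one_sided_lrt_pvalue(loglik_frailty, loglik_null):
    """Half the chi-square(1) tail probability of the likelihood ratio statistic."""
    lr = max(0.0, 2.0 * (loglik_frailty - loglik_null))
    return 0.5 * math.erfc(math.sqrt(lr / 2.0))
```

In practice the contributions would be summed over families and maximised numerically. When all entry ages are zero the first two functions agree, and both approach the Poisson value as $\theta$ tends to zero.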
With time-varying covariates the family-integrated hazards $\Lambda^{\tilde X}$ and $\Lambda^{V}$ become cumbersome. Let us consider one dichotomous time-varying covariate having values 0 and 1. The observation would be the transition time from 0 to 1, say at $T_{W(j)}$ in the calendar time scale, and the age at transition is denoted by $a_j = T_{W(j)} - T_{B(j)}$. Let $W_j$ represent time-fixed covariates and $W_j'$ the dichotomous time-varying covariate with corresponding regression parameters $\beta$ and $\beta'$, respectively. If transition has not occurred before the end of the observation period ($T_{W(j)} > \tau$) we let $a_j = \infty$. The family-integrated hazard $\Lambda^{\tilde X}$ becomes
\[
\Lambda^{\tilde X}(\beta, \lambda) = \sum_{j=1}^{m} e^{\beta^T W_j} \Bigg[ \int_0^{\tilde X_j \wedge a_j} \alpha_{h0}(u; \lambda)\, du + \int_{\tilde X_j \wedge a_j}^{\tilde X_j} \alpha_{h0}(u; \lambda)\, e^{\beta'}\, du \Bigg].
\]
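The integrals above amount to splitting each child's risk time at the transition age. A minimal Python sketch of this bookkeeping follows, assuming for simplicity a constant baseline hazard `gamma` in place of the general $\alpha_{h0}(u; \lambda)$; the function names and argument layout are hypothetical.

```python
import math


def integrated_hazard(age, a_trans, gamma, lin_pred_fixed, beta_tv):
    """Integrated hazard on [0, age] for one child.

    a_trans is the transition age a_j (math.inf if no transition was observed),
    lin_pred_fixed is the linear predictor of the time-fixed covariates, and
    beta_tv is the regression parameter of the dichotomous time-varying covariate.
    Assumes a constant baseline hazard gamma.
    """
    time_before = min(age, a_trans)   # risk time with the covariate equal to 0
    time_after = age - time_before    # risk time with the covariate equal to 1
    return math.exp(lin_pred_fixed) * gamma * (time_before + math.exp(beta_tv) * time_after)


def family_integrated_hazard(children, gamma, beta_tv):
    """Sum over the children in a family.

    children holds (age, a_trans, lin_pred_fixed) per child; pass exit ages to
    integrate up to exit and entry ages to integrate up to entry.
    """
    return sum(integrated_hazard(age, a, gamma, lp, beta_tv) for age, a, lp in children)
```

For example, a child followed to age 10 with a transition at age 4, `gamma = 0.1` and `beta_tv = log(2)` contributes `0.1 * (4 + 2 * 6) = 1.6`.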
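With two dichotomous time-varying covariates $W_{j1}'$ and $W_{j2}'$ with transition ages $a_{j1}$ and $a_{j2}$ and regression parameters $\beta_1'$ and $\beta_2'$, respectively, we need four integrals per child:
\[
\begin{aligned}
\Lambda^{\tilde X}(\beta, \lambda) = \sum_{j=1}^{m} e^{\beta^T W_j} \Bigg[ & \int_0^{\tilde X_j \wedge a_{j1} \wedge a_{j2}} \alpha_{h0}(u; \lambda)\, du + \int_{\tilde X_j \wedge a_{j1} \wedge a_{j2}}^{\tilde X_j \wedge a_{j2}} \alpha_{h0}(u; \lambda)\, e^{\beta_1'}\, du \\
& + \int_{\tilde X_j \wedge a_{j1} \wedge a_{j2}}^{\tilde X_j \wedge a_{j1}} \alpha_{h0}(u; \lambda)\, e^{\beta_2'}\, du + \int_{\tilde X_j \wedge (a_{j1} \vee a_{j2})}^{\tilde X_j} \alpha_{h0}(u; \lambda)\, e^{\beta_1'} e^{\beta_2'}\, du \Bigg].
\end{aligned}
\]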
Thus we need to calculate the risk time for each child in each of the possible combinations of values for the time-varying covariates. If only dichotomous time-varying covariates are allowed, we generally need to calculate $2^p$ risk times if we have $p$ time-varying covariates. The same holds for $\Lambda^{V}(\beta, \lambda)$, but here it is important to notice that we would actually need the time of transition before the child was enrolled in our study.

6. Simulations
To evaluate the updated frailty model and to compare it with the naive model we conducted a simulation study. The simulated model was
\[
\lambda_{ij}(a \mid z_i) = z_i\, \gamma\, e^{\beta W_{ij}}, \qquad i = 1, \ldots, 500, \quad j = 1, \ldots, m. \tag{15}
\]
Every simulation consisted of 500 replications of a data set of 500 groups, where the group size was either 2 or 4, i.e. $m = 2$ or $4$ in (15). For all simulations we chose the hazard $\gamma = \exp(-6)$ ($\lambda = \log \gamma = -6$) and the regression parameter $\beta = 1$. For $m = 2$ we generated 500 independent exponentially distributed random variables with hazard $\gamma$ and another 500 with hazard $2\gamma$, implying that the covariates $W_{ij} = 0$ for the first 500 observations and $W_{ij} = \log(2)$ for the last 500 observations. All the 1000 random variables and corresponding covariates were sorted randomly and in that order given group numbers $i = 1, 1, 2, 2, \ldots, 500, 500$. Then 500 independently gamma-distributed variables $z_i$, $i = 1, \ldots, 500$, were generated with mean one and variance $\theta$. The variance was varied in different simulations to be 0.5, 1, or 2. The random variables within a group $i$ were then divided by $z_i$ to obtain survival times from the model (15). We have not imposed any censoring in the simulations.

The simulations were done with four degrees of left truncation by randomly selecting 30%, 50%, 75%, or 100% of the individuals to be independently allocated an entry time, where the entry time was exponentially distributed with hazard equal to 0.001. The case with 100% is equivalent to a prevalent cohort, where each observation has an entry time greater than zero. If the entry time was larger than the corresponding survival time the observation was removed to imitate that an individual had died before entry. This implies that some pairs contribute with two observations, others with one, and some do not contribute as both have entry times later than their survival times. Every data set was analysed with the standard Poisson log-likelihood (13), our updated frailty log-likelihood (12), and the naive log-likelihood (14).
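The data-generating steps described above can be sketched in Python as follows. This is an illustration of the design, not the code used for the reported results. The function name, argument names and seed are hypothetical, and the sketch assumes that each individual is selected for left truncation independently with probability `frac_trunc`, which is one reading of "randomly selecting" a given percentage.

```python
import math
import random


def simulate_groups(n_groups=500, m=2, theta=1.0, lam=-6.0, beta=1.0,
                    frac_trunc=0.5, entry_rate=0.001, seed=1):
    """Return a list of (group, entry_age, survival_time, death, covariate)."""
    rng = random.Random(seed)
    gamma = math.exp(lam)
    n = n_groups * m
    # Half of the individuals have W = 0 and half have W = log(2), in random order.
    covariates = [0.0] * (n // 2) + [math.log(2)] * (n - n // 2)
    rng.shuffle(covariates)
    data = []
    for i in range(n_groups):
        # Gamma frailty with shape 1/theta and scale theta: mean 1, variance theta.
        z = rng.gammavariate(1.0 / theta, theta)
        for j in range(m):
            w = covariates[i * m + j]
            # Exponential time with hazard gamma * exp(beta * w), divided by z as in (15).
            x = rng.expovariate(gamma * math.exp(beta * w)) / z
            v = rng.expovariate(entry_rate) if rng.random() < frac_trunc else 0.0
            if v < x:  # individuals who die before their entry age are never observed
                data.append((i, v, x, 1, w))  # no censoring, so every record is a death
    return data
```

Restricting the output to groups with all `m` members present gives the complete-groups design.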
Furthermore, we analysed every data set using only the pairs in which both survived their potential left truncation times, implying that we only used groups with two observations included in the data set. This would be the situation in e.g. a twin study design having both twins included for all pairs. The simulations were done similarly for $m = 4$. Tables 1 to 3 present some of the
simulation results.

[Table 1 about here.]

[Table 2 about here.]

[Table 3 about here.]

The tables display several statistics. The sample mean ($SM$) is the average of the 500 point estimates and the bias is the $SM$ minus the true value. The mean estimated variance ($MEV$) is the average of the 500 estimated variances of the corresponding point estimate. The sample variance ($SV$) is the variance of the 500 point estimates. The total coverage percentage is the percentage of times the null hypothesis $\theta = 0.5$ (or 1 or 2), $\beta = 1$, or $\lambda = -6$ is accepted at the 5% level. When e.g. 30% of the observations were selected to be left-truncated, some observations are removed and we report the actual sizes of the data sets analysed by showing the median data size in each simulation. The median data size increases with increasing frailty variance, i.e. with increasing heterogeneity. This is because the mass of the gamma distribution moves towards zero as the variance increases, implying relatively more $z < 1$ and hence longer survival times. We also report the median percentage actually having a delayed entry in the 500 data sets, i.e. how many of the observations analysed had an entry time greater than zero.

The simulation study shows at least three important results which depend on the structure of the data analysed. Firstly, when analysing left-truncated shared frailty data with a mixture of group sizes the updated frailty model is not good for the estimation of the frailty variance compared with the so-called naive model. Both methods seem to estimate the regression parameter correctly, but not the baseline hazard. The updated frailty model consistently underestimates the frailty variance and overestimates the baseline hazard,
while the naive model underestimates the baseline hazard. It is worth noting that even if the baseline hazard is underestimated or overestimated, the relative difference between the two groups, i.e. the regression parameter, is well estimated. Secondly, when the data only consist of complete groups the updated frailty model works well for all the parameters. The naive model performs well for the estimation of the regression parameter in moderately left-truncated data but consistently overestimates the frailty variance and underestimates the baseline hazard. Thirdly, neither the updated frailty nor the naive model is generally suitable for analysing a prevalent cohort where all individuals have delayed entry, though the updated frailty model can be used if interest is in the regression parameter.

The assumption that the probability of enrollment $p(z)$ is independent of $z$ is an oversimplification and the proposed approximated likelihood function does not perform well in general. In the simulations the left truncation times, say $V$, were exponentially distributed with hazard, say $\alpha$. The probability of enrollment in the data set with 50% left truncation is $\tfrac{1}{2}$, the probability of not being left-truncated (and therefore included), plus $\tfrac{1}{2}$ times the probability that the survival time, say $X$, given the frailty $z$, is greater than $V$, i.e.
\[
p(z) = \frac{1}{2} + \frac{1}{2}\, P(X > V \mid z) = \frac{1}{2} + \frac{1}{2}\, \frac{\alpha}{\alpha + z \gamma e^{\beta W}},
\]
clearly depending on $z$. Even in a simple situation where the left truncation time is assumed to be a fixed time $c$ the probability of entering is dependent on $z$ as we have
\[
p(z) = \frac{1}{2} + \frac{1}{2}\, P(X > c \mid z) = \frac{1}{2} + \frac{1}{2}\, e^{-z \gamma e^{\beta W} c}.
\]
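Both expressions for $p(z)$ are easy to evaluate numerically, which makes the dependence on $z$ visible. In the Python sketch below the function names are hypothetical, and the argument `frac` generalises the factor $\tfrac{1}{2}$ to an arbitrary proportion selected for left truncation; this generalisation is an assumption of the sketch rather than something stated above.

```python
import math


def p_enroll_exponential_entry(z, alpha, gamma, beta, w, frac=0.5):
    """Enrollment probability when the entry time is exponential with hazard alpha."""
    hazard = z * gamma * math.exp(beta * w)   # death hazard given the frailty z
    return (1.0 - frac) + frac * alpha / (alpha + hazard)


def p_enroll_fixed_entry(z, c, gamma, beta, w, frac=0.5):
    """Enrollment probability when the entry time is a fixed age c."""
    hazard = z * gamma * math.exp(beta * w)
    return (1.0 - frac) + frac * math.exp(-hazard * c)
```

Evaluating either function over a grid of $z$ values with the simulation settings ($\alpha = 0.001$, $\gamma = e^{-6}$, $\beta = 1$) shows that the enrollment probability decreases as the frailty increases.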
Surprisingly, the naive model, which does not modify the conditional frailty distribution, performs very well for the estimation of the frailty variance and the regression parameter. In the analysis of complete groups the probability of enrollment is one, as by design we have included only the groups in which all members survived their entry time, and for this design the updated frailty model works fine.

7. Analysis of the Guinean Data
In the study period, 20 villages were followed and 1,657 children were identified to be included in the analysis. Fifty-seven (3.4%) of the 1,657 children died during follow-up. A total of 51% (847/1,657) were classified as vaccinated with DTP at the initial examination. The 1,657 children belonged to 562 different compounds of which 36% (201/562) had one child only. Age is split into age bands of two months. Table 4 reports the results of analysing the data using several models.

[Table 4 about here.]

We have used the updated frailty model, the naive model, and a marginal Clayton-Oakes model (1) where the marginals are assumed to follow a Poisson regression, estimated with the two-stage technique of Andersen (2004). Furthermore we used a Poisson regression model assuming independence between life times and a Cox proportional hazards model stratified on compound. In our example the stratified Cox model uses only 17% (257/1,657) of the children. The frailty parameters are insignificant and the updated frailty model gives a lower frailty variance than the naive model, as expected from the simulation studies. The estimated frailty parameter from the Clayton-Oakes model is closer to that of the updated frailty model than to that of the naive model. The regression parameter estimates are all close to that of the standard Poisson regression model assuming independence, whereas the results from the Cox model are fairly different from those of the frailty models.

It should be mentioned that we could have analysed these data using time since initial examination as the time scale and controlling for age at initial visit as a covariate. Doing so and using the naive model, the mortality rate ratio becomes 2.04 (95%-CI: 1.12–3.70) and $\hat\theta = 0.49$ ($p = 0.16$), not far from the result in Table 4. Epidemiologically, it is surprising and worrying that DTP-vaccinated children have higher mortality than DTP-unvaccinated children. We would expect it to be the other way round, see Aaby et al. (2004) for further discussion of this issue.
As the data are observational, there is always the possibility of selection bias and uncontrolled confounding. As discussed earlier, different compounds will probably have different vaccination coverage as well as different mortality levels, and that was the reason for choosing the conditional approach. However, the frailty parameter is small and we conclude that compound does not seem to confound our analysis.
8.
Discussion
Most critically, in the derivation of the updated frailty model we have assumed that the probability of enrollment of a child into the study, $p_l(z)$, is independent of $z$. However, the probability of entering the study depends on the risk of dying, which in turn depends on the frailty level, $z$, of the family. Thus, $p_l(z)$ is likely to carry information about the frailty distribution, and assuming that $p_l(z)$ is independent of $z$ is too simple. It therefore seems necessary to model the probability of enrollment in the likelihood function.

The birth probability, $\pi_l^b(z)$, is assumed to be independent of $z$, i.e. the probability that a child is born into a family is independent of the shared frailty of that family. This assumption might be violated if high-risk families try to replace previously dead children with new ones.

We have also assumed that the times at which new individuals enter the study, $\nu_l$, are deterministic. This is not necessarily a severe restriction, as the entry of new individuals into a study is often predetermined at specific dates by a study protocol, or occurs only once when studying, e.g., a prevalent cohort. If this is not the case, one must further consider this randomness in the model.

The updated frailty likelihood is not limited to the assumption of a shared frailty model or the choice of the gamma distribution. The problem originates from the sampling, and any expanded frailty model, e.g. correlated frailty models (see for example Petersen, 1998), or any other choice of frailty distribution will not make things easier.

The frailty models do not seem to be appropriate when analysing multivariate survival data in which all survival times are left-truncated. Alternative models should be used and contrasted with the results from the frailty analysis.

In conclusion, the updated frailty model should be used in a study which by design enrols only complete groups. The naive model can be used in situations where not all groups are complete, at least for the estimation of frailty and regression parameters. The handling of time-varying covariates is in general intractable for the updated frailty model, while it is standard in the naive model. It seems relevant to gather practical experience with alternative models for left-truncated multivariate survival data that quantify the interdependence, for example the copula models for multivariate survival data. Attention should be given to the difference in interpretation of regression parameters and to the handling of time-varying covariates.

Acknowledgements
This research was supported by grants from the Danish International Development Assistance (90803 and 91135) and by the Danish National Research Foundation as part of a grant to the Danish Epidemiology Science Centre. The authors thank Elisabeth Wreford Andersen for letting us use her program for estimation in the Clayton-Oakes model and Liselotte Pedersen for analysing the Guinean data using this model.
References
Aaby, P., Jensen, H., Gomes, J., Fernandes, M. and Lisse, I. M. (2004). The introduction of diphtheria-tetanus-pertussis vaccine and child mortality in rural Guinea-Bissau: An observational study. International Journal of Epidemiology 33, 374–380.
Andersen, E. W. (2004). Two-stage estimation in copula models used in family studies. Lifetime Data Analysis (in press).
Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
Andersen, P. K., Klein, J. P., Knudsen, K. M. and Palacios, R. T. (1997). Estimation of variance in Cox’s regression model with shared gamma frailties. Biometrics 53, 1475–84.
Arjas, E. (1998). Real time approach in survival analysis. In Armitage, P. and Colton, T., editors, Encyclopedia of Biostatistics, volume 5, pages 3737–38. John Wiley and Sons, Chichester.
Clayton, D. G. (1978). A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 65, 141–51.
Glidden, D. V. (2000). A two-stage estimator of the dependence parameter for the Clayton-Oakes model. Lifetime Data Analysis 6, 141–56.
Hougaard, P. (2000). Analysis of Multivariate Survival Data. Statistics for Biology and Health. Springer-Verlag, New York.
Klein, J. P. (1992). Semiparametric estimation of random effects using the Cox model based on the EM algorithm. Biometrics 48, 795–806.
Liang, K.-Y. and Self, S. G. (1996). On the asymptotic behavior of the pseudo likelihood ratio test statistic. Journal of the Royal Statistical Society B 58, 785–96.
Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22.
Nielsen, G. G., Gill, R. D., Andersen, P. K. and Sørensen, T. I. A. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scandinavian Journal of Statistics 19, 25–43.
Petersen, J. H. (1998). An additive frailty model for correlated survival times. Biometrics 54, 646–61.
Pickles, A. (1998). Generalized estimating equations. In Armitage, P. and Colton, T., editors, Encyclopedia of Biostatistics, volume 2, pages 1626–37. John Wiley and Sons, Chichester.
Sufian, A. J. M. and Johnson, N. E. (1989). Son preferences and child survival in Bangladesh: a new look at the child survival hypothesis. Journal of Biosocial Science 21, 207–16.
Wang, M.-C., Brookmeyer, R. and Jewell, N. P. (1993). Statistical models for prevalent cohort data. Biometrics 49, 1–11.
Table 1
Summary of 500 simulations with 500 groups of size 2 and 50% left-truncated.

Median data size: 647. Median left-truncated: 23%

| Parameter | Model | SM | Bias | MEV | SV | Coverage (%) |
|---|---|---|---|---|---|---|
| θ = 0.5 | Naive | 0.5173 | 0.0173 | 0.0039 | 0.0040 | 94 |
| | Updated frailty | 0.4665 | -0.0335 | 0.0026 | 0.0029 | 88 |
| β = 1 | Poisson | 0.9691 | -0.0309 | 0.0129 | 0.1027 | 57 |
| | Naive | 1.0111 | 0.0111 | 0.0229 | 0.0237 | 94 |
| | Updated frailty | 1.0028 | 0.0028 | 0.0237 | 0.0242 | 94 |
| λ = −6 | Poisson | -6.8083 | -0.8083 | 0.0029 | 0.0290 | 0 |
| | Naive | -6.0655 | -0.0656 | 0.0075 | 0.0082 | 87 |
| | Updated frailty | -5.9405 | 0.0595 | 0.0085 | 0.0092 | 91 |

Median data size: 729. Median left-truncated: 31%

| Parameter | Model | SM | Bias | MEV | SV | Coverage (%) |
|---|---|---|---|---|---|---|
| θ = 2 | Naive | 2.0486 | 0.0486 | 0.0184 | 0.0173 | 95 |
| | Updated frailty | 1.7339 | -0.2661 | 0.0108 | 0.0111 | 29 |
| β = 1 | Poisson | 1.1697 | 0.1697 | 0.0115 | 9.5798 | 7 |
| | Naive | 1.0156 | 0.0156 | 0.0299 | 0.0314 | 94 |
| | Updated frailty | 1.0192 | 0.0192 | 0.0306 | 0.0322 | 95 |
| λ = −6 | Poisson | -13.3268 | -7.3268 | 0.0026 | 4.7860 | 0 |
| | Naive | -6.1862 | -0.1862 | 0.0143 | 0.0170 | 65 |
| | Updated frailty | -5.7094 | 0.2906 | 0.0180 | 0.0188 | 42 |

SM: sample mean. MEV: mean estimated variance. SV: sample variance. Coverage: coverage probabilities in percent for 95% confidence intervals.
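The summary measures defined in the table notes can be computed from per-replicate simulation output along the following lines. This is only a sketch: the function name `summarize`, the use of Wald-type intervals (estimate ± 1.96 × estimated standard error) for the coverage, and the n − 1 divisor in the sample variance are assumptions, not details stated in the paper.

```python
from statistics import mean, variance


def summarize(estimates, est_variances, truth, z=1.96):
    """Summarise simulation replicates for one parameter.

    estimates      -- point estimate from each replicate
    est_variances  -- model-based variance estimate from each replicate
    truth          -- true parameter value used to generate the data
    z              -- normal quantile for the interval (assumed Wald-type)
    """
    sm = mean(estimates)          # SM: sample mean of the estimates
    bias = sm - truth             # Bias: SM minus the true value
    mev = mean(est_variances)     # MEV: mean estimated variance
    sv = variance(estimates)      # SV: sample variance (n - 1 divisor, assumed)
    # Count replicates whose assumed Wald interval contains the true value.
    covered = sum(
        1
        for est, var in zip(estimates, est_variances)
        if abs(est - truth) <= z * var ** 0.5
    )
    coverage = 100.0 * covered / len(estimates)
    return {"SM": sm, "Bias": bias, "MEV": mev, "SV": sv, "Coverage": coverage}


# Toy input with three made-up replicates (not data from the paper).
result = summarize([0.4, 0.5, 0.6], [0.01, 0.01, 0.01], truth=0.5)
```

Comparing MEV with SV indicates whether the model-based variance estimate tracks the actual sampling variability of the estimator; a MEV well below SV would be expected to go together with coverage below the nominal 95%.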
Table 2
Summary of 500 simulations with 500 groups of size 4 and 100% left-truncated.

Median data size: 594

| Parameter | Model | SM | Bias | MEV | SV | Coverage (%) |
|---|---|---|---|---|---|---|
| θ = 0.5 | Naive | 0.5002 | 0.0002 | 0.0043 | 0.0042 | 96 |
| | Updated frailty | 0.3687 | -0.1313 | 0.0013 | 0.0014 | 6 |
| β = 1 | Poisson | 0.7911 | -0.2089 | 0.0147 | 0.1083 | 50 |
| | Naive | 0.9176 | -0.0824 | 0.0240 | 0.0248 | 91 |
| | Updated frailty | 1.0111 | 0.0111 | 0.0292 | 0.0304 | 95 |
| λ = −6 | Poisson | -7.1117 | -1.1117 | 0.0028 | 0.0371 | 0 |
| | Naive | -6.2907 | -0.2907 | 0.0078 | 0.0081 | 10 |
| | Updated frailty | -5.6897 | 0.3103 | 0.0185 | 0.0194 | 38 |

Median data size: 917

| Parameter | Model | SM | Bias | MEV | SV | Coverage (%) |
|---|---|---|---|---|---|---|
| θ = 2 | Naive | 1.8995 | -0.1005 | 0.0189 | 0.0200 | 87 |
| | Updated frailty | 1.1511 | -0.8489 | 0.0036 | 0.0051 | 0 |
| β = 1 | Poisson | 0.5681 | -0.4319 | 0.0092 | 4.3171 | 8 |
| | Naive | 0.9076 | -0.0924 | 0.0167 | 0.0185 | 87 |
| | Updated frailty | 1.0096 | 0.0096 | 0.0183 | 0.0192 | 94 |
| λ = −6 | Poisson | -13.9092 | -7.9092 | 0.0020 | 4.6795 | 0 |
| | Naive | -6.7856 | -0.7856 | 0.0136 | 0.0148 | 0 |
| | Updated frailty | -4.4080 | 1.5920 | 0.1348 | 0.0901 | 0 |

SM: sample mean. MEV: mean estimated variance. SV: sample variance. Coverage: coverage probabilities in percent for 95% confidence intervals.
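Table 2 concerns the setting where every survival time is left-truncated, so every individual has been selected on having survived to the entry age. The sketch below illustrates how such selection shifts the frailty distribution. It is not the simulation design of the paper: it assumes a single individual per frailty draw, a gamma frailty with mean 1 and variance θ, and a hypothetical baseline cumulative hazard `cum_hazard` accumulated up to entry. Under these assumptions the standard gamma-frailty result is E[Z | survived] = 1/(1 + θA), which the Monte Carlo estimate should approximate.

```python
import math
import random


def mean_frailty_of_survivors(theta, cum_hazard, n=100_000, seed=1):
    """Monte Carlo mean of the frailty among individuals surviving to entry.

    theta       -- frailty variance (gamma frailty with mean 1, assumed)
    cum_hazard  -- baseline cumulative hazard up to the entry age (hypothetical)
    """
    rng = random.Random(seed)
    total, survivors = 0.0, 0
    for _ in range(n):
        # gammavariate(shape, scale): shape 1/theta and scale theta give
        # mean 1 and variance theta.
        z = rng.gammavariate(1.0 / theta, theta)
        # Given frailty z, the probability of surviving to entry is exp(-z * A).
        if rng.random() < math.exp(-z * cum_hazard):
            total += z
            survivors += 1
    return total / survivors


def theoretical_mean(theta, cum_hazard):
    """Closed form for gamma frailty: E[Z | survived] = 1 / (1 + theta * A)."""
    return 1.0 / (1.0 + theta * cum_hazard)


# Illustrative values only; they are not taken from the paper.
simulated = mean_frailty_of_survivors(theta=2.0, cum_hazard=0.5)
expected = theoretical_mean(theta=2.0, cum_hazard=0.5)
```

The mean frailty among survivors falls below 1, which is one way to see why individuals with left-truncated life times carry information about the frailty of their group.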
Table 3
Summary of 500 simulations with 500 groups of size 4 and 50% left-truncated. Only complete groups are included.

Median data size: 384. Median left-truncated: 26%

| Parameter | Model | SM | Bias | MEV | SV | Coverage (%) |
|---|---|---|---|---|---|---|
| θ = 0.5 | Naive | 0.6441 | 0.1441 | 0.0126 | 0.0110 | 82 |
| | Updated frailty | 0.4898 | -0.0102 | 0.0052 | 0.0047 | 95 |
| β = 1 | Poisson | 0.9498 | -0.0502 | 0.0218 | 0.1495 | 59 |
| | Naive | 0.9918 | -0.0082 | 0.0314 | 0.0329 | 93 |
| | Updated frailty | 0.9908 | -0.0092 | 0.0308 | 0.0314 | 94 |
| λ = −6 | Poisson | -7.2270 | -1.2270 | 0.0050 | 0.0658 | 0 |
| | Naive | -6.3535 | -0.3535 | 0.0156 | 0.0155 | 20 |
| | Updated frailty | -6.0091 | -0.0091 | 0.0225 | 0.0218 | 96 |

Median data size: 732. Median left-truncated: 39%

| Parameter | Model | SM | Bias | MEV | SV | Coverage (%) |
|---|---|---|---|---|---|---|
| θ = 2 | Naive | 2.5861 | 0.5861 | 0.0570 | 0.0467 | 28 |
| | Updated frailty | 1.9877 | -0.0123 | 0.0241 | 0.0234 | 94 |
| β = 1 | Poisson | 0.8633 | -0.1367 | 0.0115 | 3.3363 | 15 |
| | Naive | 0.9948 | -0.0052 | 0.0183 | 0.0188 | 95 |
| | Updated frailty | 1.0021 | 0.0021 | 0.0178 | 0.0184 | 94 |
| λ = −6 | Poisson | -14.3010 | -8.3010 | 0.0027 | 4.6017 | 0 |
| | Naive | -7.2532 | -1.2532 | 0.0236 | 0.0339 | 0 |
| | Updated frailty | -6.0173 | -0.0173 | 0.0679 | 0.0627 | 95 |

SM: sample mean. MEV: mean estimated variance. SV: sample variance. Coverage: coverage probabilities in percent for 95% confidence intervals.
Table 4
Comparison of the naive and updated frailty model, the Clayton-Oakes model, and models assuming independence.

| Model | $\hat{\beta}$ | SE($\hat{\beta}$) | MR (95% CI) | $\hat{\theta}$ | SE($\hat{\theta}$) | −2 log Q | p-value |
|---|---|---|---|---|---|---|---|
| Naive | 0.752 | 0.297 | 2.12 (1.19–3.80) | 0.503 | 0.617 | 0.96 | 0.16 |
| Updated frailty | 0.764 | 0.300 | 2.15 (1.19–3.87) | 0.382 | 0.530 | 0.71 | 0.20 |
| Clayton-Oakes | 0.750 | 0.311 | 2.12 (1.15–3.90) | 0.331 | 0.643 | | |
| Poisson (independence) | 0.744 | 0.294 | 2.10 (1.18–3.74) | | | | |
| Stratified Cox | 0.475 | 0.410 | 1.61 (0.72–3.59) | | | | |

SE: standard error. MR: mortality rate ratio for DTP-vaccinated versus DTP-unvaccinated children. CI: confidence interval. −2 log Q: likelihood ratio test statistic for θ = 0 with one-sided p-value.
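The derived quantities in Table 4 can be checked from the estimates with a few lines of arithmetic. The sketch below assumes a Wald interval on the log scale, MR = exp(β̂) with limits exp(β̂ ± 1.96 SE(β̂)), and assumes that the one-sided p-value is half the upper tail probability of a χ² distribution with one degree of freedom, a common convention when testing a variance parameter on the boundary θ = 0. Neither formula is spelled out in this section, and the function names are illustrative.

```python
import math


def rate_ratio_ci(beta, se, z=1.96):
    """Rate ratio exp(beta) with an assumed Wald interval on the log scale."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


def one_sided_lr_pvalue(stat):
    """Half the chi-square(1) upper tail probability of the statistic.

    For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    The halving is an assumed boundary convention for testing theta = 0.
    """
    return 0.5 * math.erfc(math.sqrt(stat / 2.0))


# Naive-model values as read from Table 4: beta-hat 0.752, SE 0.297.
mr, lower, upper = rate_ratio_ci(0.752, 0.297)
p_naive = one_sided_lr_pvalue(0.96)    # -2 log Q for the naive model
p_updated = one_sided_lr_pvalue(0.71)  # -2 log Q for the updated frailty model
```

Under these assumptions the naive-model values round to 2.12 (1.19–3.80) with p = 0.16, and the updated frailty statistic gives p = 0.20, in agreement with the table.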