Partial Likelihood Estimation of a Cox Model with Two Nested Random Effects: an EM Algorithm based on Penalized Likelihood
Guillaume Horny∗
February 2004
Abstract

A general approach to estimating Mixed Proportional Hazard models with two nested random effects using the EM algorithm within a partial likelihood framework is presented. The mixing distributions are assumed to have known Laplace transforms, and we propose to use the penalized partial likelihood inside the EM algorithm in the case of two gamma frailties. We furthermore provide a Monte Carlo study to assess the behaviour of the estimator in realistically sized samples. The results indicate a substantial decrease in computing time.

Keywords: EM algorithm, penalized likelihood, partial likelihood, frailties, duration analysis

JEL Classification: C13, C14, C41.
1 Introduction
In this paper, I propose a general framework to estimate Mixed Proportional Hazard models with a fixed number of risks where unobserved heterogeneity is located at two different levels. Models of this kind describe a population divided into clusters and sub-clusters, and allow unobserved characteristics to be handled precisely. Many econometric works in duration analysis take unobserved heterogeneity into account. Indeed, it is not reasonable to assume that we observe all the determinants leading to a transition, and furthermore that the recorded explanatory variables are free of measurement errors. Studies handling a single random effect are nowadays widespread. However, while it is common in applications to let the observed heterogeneity reside at different levels, by allowing some covariates to be defined at the individual level and others at a more aggregated level (for example dummies indicating groups of observations, or a time trend with some values shared among different observations), unobserved heterogeneity is in general specified at a single level only. Ignoring some of the unobserved heterogeneity can lead to substantial biases (see Moulton, 1986, for some case studies in linear models), but modelling it at different levels is not an easy task and raises some awkward problems during inference. Indeed, it involves further multidimensional integrals which typically do not admit analytical solutions.

A few works in the biometric and demographic literature handle two levels of clustering. Manda and Meyer (2002) consider a model in discrete time, while Yau (2001) and Sastry (1997) study two nested random effects with, respectively, log-normal and gamma mixing distributions. In the latter study, inference is based on the Expectation Maximization algorithm (hereafter referred to as the EM algorithm; see Dempster et al., 1977, for the first general formulation). The EM algorithm is ideally suited to mixture models because of their missing data structure, and is used in numerous works involving the Cox model with one frailty, such as Clayton and Cuzick (1985), Gill (1985), Lancaster (1990, p.264-268) and Parner (1997). However, this theoretical attractiveness is balanced by many numerical drawbacks: convergence of the EM algorithm is usually slow¹, sometimes fails², is sensitive to the choice of the starting values³, and the computation is time consuming⁴. These drawbacks led Therneau et al. (2003) to investigate inference in the one frailty setting using penalized partial likelihood and to obtain an estimator equivalent to the one provided by the EM algorithm, without all these numerical problems.

In this paper, I present a general framework for inference using the EM algorithm in a Mixed Proportional Hazard models setting (hereafter referred to as MPH models; see Van den Berg, 2001, for a survey) involving two nested random effects, as described in section 2. This method, presented in section 3, has the advantage of transforming the estimation of a single complicated model into inference in two MPH models with a single frailty, which are easily manageable. I describe in section 4 how to make use of the penalized partial likelihood in the setting of two gamma mixing distributions. That is, we stay in the theoretically convenient framework of the EM algorithm while taking advantage of the numerical stability and speed of the penalized likelihood. Finally, this method is compared in section 5 to the speeded-up EM algorithm described in Sastry (1997) using a small Monte Carlo study.

∗ BETA-Theme, University Louis Pasteur (Strasbourg I), 61 avenue de la Forêt Noire, 67085 Strasbourg Cedex, France. [email protected]

¹ Dempster et al. (1977) show that the convergence rate is linear and depends on the amount of information in the data. Incompleteness of the data is usually large in the MPH setting, and convergence is typically slower than the usual quadratic rate.

² There are two different cases of non-convergence in the literature. Bolstad and Manda (2001) point out one case where the variance of the random effect becomes large enough to raise numerical issues, and Lancaster (1990, p.267) describes the case of a bounded likelihood with an unobserved heterogeneity variance tending to zero.

³ In the case of a likelihood unbounded on the edge of the parameter space, Ng et al. (2004) emphasize the importance of the starting values when they are set too close to the boundary.

⁴ The E step may have no analytical solution, and must then be evaluated using Monte Carlo methods. This Monte Carlo EM (MCEM) algorithm is described in Wei and Tanner (1990), but simulations dramatically increase the computational cost and furthermore introduce a Monte Carlo error. Even without simulations, the EM algorithm requires a large number of iterations even in far simpler models.
2 Mixed Proportional Hazard model with two nested random effects
Consider a nested frailty model such as the one described in Bolstad and Manda (2001), Sastry (1997) and Yau (2001). It belongs to the more general family of MPH models, with the particularity that the unobserved heterogeneity term is here written as the product of two nested random effects, which makes it possible to handle multilevel clustering. In the studies cited above, the population is divided into clusters, each cluster is divided into subclusters, and several individuals belong to the same subcluster. The underlying idea is that data are correlated in some way, and a realization of a random effect is common to all observations in the same cluster or subcluster. Stratification levels are nested because each subcluster belongs to only one cluster; this restriction can be relaxed by considering two non-nested random effects. Let $i$ ($i = 1, \dots, I$) be the cluster index, $j$ ($j = 1, \dots, J_i$) the subcluster index and $k$ ($k = 1, \dots, K_{ij}$) the individual index. The hazard function is written as:

$$\lambda_{ijk}(t) = v_i \, w_{ij} \, \lambda_0(t) \, \lambda_1(X_{ijk}(t)), \tag{1}$$

where $v_i$ is the cluster specific random effect and $w_{ij}$ the subcluster specific random effect, whose distributions are respectively denoted $h_v(v_i; \alpha)$ and $h_w(w_{ij}; \eta)$. We assume that $h_v$ and $h_w$ have known Laplace transforms. Large variances of the random effects mean a tighter positive association among units of the same group and greater differences between groups. Notice that
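$w_{ij}$ is an individual effect when $K_{ij} = 1$, $\forall (i, j)$, so the subcluster random effect does not necessarily need to be shared among several spells. Conditionally on both random effects and the explanatory variables, the observations are independent. If we suppose the $v_i$ and $w_{ij}$ to be independent, we obtain the likelihood by taking the product of the random effects densities with the conditional likelihood function, which is the conditional hazard times the conditional survivor function:

$$
\begin{aligned}
L(\beta, \alpha, \eta, v, w) = \prod_{i=1}^{I} v_i^{\sum_{j=1}^{J_i} \sum_{k=1}^{K_{ij}} \delta_{ijk}} \prod_{j=1}^{J_i} w_{ij}^{\sum_{k=1}^{K_{ij}} \delta_{ijk}} \prod_{k=1}^{K_{ij}} & \left[\lambda_0(t_{ijk}) \lambda_1(X_{ijk}(t_{ijk}))\right]^{\delta_{ijk}} \\
& \times \exp\left(-\int_0^{t_{ijk}} v_i w_{ij} \lambda_0(u) \lambda_1(X_{ijk}(u)) \, du\right) h_v(v_i; \alpha) \, h_w(w_{ij}; \eta).
\end{aligned}
\tag{2}
$$

Here $\delta_{ijk}$ is the non-censoring indicator. Let us denote by $l_{ij} = \sum_{k=1}^{K_{ij}} \delta_{ijk}$ the number of transitions in subcluster $j$ of cluster $i$, and by $m_i = \sum_{j=1}^{J_i} \sum_{k=1}^{K_{ij}} \delta_{ijk}$ the number of transitions in cluster $i$. The log-likelihood is written:

$$
\begin{aligned}
\ln L(\beta, \alpha, \eta, v, w) = {} & \sum_{i=1}^{I} \left(m_i \ln v_i + J_i K_{ij} \ln h_v(v_i; \alpha)\right) \\
& + \sum_{i=1}^{I} \sum_{j=1}^{J_i} \left(l_{ij} \ln w_{ij} + K_{ij} \ln h_w(w_{ij}; \eta)\right) \\
& + \sum_{i=1}^{I} \sum_{j=1}^{J_i} \sum_{k=1}^{K_{ij}} \left(\delta_{ijk} \ln\left[\lambda_0(t_{ijk}) \lambda_1(X_{ijk}(t_{ijk}))\right] - v_i w_{ij} \int_0^{t_{ijk}} \lambda_0(u) \lambda_1(X_{ijk}(u)) \, du\right).
\end{aligned}
\tag{3}
$$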
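As an illustration of how the counts $l_{ij}$ and $m_i$ enter the log-likelihood (3), the following minimal sketch evaluates the terms of (3) that involve the frailties directly, namely $m_i \ln v_i$, $l_{ij} \ln w_{ij}$ and the integrated hazard term. It is only a sketch under stated assumptions: the function name `frailty_loglik_terms`, the record layout and the idea of passing the integrated hazard $\int_0^{t_{ijk}} \lambda_0(u)\lambda_1(X_{ijk}(u))du$ as a precomputed number are hypothetical choices made for the example, and the density terms $\ln h_v$ and $\ln h_w$ are deliberately left out.

```python
import math
from collections import defaultdict


def frailty_loglik_terms(records, v, w):
    """Frailty-dependent terms of the log-likelihood (3), densities excluded.

    records: iterable of (i, j, delta, cum_hazard), one entry per individual k,
             where delta is the non-censoring indicator and cum_hazard is the
             integrated hazard of that spell (assumed precomputed by the caller).
    v: dict mapping cluster i to v_i.
    w: dict mapping (i, j) to w_ij.
    """
    l = defaultdict(int)  # l[(i, j)]: transitions in subcluster j of cluster i
    m = defaultdict(int)  # m[i]: transitions in cluster i
    exposure = 0.0        # sum over spells of v_i * w_ij * integrated hazard
    for i, j, delta, cum_hazard in records:
        l[(i, j)] += delta
        m[i] += delta
        exposure += v[i] * w[(i, j)] * cum_hazard
    value = sum(m_i * math.log(v[i]) for i, m_i in m.items())
    value += sum(l_ij * math.log(w[ij]) for ij, l_ij in l.items())
    return value - exposure, dict(l), dict(m)


# Toy data: one cluster, two subclusters, three spells (one censored).
records = [(0, 0, 1, 0.5), (0, 0, 0, 1.0), (0, 1, 1, 2.0)]
v = {0: 2.0}
w = {(0, 0): 1.0, (0, 1): 0.5}
value, l, m = frailty_loglik_terms(records, v, w)
```

The counts are accumulated in a single pass over the spells, which mirrors the way $l_{ij}$ and $m_i$ are defined as sums of the $\delta_{ijk}$.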
3 Inference using the EM algorithm
The structure of the MPH model makes it ideally suited to the EM algorithm. Since the $v_i$ and $w_{ij}$ are unobserved, the likelihood cannot be calculated but, as Lancaster (1990, p.197) notices, it can be evaluated. Durations specific to cluster $i$ are denoted by $T_i$, meaning $T_i = (t_{i1}, \dots, t_{iJ_i})$. Non-censoring indicators are denoted in a similar way, that is $d_i = (\delta_{i1}, \dots, \delta_{iJ_i})$. If we assume both random effects to be mutually independent, considering a likelihood where $\alpha$ and $\eta$ are profiled out leads us to the part of the log-likelihood which depends only on $\beta$, and gives us the E step at iteration $(q)$:

$$
\begin{aligned}
Q(\beta, \beta^{(q)}, \alpha^{(q)}, \eta^{(q)}) = \sum_{i=1}^{I} \sum_{j=1}^{J_i} \sum_{k=1}^{K_{ij}} \Bigg( & \delta_{ijk} \ln\left[\lambda_0(t_{ijk}) \lambda_1(X_{ijk}(t_{ijk}))\right] \\
& - \mathrm{E}_{\beta^{(q)}, \alpha^{(q)}, \eta^{(q)}}[v_i \mid T_i, d_i] \; \mathrm{E}_{\beta^{(q)}, \alpha^{(q)}, \eta^{(q)}}[w_{ij} \mid T_{ij}, d_{ij}] \int_0^{t_{ijk}} \lambda_0(u) \lambda_1(X_{ijk}(u)) \, du \Bigg).
\end{aligned}
\tag{4}
$$

We thus need to evaluate $\mathrm{E}_{\beta^{(q)}, \alpha^{(q)}, \eta^{(q)}}[v_i \mid T_i, d_i]$ and $\mathrm{E}_{\beta^{(q)}, \alpha^{(q)}, \eta^{(q)}}[w_{ij} \mid T_{ij}, d_{ij}]$. Using an approach similar to the one presented in Parner (1997), we show in appendix (A) the following results:

$$
\mathrm{E}_{\beta^{(q)}, \alpha^{(q)}, \eta^{(q)}}[w_{ij} \mid T_{ij}, d_{ij}] = \frac{\mathrm{E}_w\left[w_{ij}^{1 + l_{ij}} \, \mathcal{L}_v^{(l_{ij})}\left(w_{ij} \sum_{k=1}^{K_{ij}} \int_0^{t_{ijk}} \lambda_0(u) \lambda_1(X_{ijk}(u)) \, du\right)\right]}{\mathrm{E}_w\left[w_{ij}^{l_{ij}} \, \mathcal{L}_v^{(l_{ij})}\left(w_{ij} \sum_{k=1}^{K_{ij}} \int_0^{t_{ijk}} \lambda_0(u) \lambda_1(X_{ijk}(u)) \, du\right)\right]},
\tag{5}
$$

$$
\mathrm{E}_{\beta^{(q)}, \alpha^{(q)}, \eta^{(q)}}[v_i \mid T_i, d_i] = \frac{\mathrm{E}_v\left[v_i^{1 + m_i} \prod_{j=1}^{J_i} \mathcal{L}_w^{(l_{ij})}\left(v_i \sum_{k=1}^{K_{ij}} \int_0^{t_{ijk}} \lambda_0(u) \lambda_1(X_{ijk}(u)) \, du\right)\right]}{\mathrm{E}_v\left[v_i^{m_i} \prod_{j=1}^{J_i} \mathcal{L}_w^{(l_{ij})}\left(v_i \sum_{k=1}^{K_{ij}} \int_0^{t_{ijk}} \lambda_0(u) \lambda_1(X_{ijk}(u)) \, du\right)\right]}.
\tag{6}
$$

When $v_i$ and $w_{ij}$ are gamma distributed, these two expectations do not have analytical solutions and the E step requires numerical integration, as in Sastry (1997).

The problem of the E step can be seen in another way. If the expectation of one of the frailties is treated as a fixed effect, the function $Q(\beta, \beta^{(q)}, \alpha^{(q)}, \eta^{(q)})$ is the E step of a model with only one frailty. I suggest conducting the inference iteratively, alternating between two EM algorithms: during the first EM algorithm, we consider a model with a first random effect where the expectation of the second effect is treated as an offset. During the second EM algorithm, we consider a model where the second random effect is the only source of unobserved heterogeneity, the first effect being this time treated as an offset equal to the estimates obtained in the first EM algorithm. The strength of this approach is that it transforms a rather complicated problem of inference in a model with several random effects into two sub-problems, each requiring the estimation of a model with only one random effect using the EM algorithm, which is a well-known problem. In the next section, we present the approach used to estimate the two sub-models.
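The alternating scheme can be illustrated on a deliberately simplified problem. The sketch below is not the estimation procedure of this paper: it assumes that $\beta$, $\lambda_0$, $\alpha$ and $\eta$ are held fixed, that both frailties are gamma distributed with expectation 1 and variances $1/\alpha$ and $1/\eta$, and that the integrated hazards are already summed within each subcluster. Under these assumptions, the conditional expectation of a gamma frailty given the other frailty is the standard posterior mean (shape plus number of transitions, divided by rate plus offset-weighted integrated hazard), and the two updates are simply alternated. The function name and data layout are hypothetical.

```python
def alternate_frailty_updates(H, l, alpha, eta, n_iter=200, tol=1e-10):
    """Alternate the two one-frailty updates, each treating the other as an offset.

    H[i][j]: integrated hazard summed over the spells of subcluster j of cluster i
             (assumed known here; in practice it depends on beta and lambda_0).
    l[i][j]: number of transitions l_ij in that subcluster.
    alpha, eta: gamma parameters (mean 1, variances 1/alpha and 1/eta), held fixed.
    """
    v = [1.0 for _ in H]
    w = [[1.0 for _ in row] for row in H]
    for _ in range(n_iter):
        max_change = 0.0
        # First sub-model: w is an offset, update the cluster frailties v_i.
        for i, row in enumerate(H):
            m_i = sum(l[i])
            offset_hazard = sum(w[i][j] * row[j] for j in range(len(row)))
            new_v = (alpha + m_i) / (alpha + offset_hazard)
            max_change = max(max_change, abs(new_v - v[i]))
            v[i] = new_v
        # Second sub-model: v is an offset, update the subcluster frailties w_ij.
        for i, row in enumerate(H):
            for j in range(len(row)):
                new_w = (eta + l[i][j]) / (eta + v[i] * row[j])
                max_change = max(max_change, abs(new_w - w[i][j]))
                w[i][j] = new_w
        if max_change < tol:
            break
    return v, w


# Toy data: cluster 0 has two subclusters, cluster 1 has one.
H = [[1.0, 2.0], [0.5]]
l = [[1, 0], [2]]
v, w = alternate_frailty_updates(H, l, alpha=2.0, eta=3.0)
```

In this toy example the iteration settles on values where each frailty equals its conditional expectation given the other, which is the stability property the alternating scheme aims for.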
4 Estimation of the MPH models with one random effect
Numerous studies set $\lambda_1$ as the exponential function and base inference on the EM algorithm. Considering gamma unobserved heterogeneity, Gill (1985) suggests it in a multiple spell setting in a note on Clayton and Cuzick (1985). Johansen (1983) shows that the partial likelihood is a likelihood where $\lambda_0$ is profiled out, the latter being recovered with the Breslow estimator (i.e. the Nelson-Aalen estimator when there are no covariates). Gill (1985) also suggests using this approach in the presence of unobserved heterogeneity, Klein (1992) implements it considering a gamma distributed random effect and Parner (1997) extends it to all shared frailty models with a known Laplace transform of the mixing distribution. Lancaster (1990, p.264-268) also uses Johansen's (1983) approach considering an individual gamma random effect. In this section, the partial likelihood in a model with one random effect and the result of Therneau et al. (2003) are briefly recalled. Consider the model:

$$\lambda_{ij}(t) = v_i \, \lambda_0(t) \exp(X_{ij}(t_{ij}) \beta), \tag{7}$$
where $i$ ($i = 1, \dots, I$) is the cluster index, $j$ ($j = 1, \dots, J_i$) the spell index and $v_i$ a cluster specific random effect. Define the risk set $R_{ij}$ as the set of spells still not completed at the instant just before $t_{ij}$. The associated partial likelihood (see for example Cox, 1972, and Cox, 1975) is:

$$L_{PL}(\beta, v) = \prod_{i=1}^{I} \prod_{j=1}^{J_i} \left[\frac{v_i \exp(X_{ij}(t_{ij}) \beta)}{\sum_{(k,l) \in R_{ij}} v_k \exp(X_{kl}(t_{ij}) \beta)}\right]^{\delta_{ij}}. \tag{8}$$

Assume the $v_i$ follow a gamma distribution with expectation 1 and variance $1/\alpha$. Therneau et al. (2003) demonstrate that the solution for this model can be obtained exactly by maximizing the following penalized partial likelihood:

$$\ln L_{PPL}(\beta, v, \alpha) = \ln L_{PL}(\beta, v) - \frac{1}{\alpha} \sum_{i=1}^{I} (\ln v_i - v_i). \tag{9}$$
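To make the partial likelihood (8) concrete, the following minimal sketch evaluates $\ln L_{PL}(\beta, v)$ for given frailties, which then act as multiplicative offsets in the risk-set sums. It is only an illustration under simplifying assumptions that are not made in the text: a single scalar covariate that is constant over time, no tied durations, and a hypothetical record layout `(cluster, time, delta, x)`. The penalty term of (9) is not included.

```python
import math


def log_partial_likelihood(spells, beta, v):
    """Log of the partial likelihood (8) with the frailties v treated as known.

    spells: list of (cluster, time, delta, x) with delta the non-censoring
            indicator and x a scalar, time-constant covariate (an assumption
            of this sketch; the text allows time-varying covariates).
    beta: scalar regression coefficient.
    v: dict mapping each cluster to its frailty v_i.
    """
    total = 0.0
    for ci, ti, di, xi in spells:
        if not di:
            continue  # censored spells contribute only through the risk sets
        # Risk set: spells still not completed at the instant just before ti.
        denom = sum(v[ck] * math.exp(xk * beta)
                    for ck, tk, _, xk in spells if tk >= ti)
        total += math.log(v[ci]) + xi * beta - math.log(denom)
    return total


# Toy data: two completed spells in cluster 0, one censored spell in cluster 1.
spells = [(0, 1.0, 1, 0.0), (0, 2.0, 1, 0.0), (1, 3.0, 0, 0.0)]
equal = log_partial_likelihood(spells, 0.0, {0: 1.0, 1: 1.0})
unequal = log_partial_likelihood(spells, 0.0, {0: 2.0, 1: 1.0})
scaled = log_partial_likelihood(spells, 0.0, {0: 4.0, 1: 2.0})
```

Multiplying all frailties by the same constant leaves the value unchanged, since the constant cancels between numerator and denominator of each ratio in (8); only the relative sizes of the $v_i$ matter for the partial likelihood itself.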
In a general penalized likelihood setting, $1/\alpha$ is a smoothing parameter governing the tradeoff between the fit to the data and the smoothness of the penalized likelihood. The solution to the penalized partial likelihood above is equivalent to the EM solution for an MPH model with a gamma shared frailty such as (7) in a partial likelihood approach. This result relies on the choice of the penalty function and does not hold if one uses as a penalty function the quantity $\int_0^{\infty} \left(\lambda_0^{(2)}(u)\right)^2 du$, where $\lambda_0^{(2)}(u)$ stands for the second derivative of the baseline hazard, as done for example in Rondeau et al. (2003) to ensure the convenient properties described in Montricher et al. (1975).

We implement two penalized partial likelihood maximization algorithms instead of two EM algorithms to estimate both sub-models. Each algorithm is organized in two loops, and I describe here the algorithm corresponding to the model where the community random effect is considered as an offset. At iteration $(q)$, the first loop maximizes the log penalized partial likelihood where $\alpha$ is profiled out (see Therneau et al., 2003) and returns $\alpha^{(q)}$. The second loop considers this $\alpha^{(q)}$ as fixed, uses a Newton-Raphson procedure to optimize the penalized partial likelihood and returns $(\beta_v^{(q)}, v^{(q)})$. Once the maximum is reached, $v^{(q)}$ is passed to the fitting program of the second sub-model. It is then treated as an offset, the two loops are executed, the algorithm returns $(\beta_w^{(q)}, w^{(q)}, \eta^{(q)})$ and iteration $(q+1)$ starts. Both algorithms are iterated until convergence, at which point $\beta_v^{(q)} = \beta_w^{(q)}$ and the estimated random effects are stable. The integrated baseline hazard at iteration $(q)$ can be recovered with the Breslow estimator: $\hat{\Lambda}_0^{(q)}(t) =$
X tijk