Statistical Methods in Medical Research http://smm.sagepub.com
Inference for multi-state models from interval-censored data D Commenges Stat Methods Med Res 2002; 11; 167 DOI: 10.1191/0962280202sm279ra The online version of this article can be found at: http://smm.sagepub.com/cgi/content/abstract/11/2/167
Published by: http://www.sagepublications.com
Downloaded from http://smm.sagepub.com at PENNSYLVANIA STATE UNIV on April 12, 2008 © 2002 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution.
Statistical Methods in Medical Research 2002; 11: 167–182
Inference for multi-state models from interval-censored data
D Commenges
INSERM U330, Bordeaux, France
Clinical statuses of subjects are often observed at a finite number of visits. This leads to interval-censored observations of times of transition from one state to another. The likelihood can still easily be written in terms of both transition probabilities and transition intensities. In homogeneous Markov models, transition probabilities can be expressed simply in terms of transition intensities, but this is not the case in more general multi-state models. In addition, inference in homogeneous Markov models is easy because these are parametric models. Non-parametric approaches to non-homogeneous Markov models may follow two paths: one is the completely non-parametric approach and can be seen as a generalisation of the Turnbull approach; the other implies a restriction to smooth intensity models. In particular, the penalized likelihood method has been applied to this problem. This paper gives a review of these topics.
1 Introduction
When biostatisticians began to be interested in survival data, the problem of right-censored data arose immediately, and very soon Kaplan and Meier [1] gave the non-parametric maximum likelihood estimator (NPMLE) of the survival function. Cox [2] proposed the proportional hazards model together with the partial likelihood, and the properties of these methods were studied using counting process theory by Aalen [3]. The Nelson–Aalen estimator was extended to multi-state models by Aalen and Johansen [4]. It was soon realized that survival data methods were useful for modeling any time from an origin to an event, and that this event was not necessarily death but could be the onset of a disease or any clinical event of interest. However, while time of death can often be observed nearly exactly when it is not right-censored, clinical status is generally observed at a given number of visits, leading to interval-censored observations. Interval-censored data were first studied by Peto [5], who gave the NPMLE for the distribution function in that case. His approach was generalized by Turnbull [6], who treated the case of both interval-censored and truncated data, and this was further extended to include covariates [7,8]. It must be noted that treating interval-censored data is considerably more difficult than treating right-censored data, both analytically and numerically: compare for instance the complexity of the Kaplan–Meier and the Peto–Turnbull estimators.

The aim of this paper is to consider inference for multi-state models from interval-censored data. For the homogeneous Markov model (HMM) the solution of this problem has long been known [9], although it is not widely used in medical research or epidemiology. For non-homogeneous Markov models (NHMM) or semi-Markov models the problem of inference is considerably more difficult. One of the key differences between the two cases is that HMMs are, by definition, parametric. Another key point is that the relationship between transition probabilities and transition intensities is simple in HMMs and not in other models. Finally, it must be noted that interval-censoring in multi-state models gives rise to a new difficulty, which does not arise in survival models: generally several paths are possible for going from state $h$ to state $j$ between time $s$ and time $t$, so it is not known which transitions have occurred.

Only a few works have addressed this problem. One of the first was that of De Gruttola and Lagakos [10], who studied HIV infection and AIDS in a cohort of haemophiliacs, the date of infection being unknown; however, the authors did not explicitly recognize that this was a multi-state model, and the problem was treated in discrete time. The first explicit non-parametric treatment of interval-censored observations from a multi-state model in continuous time was given by Halina Frydman, who treated a progressive three-state model [11,12] and a special case of the illness–death model [13]. The penalized likelihood approach [14], already proposed for interval-censored survival data [15], was extended to a three-state progressive model by Joly and Commenges [16] and to the illness–death model by Joly et al. [17].

In Section 2 basic definitions and modelling are given. Different patterns of observation are considered in Section 3; in Section 4 the expression for the likelihood is given, and different approaches to likelihood-based inference are considered in Section 5. Finally, some examples are shown in Section 6.

Address for correspondence: D Commenges, INSERM U330, 146 rue Leo Saignat, Bordeaux, 33076, France. E-mail: [email protected]
© Arnold 2002
2 Basic definitions and modelling
2.1 Multi-state processes

Consider a process $X = (X(t), t \ge 0)$. A multi-state process is a process that can take a finite number of states, that is, for any $t$, the variable $X(t)$ has values in $\{0, 1, \ldots, K\}$. The law of a multi-state process can be specified by the transition probabilities $p_{hj}(s, t; \mathcal{F}(s-)) = P[X(t) = j \mid X(s) = h, \mathcal{F}(s-)]$, $h = 0, \ldots, K$; $j = 0, \ldots, K$. The history $\mathcal{F}(s)$ of the process is generated by $\{X(u), u \le s\}$. In probabilistic language, $\mathcal{F}(s)$ is an element of a 'filtration', but it can be understood intuitively as the trajectory of the process until time $s$. If the notion of state is to be useful, at least part of the history must be unnecessary to predict future evolution given the state at time $s$. More or less stringent assumptions can be considered. A Markov model is obtained if we assume that, given the state at time $s$, the whole history before $s$ can be forgotten: $p_{hj}(s, t; \mathcal{F}(s-)) = P[X(t) = j \mid X(s) = h]$. The matrices with elements $p_{hj}(s, t)$ are the transition probability matrices and will be denoted $P(s, t)$. In the homogeneous Markov process these quantities depend only on $t - s$; thus we have $P(s, t) = P(0, t - s)$; by a slight abuse of notation we will use a matrix depending on just one time argument, putting $P(t - s) = P(0, t - s)$. There is the important Chapman–Kolmogorov equation $P(s, t) = P(s, u) P(u, t)$, $s < u < t$ (also called the semi-group property). For the semi-Markov process, called the current state model by Hougaard [18], the transition probabilities depend in addition on the time at which the last transition before $s$ occurred.

General transition intensities $\alpha_{hj}(t; \mathcal{F}(t-))$, for $h \ne j$, may be defined when $p_{hj}(s, t)$ is continuous both in $s$ and in $t$ for all $s$ and $t$, as the following limits (if they exist):

$$\alpha_{hj}(t; \mathcal{F}(t-)) = \lim_{\Delta t \to 0} \frac{p_{hj}(t, t + \Delta t; \mathcal{F}(t-))}{\Delta t}. \tag{1}$$
It is reasonable in most applications in epidemiology to think that these limits exist; it is even reasonable to expect continuous and smooth transition intensities. We define $\alpha_{hh}(t) = -\sum_{j \ne h} \alpha_{hj}(t)$, and $\sum_{j \ne h} \alpha_{hj}(t)$ is the hazard function associated with the distribution of the sojourn time in state $h$. For non-homogeneous Markov processes we have $\alpha_{hj}(t; \mathcal{F}(t-)) = \alpha_{hj}(t)$ for $h, j = 0, \ldots, K$. If it is assumed that the transition intensities do not depend on time we write $\alpha_{hj}(t) = \alpha_{hj}$. This is the homogeneous Markov process.

For the mathematical treatment of the non-parametric approach, the integrated (or cumulative) transition intensities are needed. If the limits (1) exist for all $t$, the integrated transition intensities are $A_{hj}(t) = \int_0^t \alpha_{hj}(u)\,du$. If for all $t$ they are either 0 or do not exist, we can still define $A_{hj}(t) = \int_0^t dA_{hj}(u)$ with $dA_{hj}(u) = p_{hj}(t) - p_{hj}(t-)$; in this case $A(\cdot)$ is a step function and we will call this case the discrete intensity case. This case is interesting because although we may think that biological phenomena may be described by processes with continuous transition intensities, the non-parametric maximum likelihood estimator (the Kaplan–Meier–Nelson–Aalen estimator for survival problems or the Aalen–Johansen estimator for Markov models) leads to discrete intensities. In the continuous case the matrix of transition intensities is the derivative of the matrix $A$ of integrated transition intensities and we will call it $A'(t)$.

It is important to realize that a multi-state process can be described either by its transition probabilities or by its transition intensities. We will need to manipulate both; models will be proposed for intensities while transition probabilities are needed for writing the likelihood. One key issue is how these two objects are related. In the first place, the transition intensities are defined as limits of transition probabilities as in equation (1).
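The link between intensities and probabilities can be checked numerically in the simplest setting. The sketch below is an illustration only, not the author's method: it assumes a homogeneous Markov illness-death model with hypothetical intensity values, and it relies on the standard result that $P(t) = \exp(Qt)$ when the intensity matrix $Q$ is constant (a result this section does not state explicitly). The names `Q`, `mat_exp` and `transition_matrix` are introduced here for the example. It verifies the Chapman–Kolmogorov equation and recovers an intensity as the limit in equation (1).

```python
import math

# Hypothetical constant intensities for an illness-death model
# (0 = healthy, 1 = ill, 2 = dead); the values are illustrative only.
ALPHA_01, ALPHA_02, ALPHA_12 = 0.10, 0.05, 0.20

# Intensity matrix: off-diagonal entries are alpha_hj,
# diagonal entries are alpha_hh = -sum over j != h of alpha_hj.
Q = [
    [-(ALPHA_01 + ALPHA_02), ALPHA_01, ALPHA_02],
    [0.0, -ALPHA_12, ALPHA_12],
    [0.0, 0.0, 0.0],
]


def mat_mul(a, b):
    """Ordinary product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, squarings=10, terms=12):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    n = len(m)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in m]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term becomes small**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrix(t):
    """P(t) = exp(Q t), valid under the homogeneous Markov assumption."""
    return mat_exp([[q * t for q in row] for row in Q])


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Chapman-Kolmogorov: P(3) should equal P(1) P(2).
ck_error = max_abs_diff(
    transition_matrix(3.0),
    mat_mul(transition_matrix(1.0), transition_matrix(2.0)),
)

# Equation (1): p_01(t, t + dt) / dt should approach alpha_01 as dt -> 0.
dt = 1e-4
alpha_01_approx = transition_matrix(dt)[0][1] / dt

# Closed form for staying in state 0: exp(-(alpha_01 + alpha_02) t).
p00_closed_form = math.exp(-(ALPHA_01 + ALPHA_02) * 2.0)

print("Chapman-Kolmogorov discrepancy:", ck_error)
print("finite-difference estimate of alpha_01:", alpha_01_approx)
print("p_00(2): numerical", transition_matrix(2.0)[0][0], "closed form", p00_closed_form)
```

The pure-Python matrix exponential keeps the example free of dependencies; in practice a library routine would normally be preferred. For non-homogeneous or semi-Markov models no such simple closed form is available, which is the difficulty discussed in the text.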
Secondly, for Markov processes, the Kolmogorov equations (derived from the Chapman–Kolmogorov equation) are differential equations linking transition probabilities and intensities. However, it is useful, in particular for numerical computations, to have explicit formulae for the transition probability matrix as a function of the intensities. For Markov processes, the product integral [19] is such an expression. However, it has not yet been used directly for numerical computations in the continuous case. It takes the form of an ordinary matrix product in the discrete intensities case, where $dA(t) = 0$ for all $t$ except for a finite number of values $t_j$, $j = 1, \ldots, m$: $P(s, t) =$
Y
j:s