Maximum a Posteriori Estimation in Dynamical ...

Statistical Communications in Infectious Diseases Volume 4, Issue 1

2012

Article 2

Maximum a Posteriori Estimation in Dynamical Models of Primary HIV Infection Julia Drylewicz, UMC Utrecht, The Netherlands Daniel Commenges, INSERM U897, Bordeaux Segalen University, France Rodolphe Thiebaut, INSERM U897, Bordeaux Segalen University, France

Recommended Citation: Drylewicz, Julia; Commenges, Daniel; and Thiebaut, Rodolphe (2012) "Maximum a Posteriori Estimation in Dynamical Models of Primary HIV Infection," Statistical Communications in Infectious Diseases: Vol. 4: Iss. 1, Article 2. DOI: 10.1515.1948-4690.1040 ©2012 De Gruyter. All rights reserved. Brought to you by | Utrecht University Library (Utrecht University Library ) Authenticated | 172.16.1.226 Download Date | 5/11/12 8:39 AM

Maximum a Posteriori Estimation in Dynamical Models of Primary HIV Infection Julia Drylewicz, Daniel Commenges, and Rodolphe Thiebaut

Abstract

Dynamical models based on ordinary differential equations (ODE) are increasingly used to model viral infections such as HIV. This kind of model is based on biological knowledge and takes into account complex non-linear interactions between markers. The estimation of such models is made difficult by the lack of analytical solutions and several methods based on Bayesian or frequentist approaches have been proposed. However, because of identifiability issues, in a frequentist approach some parameters have to be fixed to values taken from the literature. In this paper we propose a Maximum A Posteriori (MAP) approach to estimate all the parameters of ODE models, allowing prior knowledge on biological parameters to be taken into account. The MAP approach has two main advantages: the computation time can be fast (relative to the full Bayesian approach) and it is straightforward to incorporate complex prior information. We applied this method to an original model of primary HIV infection taking into account several biological hypotheses for the HIV-immune system interaction. Parameters were estimated using HIV RNA load and CD4 count measurements of 710 patients from the Concerted Action on SeroConversion to AIDS and Death in Europe (CASCADE) Collaboration.

KEYWORDS: HIV, maximum a posteriori, ordinary differential equations models, primary infection

Author Notes: Authors: Julia Drylewicz: UMC Utrecht, Department of Immunology, The Netherlands; INSERM U897, France. Daniel Commenges: INSERM U897, Bordeaux Segalen University, ISPED, France. Rodolphe Thiebaut: INSERM U897, Bordeaux Segalen University, ISPED, France. Corresponding author: Rodolphe Thiébaut INSERM, U897 Epidemiology and Biostatistics Bordeaux Segalen University 146, rue Leo Saignat, F-33076 Bordeaux, France. Email: [email protected]


Drylewicz et al.: Maximum a Posteriori Estimation in Dynamical Models

1 Introduction

Primary Human Immunodeficiency Virus (HIV) infection is a highly dynamic period characterized by intensive viral replication associated with a depletion of CD4+ T lymphocytes, followed by a stable level of viral load and CD4 count called the "setpoint" (Mellors et al., 1995). The primary infection stage corresponds to the first months of infection and determines the clinical progression (Mellors et al., 1995). The level of viral load at the time of the peak (between $10^5$ and $10^7$ copies/mL) is correlated with the viral setpoint (Lindback et al., 2000). This setpoint is itself associated with clinical progression (Mellors et al., 1995), as well as the immune activation setpoint (Deeks et al., 2004; Ganesan et al., 2010). The higher the viral and immune activation setpoints, the faster the progression to AIDS. Also, a symptomatic primary infection is predictive of a fast clinical progression (Henrard et al., 1995; Vanhems et al., 1998). Thus, the study of primary infection seems to be essential to a better understanding of the global evolution of HIV infection. Indeed, this initial period provides an opportunity to examine the interaction between the virus and the immunologically competent host. Unfortunately, few data from recently infected patients are available. Moreover, because this stage of infection is asymptomatic or has non-specific symptoms in most cases (Kahn and Walker, 1998), the date of infection is unknown.

Several attempts have been made to model marker dynamics during primary HIV infection and, in particular, the viral dynamics using parametric or nonparametric models. Models based on Ordinary Differential Equations (ODE) have been used to study the interaction between HIV and the immune system. These models are based on biological knowledge and take into account complex nonlinear interactions between HIV and potentially multiple populations of T lymphocytes. 
The trajectories of viral load and CD4 count have often been obtained by simulation as in Phillips (1996); Tuckwell and Le Corfec (1998); Culshaw and Ruan (2000); Wick et al. (2002); Snedecor (2003); Wodarz and Hamer (2007); Burg et al. (2009); Iwami et al. (2009) by fixing all parameter values according to data published in the literature. Unfortunately, truly accurate information on each parameter is lacking, and ODE models can be very sensitive to some parameter values, leading to very different trajectories for small changes in values. Studies that have attempted to make inference from real data (Kaufmann et al., 1998; Little et al., 1999; Stafford et al., 2000; Ribeiro et al., 2010) neglected CD4 count dynamics and were only interested in the early viral dynamics. These studies included only a few highly selected patients, having intensively repeated blood measurements. They attempted to estimate the time of viral peak, as well as its value (Kaufmann et al., 1998) using an imputation of the date of infection - most often the time of onset of clinical symptoms. Published by De Gruyter, 2012


Statistical Communications in Infectious Diseases, Vol. 4 [2012], Iss. 1, Art. 2

The estimation of the parameters of models based on non-linear ODE is made difficult by the lack of analytical solutions. Several methods based on Bayesian or frequentist approaches have been proposed (Putter et al., 2002; Filter et al., 2005; Verotta, 2005; Huang et al., 2006; Samson et al., 2006; Guedj et al., 2007a; Cao et al., 2008; Toni et al., 2009; Commenges et al., 2011; Lavielle et al., 2011). There are major identifiability issues in these models. Even if all the parameters are theoretically identifiable, practical identifiability issues (Guedj et al., 2007b) are still important. Indeed, measurements are almost never available for all biological compartments included in the model. Therefore, it is often necessary to fix some parameters. Usually, parameters for which we have "reasonable" estimates from experimental studies are fixed. For instance, the death rate of free HIV virions is often fixed between 3 and 30 day$^{-1}$ (Stafford et al., 2000; Guedj et al., 2007a; Drylewicz et al., 2010b). The Bayesian approach appears to be more flexible since it allows the introduction of the necessary a priori knowledge. The Markov chain Monte Carlo (MCMC) algorithm is most often used for producing a posteriori distributions of the parameters. In such complex models, however, it can be intractable. Recently, it has been proposed that the use of MCMC might be avoided using approximations (Rue et al., 2009), but even this approach is time-consuming in very complex models. Another alternative is to revert to an old method which has the advantage of simplicity: the Maximum A Posteriori (MAP) estimation is still used in some areas such as biomedical imaging, neurosciences, engineering or agronomy (Commenges, 1984; Ducrocq, 1994; Wang et al., 2008; Koyama and Paninski, 2009). The MAP estimate can be seen as a Bayes estimate of the parameter vector when the loss function is not specified (DeGroot, 1970). It can also be seen as using a normal approximation of the posterior distribution. The MAP has two main advantages: (i) the computation time can be fast, relative to MCMC, and (ii) it is straightforward to incorporate complex prior distributions.

In this paper, we estimate and compare ODE models of primary HIV infection using MAP estimators. The paper is organized as follows: in section 2, we describe the MAP estimator for a general non-linear mixed model. In section 3, we present an ODE model for primary HIV infection. In section 4, we apply the MAP estimator to a real data set of 710 HIV seroconverters from the Concerted Action on SeroConversion to AIDS and Death in Europe (CASCADE) Collaboration, and in section 5 we compare the MAP and the full Bayesian approach using simulations. A conclusion is given in section 6.


2 Inference in non-linear mixed models: MAP estimator, score tests and model selection

2.1 Complexity of the inference for HIV models for primary infection

Inference for HIV models based on ODE systems without an analytical solution, having many parameters, incorporating random effects for representing the intersubject variability and taking left-censoring (due to detection limits of viral load) into account is very challenging. In primary HIV infection, there are additional difficulties, the most prominent being that the infection date is not precisely known. This problem was tackled in Drylewicz et al. (2010b). Here we aim to use the MAP approach for enhancing practical identifiability together with two tools that can be adapted for the MAP approach: score tests and cross-validation. For the sake of simplicity, we present the MAP approach for estimating parameters of a non-linear mixed-effects model using repeated measurements of one fully observed (without left-censoring) marker. The method is easily extended for fitting multiple markers, including left-censored, and will be applied in section 4.

2.2 The non-linear mixed model

For subject i with i = 1, ..., n, we consider a non-linear mixed-effects model for the observed vectors $Y^i = (Y^i_j, j = 1, \dots, n_i)$:

$$Y^i_j = g(t^i_j, \tilde{\xi}^i) + \varepsilon^i_j, \quad j = 1, \dots, n_i, \tag{1}$$

where $Y^i_j$ is the jth measurement for subject i at the time $t^i_j$, and the $\varepsilon^i_j$ are independent Gaussian measurement errors with zero mean and variance $\sigma^2$. The $Y^i$'s are assumed to be mutually independent. The function g(.) is a twice differentiable function (for instance g is the base-10 logarithm of the viral load, see section 3.2). The individual parameters $\tilde{\xi}^i = (\tilde{\xi}^i_l, l = 1, \dots, p)$ are modeled as a linear form:

$$\tilde{\xi}^i_l = \phi_l + \omega_l u^i_l + z^i_l \beta_l, \tag{2}$$

where $\phi_l$ is the intercept, $\omega_l$ is the standard deviation of the random effect and $z^i_l$ is a vector of explanatory variables of the l-th parameter. The $\beta_l$'s are vectors of regression coefficients. We assume $u^i \sim N(0, I_p)$, where $u^i = (u^i_l, l = 1, \dots, p)$ is the individual vector of random effects. We denote by $\theta = (\phi_l, \omega_l, \beta_l, l = 1, \dots, p)$ the set of parameters of the model.


We denote by $\mathcal{L}^\theta_{Y^i|u^i}(Y^i|u)$ the likelihood of observations for subject i given that the random effects $u^i$ take the value u. Given $u^i$, the $Y^i_j$ are independent, so $\mathcal{L}^\theta_{Y^i|u^i}(Y^i|u) = \prod_j \mathcal{L}^\theta_{Y^i_j|u^i}(Y^i_j|u)$ where:

$$\mathcal{L}^\theta_{Y^i_j|u^i}(Y^i_j|u) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(Y^i_j - g(t^i_j, \tilde{\xi}^i))^2}{2\sigma^2}\right). \tag{3}$$

The observed log-likelihood for subject i is:

$$L^i_\theta = \log \int_{\mathbb{R}^p} \mathcal{L}^\theta_{Y^i|u^i}(Y^i|u)\,\phi(u)\,du, \tag{4}$$

where $\phi$ is the multivariate normal density of $N(0, I_p)$. We denote by $L_\theta = L^1_\theta + \dots + L^n_\theta$ the global log-likelihood.
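To make the integral in equation (4) concrete, the following minimal sketch approximates the observed log-likelihood of one subject by plain Monte Carlo over a scalar random effect (p = 1). The mean function `g` (an exponential decay), the function names and the number of draws are illustrative assumptions and are not taken from the paper, which does not describe its numerical integration scheme in this section.

```python
import math
import random

def g(t, xi):
    # Hypothetical mean function: exponential decay with rate exp(xi).
    return math.exp(-math.exp(xi) * t)

def cond_loglik(y, times, phi, omega, sigma, u):
    # Log of the product of Gaussian terms (equation (3)), given u.
    xi = phi + omega * u  # equation (2) without explanatory variables
    ll = 0.0
    for yj, tj in zip(y, times):
        resid = yj - g(tj, xi)
        ll += -0.5 * math.log(2 * math.pi * sigma ** 2) - resid ** 2 / (2 * sigma ** 2)
    return ll

def observed_loglik(y, times, phi, omega, sigma, n_draws=5000, seed=0):
    # Equation (4): log of the conditional likelihood averaged over u ~ N(0, 1).
    rng = random.Random(seed)
    logs = [cond_loglik(y, times, phi, omega, sigma, rng.gauss(0.0, 1.0))
            for _ in range(n_draws)]
    m = max(logs)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in logs) / n_draws)
```

When `omega` is zero the random effect drops out and the function returns the ordinary Gaussian log-likelihood, which is a convenient sanity check. In practice a quadrature rule would usually be preferred to raw Monte Carlo for low-dimensional random effects.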

2.3 Maximum A Posteriori estimator

From a frequentist point of view, the vector $\theta$ is deterministic and the maximum likelihood estimator ($\hat{\theta}_{ML}$) is defined by $\hat{\theta}_{ML} = \mathrm{Argmax}_{\theta \in \Theta} L_\theta$, where $\Theta$ is the parameter space, here a subset of $\mathbb{R}^p$. In a Bayesian approach, the uncertainty on $\theta$ is modeled by a probability distribution: initially, this is the prior, specified by the probability density $\pi(\theta)$; after taking the observations into account, this is the posterior, $\pi(\theta|Y)$. The MAP estimator is defined by $\hat{\theta}_{MAP} = \mathrm{Argmax}_{\theta \in \Theta} \pi(\theta|Y)$. Not only is the posterior distribution consistent (Diaconis and Freedman, 1986), but, according to the Bernstein-von Mises Theorem, it also converges under "remarkably weak conditions" to a normal distribution (Van der Vaart, 2000, chap. 10). This result can be used to approximate the posterior distribution by a normal with expectation given by the MAP and variance given by the inverse of the Hessian of $-\log \pi(\theta|Y)$ computed at the MAP value (Carlin and Louis, 2009). Applying Bayes' theorem, we have:

$$\hat{\theta}_{MAP} = \mathrm{Argmax}_{\theta \in \Theta} L^{MAP}_\theta, \quad \text{with } L^{MAP}_\theta = L_\theta + \log(\pi(\theta)). \tag{5}$$

The prior can be improper, that is, the prior measure need not be a probability measure. In this sense the Maximum Likelihood Estimator is a special case of the MAP, taking flat prior distributions for all the parameters (in fact, Lebesgue measure on $\Theta$). Most often in Bayesian analysis, the prior is chosen such that the parameters are independent.


The MAP can be seen as a penalized likelihood (Commenges, 2009) since there will be a larger penalty for values of $\theta$ having a small prior density than for values having a large prior density; conversely, any reasonable penalized log-likelihood can be seen as the log of a possibly improper a posteriori density. Whichever point of view is taken, the MAP allows the introduction of a priori knowledge on the parameters and this is why it is useful in complex modeling. There may be knowledge from the literature on the values of some parameters and this can be expressed by using independent Gaussian prior distributions for the parameters on which there is such a priori knowledge, while keeping flat prior distributions for the other parameters. This results in adding penalties of the type $-\frac{(\theta_l - \mu_l)^2}{2\kappa_l^2}$, where $\mu_l$ and $\kappa_l^2$ are the expectation and variance of the prior for $\theta_l$. The observed scores are computed for all l as:

$$U^{\bullet}_{\beta_l} = \frac{\partial L^{MAP}_\theta}{\partial \beta_l} = \frac{\partial L_\theta}{\partial \beta_l} - \frac{\theta_l - \mu_l}{\kappa_l^2}$$

and the Hessian is approximated as proposed by Commenges et al. (2006, section 4.5). The link with the penalized likelihood allows the easy incorporation of more complex a priori knowledge on the parameters. Instead of taking a weighted sum of squared deviations of the parameters from their prior expected values as the penalty, we can take any function of the parameters for which there is a priori knowledge. One issue with the MAP is that it is not invariant under non-linear transformations; however, our simulation study (section 5) shows that the MAP approach works well in practice.
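The penalized-likelihood reading of the MAP can be illustrated on a deliberately simple toy problem: observations $y_j \sim N(\theta, \sigma^2)$ with a Gaussian prior $\theta \sim N(\mu, \kappa^2)$. The sketch below maximizes the penalized log-likelihood by Newton's method, using the score with the penalty term $-(\theta - \mu)/\kappa^2$, and returns the normal-approximation variance as the inverse of the Hessian of minus the penalized log-likelihood. The toy model and the function name `map_estimate` are assumptions made for illustration; the ODE model of the paper requires a numerical likelihood instead.

```python
def map_estimate(y, sigma, mu, kappa, n_iter=20):
    # Toy model: y_j ~ N(theta, sigma^2), prior theta ~ N(mu, kappa^2).
    n = len(y)
    theta = mu  # start at the prior mean
    # Hessian of the penalized log-likelihood (constant for this model).
    hessian = -(n / sigma ** 2 + 1.0 / kappa ** 2)
    for _ in range(n_iter):
        # Score = likelihood score + derivative of the Gaussian penalty.
        score = sum(yj - theta for yj in y) / sigma ** 2 - (theta - mu) / kappa ** 2
        theta -= score / hessian  # Newton step
    posterior_variance = -1.0 / hessian
    return theta, posterior_variance
```

For this quadratic problem the iteration converges after one step to the usual precision-weighted average of the sample mean and the prior mean; letting `kappa` grow large recovers the maximum likelihood estimate, in line with the flat-prior remark in section 2.3.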

2.4 Model selection

Using again the link between MAP and penalized likelihood, it is possible to apply model selection criteria developed for penalized likelihood to MAP estimates. In particular, an approximation of the likelihood cross-validation criterion ($LCV_a$) proposed by Commenges et al. (2007) can be computed relatively quickly and is appropriate in our context. This criterion is defined by:

$$LCV_a = -n^{-1}\left[L_{\hat{\theta}_{MAP}} - \mathrm{Tr}(H_{MAP}^{-1} H)\right], \tag{6}$$

where $H_{MAP}^{-1}$ is the inverse of the Hessian of $L^{MAP}_\theta$ calculated at the MAP estimator, and $H$ and $L_{\hat{\theta}_{MAP}}$ are the Hessian and the log-likelihood without the penalty term computed at $\hat{\theta}_{MAP}$.
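As a worked illustration of equation (6), the sketch below evaluates $LCV_a$ for a one-parameter toy model (observations $y_j \sim N(\theta, \sigma^2)$, Gaussian prior $\theta \sim N(\mu, \kappa^2)$), for which every term is available in closed form. The toy model and the function name are assumptions for illustration only. Both Hessians are taken here as Hessians of minus the (penalized) log-likelihood; the trace term is unchanged if both are taken with the opposite sign.

```python
import math

def lcva_normal_mean(y, sigma, mu, kappa):
    n = len(y)
    h = n / sigma ** 2              # Hessian of minus the log-likelihood
    h_map = h + 1.0 / kappa ** 2    # Hessian of minus the penalized log-likelihood
    # Closed-form MAP estimate for this toy model.
    theta_map = (sum(y) / sigma ** 2 + mu / kappa ** 2) / h_map
    # Log-likelihood WITHOUT the penalty, evaluated at the MAP estimate.
    loglik = sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
                 - (yj - theta_map) ** 2 / (2 * sigma ** 2) for yj in y)
    trace_term = h / h_map          # Tr(H_MAP^{-1} H) in one dimension
    return -(loglik - trace_term) / n
```

With a very diffuse prior the trace term tends to the number of parameters (here 1), so the criterion approaches the AIC divided by 2n; an informative prior shrinks the trace term below 1, which can be read as a reduced effective number of parameters.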

2.5 Score test

The theory of score tests developed in a frequentist framework can be extended to the MAP approach. We have shown the advantages of score tests in the context of HIV dynamics models for selecting explanatory variables and random effects in Drylewicz et al. (2010a), and also for testing the covariance of random effects in Drylewicz et al. (2010b, appendix A). Indeed, as the MAP estimator is asymptotically equivalent to the maximum likelihood estimator (Diaconis and Freedman, 1986), and as the penalty term is independent of the tested parameter under the null hypothesis (i.e. the tested parameter is null), we can extend the test statistics previously described.

2.5.1 Score test for explanatory variable

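If we want to test a possible effect of an explanatory variable on the l-th parameter $\tilde{\xi}^i_l$, the null hypothesis $H_0$ is "$\beta_l = 0$". If $H_0$ is true and $\hat{\theta}_0$ denotes the MAP estimator under $H_0$, then the score statistic is:

$$S = \frac{U^{\bullet}_{\beta_l}(\hat{\theta}_0)}{\sqrt{\widehat{\mathrm{var}}\, U^{\bullet}_{\beta_l}(\hat{\theta}_0)}}, \tag{7}$$

where $U^{\bullet}_{\beta_l}(\theta) = \frac{\partial L^{MAP}_\theta}{\partial \beta_l}\Big|_{\theta}$ and $\widehat{\mathrm{var}}\, U^{\bullet}_{\beta_l}(\hat{\theta}_0)$ is a consistent estimator of $\mathrm{var}\, U^{\bullet}_{\beta_l}(\hat{\theta}_0)$. We take $U^i_{\beta_l}(\hat{\theta}_0) = z^i_l U^i_{\phi_l}(\hat{\theta}_0)$ and $\widehat{\mathrm{var}}\, U^{\bullet}_{\beta_l}(\hat{\theta}_0) = \sum_i \left(U^i_{\beta_l}\right)^2$. This statistic has an asymptotic $N(0, 1)$ distribution under $H_0$.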
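The statistic in equation (7) is simple to assemble once the per-subject score contributions with respect to the intercept $\phi_l$ are available. The sketch below assumes these contributions have already been computed at the estimate under $H_0$ (how they are obtained depends on the model and is not shown) and that the explanatory variable is scalar; the function names are hypothetical.

```python
import math

def score_statistic(scores_phi, z):
    # scores_phi[i]: subject i's score contribution for the intercept phi_l,
    #                evaluated at the estimate under H0 (assumed precomputed).
    # z[i]: value of the (scalar) explanatory variable for subject i.
    u_beta = [zi * ui for zi, ui in zip(z, scores_phi)]  # U^i_beta = z^i * U^i_phi
    total = sum(u_beta)                                   # overall score
    var_hat = sum(u ** 2 for u in u_beta)                 # sum of squared contributions
    return total / math.sqrt(var_hat)

def two_sided_p_value(s):
    # P(|N(0, 1)| > |s|), using the complementary error function.
    return math.erfc(abs(s) / math.sqrt(2.0))
```

A practical attraction of this construction is that the model only needs to be fitted under the null hypothesis, so many candidate explanatory variables can be screened from a single fit.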

We can also define a global test: the null hypothesis $H_0$ is "$\beta_l = 0$, $l = 1, \dots, p$". The test statistic is $S_G = U_{\bullet}^T \left[\widehat{\mathrm{var}}\, U_{\bullet}\right]^{-1} U_{\bullet}$, where $U_{\bullet}$ is the vector of scores $U^{\bullet}_{\beta_l}(\hat{\theta}_0)$ for $l = 1, \dots, p$. This statistic has an asymptotic $\chi^2_p$ distribution under $H_0$.

2.5.2 Score test of homogeneity

If we want to test the presence of a random effect on the l-th parameter, the null hypothesis $H_0$ is "$\omega_l = 0$". If $H_0$ is true, we propose the score statistic:

$$S_H = \frac{T - \hat{E}_{H_0}[T]}{\sqrt{\widehat{\mathrm{var}}_{H_0}[T]}}, \quad T = U^T W^* U, \tag{8}$$

where $\hat{E}_{H_0}[T]$ and $\widehat{\mathrm{var}}_{H_0}[T]$ are estimators of the expectation and the variance of T under $H_0$, U is the vector of dimension $N = \sum_i n_i$ of

$$U^{ij}_l = \frac{\partial \log \mathcal{L}^\theta_{Y^i_j|u^i_l}}{\partial \varepsilon^{ij}_l}\bigg|_{\hat{\theta}_0}, \quad \varepsilon^{ij}_l = \omega_l u^i_l,$$

and $W^*$ is the correlation matrix of $u^i_l$ minus the identity matrix. This statistic has an asymptotic $N(0, 1)$ distribution under $H_0$. $\hat{E}_{H_0}[T]$ and $\widehat{\mathrm{var}}_{H_0}[T]$ can be obtained using a parametric bootstrap.


3 Dynamical modeling of HIV infection

3.1 Model for the system

HIV preferentially infects proliferating cells. Therefore, it is assumed that not all CD4 cells, but only a fraction of them denoted T, are targets of the virus (Grossman et al., 2002). It has been shown in vitro that the time-lag between virus entry and virus production is about 1 day (Kiernan et al., 1990; Barbosa et al., 1994). Therefore, we distinguished non-productive infected cells I from productive infected cells P as previously proposed by Nowak et al. (1997). In view of the above considerations, we developed an ODE model for HIV infection with five compartments: Q (non-infected quiescent CD4), T (non-infected target CD4), I (non-productive infected CD4), P (productive infected CD4) and V (free virions).

Another problem is that such models are not adapted to describe the dynamics in the very early phase of the infection. After an infectious contact, the virus first replicates locally, and is then transported to lymph nodes, where the intense replication that our model aims to capture occurs. This initial phase of 5 to 10 days is called the "eclipse phase" (McMichael et al., 2010). To model this phase, we assumed a latent time denoted $\upsilon$ between the introduction of the virus at time 0 and the beginning of the production of new virions occurring mainly in lymphoid organs. Thus not only the date of the infectious contact, but also the additional latent time $\upsilon$ before the dynamics described by our ODE system are established, are unknown.

Lastly, it is known that the immune response against HIV is established after the eclipse phase and becomes maximal at the time of peak viral load. Indeed, during the first weeks of infection (Turnbull et al., 2009; McMichael et al., 2010), the number of HIV-specific CD8+ T-lymphocytes (CD8) increases, and the total number of CD8 cells can reach 20 times the normal level (Pantaleo et al., 1997). This increase appears very early during infection (Koup et al., 1994; Borrow et al., 1994) and matches the viral decline (Koup et al., 1994; Musey et al., 1997; Kassutto et al., 2004). To model the inception of this specific immune response, we considered a time-dependent death rate of infected cells. The death rate is constant ($\mu_I(0)$) from infection time (t = 0) to $t_1$, increases linearly between $t_1$ and $t_2$, and plateaus at ($\mu_I(0) + \Delta\mu_I$) after $t_2$:

$$\mu_I(t) = \begin{cases} \mu_I(0) & \text{if } t \le t_1 \\ \mu_I(0) - \dfrac{\Delta\mu_I\, t_1}{t_2 - t_1} + \dfrac{\Delta\mu_I}{t_2 - t_1}\, t & \text{if } t_1 \le t \le t_2 \\ \mu_I(0) + \Delta\mu_I & \text{if } t \ge t_2 \end{cases} \tag{9}$$
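Equation (9) translates directly into a small function. The sketch below is one possible implementation; the argument names are illustrative choices, and the middle branch is written as $\mu_I(0) + \Delta\mu_I (t - t_1)/(t_2 - t_1)$, which is algebraically the same expression as in equation (9).

```python
def mu_I(t, mu_I0, delta_mu_I, t1, t2):
    # Piecewise-linear death rate of infected cells (equation (9)).
    if t <= t1:
        return mu_I0                       # constant before the immune response
    if t >= t2:
        return mu_I0 + delta_mu_I          # plateau once the response is established
    # Linear ramp between t1 and t2, continuous at both ends.
    return mu_I0 + delta_mu_I * (t - t1) / (t2 - t1)
```

The function is continuous at $t_1$ and $t_2$ but not differentiable there, which is worth keeping in mind when choosing step sizes for a numerical ODE solver.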


Figure 1: Graphical representation of the model: Quiescent cells Q are produced at rate λ, are activated into target cells T at rate α and die at rate µQ. Target cells return to the quiescent state at rate ρ, are infected at rate γV and die at rate µT. Non-productive infected cells I become productive at rate δ; then they produce new virions at rate π. All infected cells die at the same rate µI(t). Virions V die at rate µV.

This model can be written for a subject i as:

$$\begin{cases} \dfrac{dQ^i}{dt} = \lambda^i - \alpha^i Q^i + \rho^i T^i - \mu_Q^i Q^i \\ \dfrac{dT^i}{dt} = \alpha^i Q^i - \gamma^i V^i T^i - \rho^i T^i - \mu_T^i T^i \\ \dfrac{dI^i}{dt} = \gamma^i V^i T^i - \mu_I^i(t) I^i - \delta^i I^i \\ \dfrac{dP^i}{dt} = \delta^i I^i - \mu_I^i(t) P^i \\ \dfrac{dV^i}{dt} = \pi^i P^i - \mu_V^i V^i \end{cases} \tag{10}$$

Quiescent cells Q are produced at rate λ, are activated into target cells T at rate α and die at rate µQ. Target cells return to the quiescent state at rate ρ, are infected at rate γV and die at rate µT. Non-productive infected cells I become productive at rate δ; then they produce new virions at rate π. All infected cells die at the same rate µI(t). Virions V die at rate µV. Figure 1 is a graphical representation of the model and the meanings of the parameters are summarized in Table 1. We assume that the model is at equilibrium before infection, so at $t = 0^-$ we have:

$$Q_{0^-} = \frac{\lambda(\rho + \mu_T)}{\mu_Q \rho + \mu_T \alpha + \mu_T \mu_Q}, \quad T_{0^-} = \frac{\lambda\alpha}{\mu_Q \rho + \mu_T \alpha + \mu_T \mu_Q}, \quad I_{0^-} = 0, \quad P_{0^-} = 0, \quad V_{0^-} = 0. \tag{11}$$
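The system (10) and the pre-infection equilibrium (11) can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the dictionary keys, the fixed-step fourth-order Runge-Kutta integrator and any numerical values used with it are assumptions chosen for readability, and `mu_I` is any callable returning the death rate of infected cells at time t.

```python
def uninfected_equilibrium(p):
    # Equation (11): steady state of Q and T in the absence of virus.
    denom = p["mu_Q"] * p["rho"] + p["mu_T"] * p["alpha"] + p["mu_T"] * p["mu_Q"]
    return p["lam"] * (p["rho"] + p["mu_T"]) / denom, p["lam"] * p["alpha"] / denom

def rhs(t, state, p, mu_I):
    # Right-hand side of system (10) for one subject.
    Q, T, I, P, V = state
    m = mu_I(t)
    return (
        p["lam"] - p["alpha"] * Q + p["rho"] * T - p["mu_Q"] * Q,
        p["alpha"] * Q - p["gamma"] * V * T - p["rho"] * T - p["mu_T"] * T,
        p["gamma"] * V * T - m * I - p["delta"] * I,
        p["delta"] * I - m * P,
        p["pi"] * P - p["mu_V"] * V,
    )

def rk4_step(t, state, h, p, mu_I):
    # One classical Runge-Kutta step of size h (illustrative solver choice).
    def shifted(s, k, c):
        return tuple(si + c * ki for si, ki in zip(s, k))
    k1 = rhs(t, state, p, mu_I)
    k2 = rhs(t + h / 2, shifted(state, k1, h / 2), p, mu_I)
    k3 = rhs(t + h / 2, shifted(state, k2, h / 2), p, mu_I)
    k4 = rhs(t + h, shifted(state, k3, h), p, mu_I)
    return tuple(s + h / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))
```

Evaluating `rhs` at the state returned by `uninfected_equilibrium` with I = P = V = 0 should give derivatives equal to zero up to rounding, which is a quick way to check a transcription of equations (10) and (11). For real fitting work an adaptive stiff solver would normally replace the fixed-step scheme.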


Table 1: Parameters of the dynamic model.

Parameter   Meaning (per day)
µI(t)       Death rate of infected cells (time-dependent)
α           Rate to become a target cell
λ           Rate of CD4 cells production per µL
µT          Death rate of target cells
π           Number of virions per infected cell
γ           Infection rate of target cells per virion per µL
ρ           Rate of reversion to the quiescent state
µV          Death rate of free virions
δ           Rate to become a productive infected cell
µQ          Death rate of quiescent cells
µI(0)       Initial death rate of infected cells
∆µI         Difference between initial and final death rate of infected cells

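The initial inoculum of virions is fixed at $10^{-6}$ virions/mm$^3$ (Ciupe et al., 2006; Stafford et al., 2000), which is equivalent to 5 virions for 5 liters of blood. The introduction of virions in the system at t = 0 disrupts the initial stability, and the system stabilizes to a new equilibrium if the reproductive number $R_0$ is larger than one. In our model,

$$R_0 = \frac{\gamma\pi\lambda\alpha\delta}{\mu_I \mu_V (\delta + \mu_I)(\rho\mu_Q + \alpha\mu_T + \mu_T\mu_Q)}.$$

If $R_0 > 1$, the new equilibrium is:

$$\bar{Q} = \frac{\lambda\gamma\delta\pi + \mu_I\mu_V\delta\rho + \mu_I^2\mu_V\rho}{\gamma\delta\pi(\alpha + \mu_Q)},$$

$$\bar{T} = \frac{(\delta + \mu_I)\mu_V\mu_I}{\gamma\delta\pi},$$

$$\bar{I} = \frac{-\mu_Q\rho\mu_I\mu_V\delta - \mu_Q\rho\mu_I^2\mu_V - \mu_T\alpha\mu_I\mu_V\delta - \mu_T\alpha\mu_I^2\mu_V - \mu_T\mu_Q\mu_I\mu_V\delta - \mu_T\mu_Q\mu_I^2\mu_V + \lambda\gamma\delta\pi\alpha}{\gamma\delta\pi(\alpha\delta + \alpha\mu_I + \mu_Q\delta + \mu_Q\mu_I)},$$

$$\bar{P} = \frac{\delta}{\mu_I}\bar{I}, \quad \bar{V} = \frac{\pi\delta}{\mu_V\mu_I}\bar{I},$$

where $\mu_I = \mu_I(0) + \Delta\mu_I$. To ensure the positivity of parameters, we applied a logarithmic transformation to all parameters (denoted by a tilde). For each subject i, the parameter vector is: $(\tilde{\mu}^i_I(0), \tilde{\alpha}^i, \tilde{\lambda}^i, \tilde{\mu}^i_T, \tilde{\pi}^i, \tilde{\gamma}^i, \tilde{\rho}^i, \tilde{\mu}^i_V, \tilde{\delta}^i, \tilde{\mu}^i_Q, \widetilde{\Delta\mu}_I)$.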
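The reproductive number and the infected equilibrium can be checked numerically. In the sketch below the equilibrium is computed through compact intermediate relations ($\bar{Q} = (\lambda + \rho\bar{T})/(\alpha + \mu_Q)$ and $\bar{I} = (\alpha\bar{Q} - (\rho + \mu_T)\bar{T})/(\delta + \mu_I)$), which follow from setting the right-hand sides of system (10) to zero and expand to the expressions given above. The dictionary keys and any parameter values used with these functions are illustrative assumptions, not estimates from the paper.

```python
def r0(p, mu_I):
    # Reproductive number, with mu_I the plateau death rate of infected cells.
    denom = p["rho"] * p["mu_Q"] + p["alpha"] * p["mu_T"] + p["mu_T"] * p["mu_Q"]
    num = p["gamma"] * p["pi"] * p["lam"] * p["alpha"] * p["delta"]
    return num / (mu_I * p["mu_V"] * (p["delta"] + mu_I) * denom)

def infected_equilibrium(p, mu_I):
    # Steady state of system (10) with V > 0; meaningful only when r0 > 1.
    g, d, pi_ = p["gamma"], p["delta"], p["pi"]
    T = (d + mu_I) * p["mu_V"] * mu_I / (g * d * pi_)
    Q = (p["lam"] + p["rho"] * T) / (p["alpha"] + p["mu_Q"])
    I = (p["alpha"] * Q - (p["rho"] + p["mu_T"]) * T) / (d + mu_I)
    P = d * I / mu_I
    V = pi_ * P / p["mu_V"]
    return Q, T, I, P, V
```

A useful consequence visible in these formulas is that the equilibrium number of target cells equals the pre-infection value divided by $R_0$, so $\bar{I}$ is positive exactly when $R_0 > 1$.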

3.2 Model for the observations

Generally, we do not directly observe all the components of the system, but rather M < 5 functions of $X^i = (Q^i, T^i, I^i, P^i, V^i)$. We observe $Y^i_{jm}$, the jth measurement of the mth observable component for subject i at time $t^i_{jm}$. The function g defined in equation (1) is then $g = (g_1, \dots, g_M)$ so we have:

$$Y^i_{jm} = g_m(X^i(t^i_{jm}, \tilde{\xi}^i)) + \varepsilon^i_{jm}, \quad j = 1, \dots, n_{im}, \quad m = 1, \dots, M, \tag{12}$$

where the $\varepsilon^i_{jm}$ are independent Gaussian measurement errors with zero means and variances $\sigma^2_m$. Specifically, the observed components were the base-10 logarithm of HIV RNA load ($g_1(X^i) = \log_{10}(V^i)$) and the fourth root of total CD4 count ($g_2(X^i) = (Q^i + T^i + I^i + P^i)^{0.25}$). These transformations are commonly used for achieving normality and homoscedasticity of measurement error distributions (Thiébaut et al., 2003). Random effects can be included for each parameter allowing different individual values; they were selected using score tests (Drylewicz et al., 2010a).
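The two observation functions are short enough to write out. The sketch below maps a state vector to the two observed scales; the function name and the tuple ordering (Q, T, I, P, V) are assumptions for illustration, and measurement error and left-censoring of the viral load are deliberately left out.

```python
import math

def observe(state):
    # state = (Q, T, I, P, V); returns (g1, g2) on the modeling scales.
    Q, T, I, P, V = state
    log10_viral_load = math.log10(V)            # g1: log10 of HIV RNA load
    cd4_fourth_root = (Q + T + I + P) ** 0.25   # g2: fourth root of total CD4 count
    return log10_viral_load, cd4_fourth_root
```

Note that $g_1$ is undefined for V = 0, so predictions on this scale only make sense once virions have been introduced into the system.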

4 Application to the CASCADE dataset

4.1 The study sample

CASCADE is an international collaboration pooling data from HIV-1-infected patients with a known window for their date of seroconversion and followed in Europe, Canada and Australia (CASCADE, 2003). Data were pooled across the 2009 update from 25 participating cohorts. We selected HIV seroconverters: (i) if the delay between the date of last negative HIV test ($d^-$) and the date of first positive HIV test ($d^+$) was less than 6 months, (ii) if they did not receive any antiretroviral treatment during the first year following $d^+$ and (iii) if they had more than 3 measurements of CD4 or viral load during the year of follow-up with the first detectable viral load measurement during the first three months after $d^+$. A total of 710 patients satisfied these criteria. This sample is different from the one used in our previous papers (Drylewicz et al., 2010a,b) because the CASCADE dataset has been updated and we focused on patients with a short delay between the dates of last negative and first positive HIV test (less than 6 months). The characteristics of patients are presented in Table 2. Selected patients had a median of 5 measurements of CD4 count (InterQuartile Range (IQR) = [4; 6]) and 4 measurements of HIV RNA load (IQR = [4; 5]). Follow-up was censored after 1 year beyond $d^+$, resulting in a median follow-up after $d^+$ of 295 days (IQR = [231; 335]). The median delay between $d^-$ and $d^+$ was 91 days (IQR = [49; 132]). Figure 2 shows the observed CD4 and viral load data for the 710 selected patients from the last negative HIV test and the prediction from a local polynomial regression (PROC LOESS; SAS Institute, Cary, N.C.), which uses a non-parametric, weighted least-squares method for fitting local regression surfaces (Cohen, 1999). Some patients had a detectable viral load at the time of their last negative HIV test; therefore they were already infected. The early decreasing trend in the viral load follows the natural evolution after the peak which occurs during primary HIV infection. It demonstrates the relative richness of this dataset based on observational cohorts.

[Figure 2 panels: HIV-1 RNA (per mL, log scale) and total CD4 counts (per mm$^3$), each plotted against days since last negative serology.]

Figure 2: Available data of HIV RNA (left) and CD4 count (right) for the 710 selected patients from the CASCADE Collaboration and the predictions from a local polynomial regression.

4.2 Elicitation of the prior and pseudo-observations

We assumed log-normal prior distributions for all biological parameters based on the literature. For some parameters, like the viral death rate ($\mu_V$) or the delay to become a productive infected cell ($\delta$), we have a good knowledge of the range of values leading to tight (informative) prior distributions. Therefore, prior distributions were chosen to contain the values from the literature within their 95% confidence intervals. For other parameters, like the infection rate ($\gamma$), we have only a vague knowledge leading us to choose wide (uninformative) prior distributions. Table 3 presents the prior distributions used for estimation. Important additional information is provided by the distribution of CD4 counts among non-infected subjects. This distribution can be computed from the parameters of the model, allowing information to be taken into account by adding constraints that the theoretical distribution equals the known distribution. There are two drawbacks: (i) the distribution of CD4 counts is not perfectly known; (ii) the penalized likelihood with the added constraint would no longer be that derived from normal prior distributions. An alternative is to introduce pseudo-observations of CD4 counts for a certain number $N_H$ of uninfected subjects. These observations should have a distribution which fits what has been observed, and the number $N_H$ needs to be close to the number of subjects used in the studies providing this information (Reichert et al., 1991; Bofill et al., 1992; Maini et al., 1996; Provinciali et al., 2009). Thus we assumed that we had $N_H = 700$ observations coming from a normal distribution with mean 900 and variance of $150^2$. For simplicity, we assumed six intervals of CD4 values ($I_i$, i = 1, ..., 6): 600-700, 700-800, 800-900,


Table 2: Characteristics of 710 patients from the CASCADE dataset included in this study.

Characteristics Total Gender Exposure group

Year of enrollment

Reported Seroconversion illness

Age at first positive serology Delay between last negative and first positive serology (in days) Numbers of CD4 measurements Numbers of HIV RNA measurements

Men Women Sex between men Sex between men and women IDU* Unknown Other Sex between men + IDU Sex between men and women + IDU 1985-1995 1996-2000 2001-2008 No Yes

N (%) 710 (100) 558 (79) 152 (21) 456 (64) 211 (30) 20 (3) 14 (2) 4 (