
Incorporating unobserved heterogeneity in Weibull survival models: A Bayesian approach

Catalina A. Vallejos and Mark F.J. Steel*
Department of Statistics, University of Warwick

Abstract Flexible classes of survival models are proposed that naturally deal with both outlying observations and unobserved heterogeneity. We present the family of Rate Mixtures of Weibull distributions, for which a random effect is introduced through the rate parameter. This family contains the well-known Lomax distribution and can accommodate flexible hazard functions. Covariates are introduced through an Accelerated Failure Time model and we explicitly take censoring into account. We construct a weakly informative prior that combines the structure of the Jeffreys prior with a proper prior on the parameters of the mixing distribution, which is induced through a prior on the coefficient of variation. This improper prior is shown to lead to a proper posterior distribution under mild conditions. The mixing structure is exploited in order to provide an outlier detection method. Our methods are illustrated using two real datasets, one concerning bone marrow transplants and another on cerebral palsy.

Keywords: Frailty model; Life distribution; Lomax distribution; Outlier detection; Posterior existence

1 Introduction

We propose flexible classes of survival distributions that provide a natural way to deal with outlying observations and other forms of unobserved heterogeneity. This framework extends standard survival models by adding a random effect, and is thus a special case of frailty models with a subject-specific frailty term. In particular, we present the family of Rate Mixtures of Weibull (RMW) distributions, for which the frailty is introduced via the rate parameter. This family contains, among others, the Lomax distribution, which is widely used for heavy-tailed models. The RMW family accommodates flexible hazard shapes and relates to the mixed Proportional Hazards (PH) model developed in econometrics (e.g. Heckman and Singer, 1984a). Our analysis allows an arbitrary parametric random effect (or mixing) distribution and mostly focuses on an Accelerated Failure Time (AFT) specification, where the interpretation of the regression coefficients is unaffected by the choice of mixing distribution. We consider Bayesian inference with such models under weakly informative improper prior distributions, combining the structure of the Jeffreys prior with a proper (informative) prior, elicited through the coefficient of variation. We provide mild and easily verified conditions for posterior existence. The appropriateness of different mixing distributions is assessed using standard Bayesian model comparison methods. Our methodology mitigates the effect of extreme observations and provides an outlier detection method by exploiting the mixing structure.

Section 2 discusses the use of mixture families of survival distributions as a remedy for the lack of robustness of standard models and for departures from the assumption of homogeneity between observations. It introduces the RMW family and derives some of its properties. Covariates are introduced through an AFT model. Section 3 provides an extensive analysis of Bayesian inference for these regression models, allowing for right-censored observations. In Section 4, our methods are illustrated using two real datasets, one concerning bone marrow transplants and another on cerebral palsy. Finally, Section 5 concludes. All the proofs are contained in the Appendix without mention in the text. Code (in R) to implement these models is freely available at http://www.warwick.ac.uk/go/msteel/steel homepage/software/rmwcode.zip, as well as supplementary material with details of the implementation.

* Corresponding author: Mark Steel, Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK; email: [email protected]. Catalina Vallejos acknowledges funding from the University of Warwick and the Pontificia Universidad Católica de Chile. We are grateful to P.O.D. Pharoah and Jane Hutton for access to the cerebral palsy dataset, and thank Jane Hutton for insightful and constructive comments.

2 Mixtures of survival distributions

As in Vallejos and Steel (2014), this paper considers mixtures of lifetime distributions in order to account for unobserved heterogeneity and to add robustness to the presence of outliers. Denote by $T_i$, $i \in \{1, \ldots, n\}$, the survival times of $n$ independent individuals. We assign a mixture of survival distributions to $T_i$, i.e.

$$f(t_i|\psi, \theta) \equiv \int_{\mathcal{L}} f(t_i|\psi, \Lambda_i = \lambda_i)\, dP_{\Lambda_i}(\lambda_i|\theta),$$

where the underlying $f(\cdot|\psi, \Lambda_i = \lambda_i)$ is a "standard" lifetime density with parameters $\psi, \lambda_i$. A mixing distribution $P_{\Lambda_i}(\cdot|\theta)$ with parameter $\theta$ and support $\mathcal{L}$ is assigned to $\lambda_1, \ldots, \lambda_n$. The so-called mixing parameters are subject-specific random (or frailty) effects and the spread of $P_{\Lambda_i}(\cdot|\theta)$ controls the degree of unobserved heterogeneity. Throughout, we assume $\mathcal{L} = \mathbb{R}_+$, but finite mixtures can also be generated by a discrete $\mathcal{L}$.

Varying the underlying model generates a wide class of distributions. This article focuses on mixtures of Weibull distributions. Other examples are mixtures of Birnbaum-Saunders distributions (Balakrishnan et al., 2009) and mixtures of log-normal distributions (Vallejos and Steel, 2014). If the underlying model is supported by theoretical or practical reasons, the intuition is preserved by the mixture. Conditional on the $\lambda_i$'s, survival times are distributed as in the underlying model but with a different $\lambda_i$ for each individual. For example, if theory suggests that individuals have a constant hazard rate, an exponential model is appropriate. Using mixtures of exponential distributions leads to a decreasing hazard rate, yet does not contradict this theory. In such a case, the individual hazard remains constant over time but high-risk subjects will tend to die earlier, so that a higher proportion of low-risk subjects is left to be observed at later times.
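The selection effect described above can be checked numerically. The following is a minimal sketch, assuming (purely for illustration) that the individual exponential rates follow a gamma distribution with a given shape and rate; the function names are hypothetical and not part of the authors' code. It uses the standard Laplace transform of the gamma distribution to obtain the population survival function.

```python
import math

def mixture_survival(t, shape, rate):
    """Population survival S(t) = E[exp(-Lambda * t)] for Lambda ~ gamma(shape, rate)."""
    return (rate / (rate + t)) ** shape

def mixture_hazard(t, shape, rate):
    """Population hazard h(t) = -d/dt log S(t) = shape / (rate + t)."""
    return shape / (rate + t)

def numeric_hazard(t, shape, rate, h=1e-5):
    """Finite-difference check of the hazard, computed from the survival function."""
    upper = math.log(mixture_survival(t + h, shape, rate))
    lower = math.log(mixture_survival(t - h, shape, rate))
    return -(upper - lower) / (2.0 * h)

# Every individual has a constant hazard, yet the population hazard decreases in t.
times = [0.0, 0.5, 1.0, 2.0, 4.0]
hazards = [mixture_hazard(t, shape=2.0, rate=2.0) for t in times]
```

At time zero the population hazard equals the mean individual rate (shape/rate); it then declines as high-rate individuals leave the risk set.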

2.1 Rate Mixtures of Weibull distributions

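Definition 1. Let $T_i$ be a positive-valued random variable distributed as a Rate Mixture of Weibull distributions (RMW). Its density function is defined as

$$f(t_i|\alpha, \gamma, \theta) = \int_0^\infty \gamma \alpha \lambda_i t_i^{\gamma-1} e^{-\alpha \lambda_i t_i^{\gamma}}\, dP_{\Lambda_i}(\lambda_i|\theta), \quad t_i, \alpha, \gamma > 0,\ \theta \in \Theta, \tag{1}$$

where $\lambda_i$ is a realization of a random variable $\Lambda_i \sim P_{\Lambda_i}(\cdot|\theta)$ with support on $\mathbb{R}_+$. Denote this by $T_i \sim \mathrm{RMW}_P(\alpha, \gamma, \theta)$. A hierarchical representation of (1) is

$$T_i|\alpha, \gamma, \Lambda_i = \lambda_i \sim \mathrm{Weibull}(\alpha\lambda_i, \gamma), \qquad \Lambda_i|\theta \sim P_{\Lambda_i}(\cdot|\theta). \tag{2}$$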
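The hierarchical representation (2) suggests a direct way to simulate from an RMW distribution. The sketch below is illustrative only: it assumes gamma(θ, θ) mixing (shape θ, rate θ), and the function names are hypothetical. The closed-form survival function used for comparison follows from the standard Laplace transform of the gamma distribution.

```python
import math
import random

def sample_rmw_gamma(alpha, gamma, theta, rng):
    """One draw of T via (2), assuming Lambda ~ gamma(shape=theta, rate=theta)."""
    lam = rng.gammavariate(theta, 1.0 / theta)  # gammavariate takes (shape, scale)
    e = rng.expovariate(1.0)
    # If E ~ exponential(1), then (E / (alpha * lam)) ** (1 / gamma) has
    # survival function exp(-alpha * lam * t ** gamma), i.e. Weibull(alpha * lam, gamma).
    return (e / (alpha * lam)) ** (1.0 / gamma)

def survival_gamma_mixing(t, alpha, gamma, theta):
    """Marginal survival under gamma(theta, theta) mixing."""
    return (1.0 + alpha * t ** gamma / theta) ** (-theta)

rng = random.Random(1)
draws = [sample_rmw_gamma(1.0, 2.0, 5.0, rng) for _ in range(50000)]
empirical = sum(d > 1.0 for d in draws) / len(draws)
exact = survival_gamma_mixing(1.0, 1.0, 2.0, 5.0)
```

Comparing `empirical` with `exact` gives a quick sanity check that the two-stage sampler targets the marginal distribution in (1).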

The Weibull distribution is routinely applied in survival analysis. Its flexibility allows for both increasing and decreasing hazard rates. However, the presence of unobserved heterogeneity can invalidate its use. If neglected, unobserved heterogeneity can lead to an incorrect estimation of the individual hazard rate (Omori and Johnson, 1993). Special cases in the RMW family appear in the existing literature, where $\gamma$ is often fixed at 1 and the mixing parameters are gamma distributed (e.g. Jewell, 1982; Abbring and van den Berg, 2007). We refer to the case with $\gamma = 1$ as the Rate Mixtures of Exponentials (RME) family (denoted by $T_i \sim \mathrm{RME}_P(\alpha, \theta)$). The RME case can be extended to the RMW family via a power transformation (if $T_i \sim \mathrm{RME}_P(\alpha, \theta)$ then $T_i^{1/\gamma} \sim \mathrm{RMW}_P(\alpha, \gamma, \theta)$). If $\gamma \le 1$, the hazard rate induced by the mixture decreases regardless of the mixing distribution (Marshall and Olkin, 2007). For $\gamma > 1$, it has a more flexible shape and can accommodate non-monotonic behaviour.

The mixing distribution can, in principle, correspond to any proper probability distribution. However, some restrictions are required for identifiability reasons. The following theorem provides some identifiability conditions for $(\alpha, \gamma, \theta)$, which will be imposed for inference throughout. In particular, the use of (separate) unknown scale parameters for the mixing distribution is precluded. This can be achieved by either fixing its scale parameter or by imposing the restriction $E(\Lambda_i|\theta) = 1$. We use the latter for gamma mixing, since it leads to better properties of the MCMC sampler for posterior inference in Subsection 3.3. For the other mixtures explored here, the sampler performs better if we fix the scale of the mixing distribution.

Theorem 1. Let $T_i \sim \mathrm{RMW}_P(\alpha, \gamma, \theta)$ as in (1). $(\alpha, \gamma, \theta)$ is identified by the distribution of $T_i$ if and only if: (i) $E(\Lambda_i|\theta)$ is finite and (ii) $(\alpha, \theta)$ is identified by the distribution of $\alpha\Lambda_i$.

Random variables in the RMW family do not necessarily have finite moments of any order, and the existence of finite moments is linked to the moments of $\Lambda_i^{-1/\gamma}$.

Theorem 2. Let $T_i \sim \mathrm{RMW}_P(\alpha, \gamma, \theta)$. The $r$-th moment of $T_i$ ($r \ge 0$) is finite if and only if $E_{\Lambda_i}(\Lambda_i^{-r/\gamma}|\theta) < \infty$. If it exists, it corresponds to $\Gamma(1 + r/\gamma)\, \alpha^{-r/\gamma}\, E_{\Lambda_i}(\Lambda_i^{-r/\gamma}|\theta)$.
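As an illustration of Theorem 2, the moments can be evaluated in closed form whenever the negative moments of the mixing distribution are known. The sketch below assumes gamma(θ, θ) mixing (shape θ, rate θ), for which $E(\Lambda_i^{-p}|\theta) = \theta^{p}\,\Gamma(\theta - p)/\Gamma(\theta)$ when $\theta > p$ (a standard gamma integral); the function names are hypothetical. The coefficient of variation is then obtained from the first two moments.

```python
import math

def neg_moment_gamma_mixing(power, theta):
    """E[Lambda ** (-power)] for Lambda ~ gamma(shape=theta, rate=theta).

    Finite only when theta > power; returns infinity otherwise.
    """
    if theta <= power:
        return math.inf
    return theta ** power * math.exp(math.lgamma(theta - power) - math.lgamma(theta))

def rmw_moment(r, alpha, gamma, theta):
    """r-th moment of T as in Theorem 2, under gamma(theta, theta) mixing."""
    p = r / gamma
    return math.gamma(1.0 + p) * alpha ** (-p) * neg_moment_gamma_mixing(p, theta)

def rmw_cv(alpha, gamma, theta):
    """Coefficient of variation of T computed from the first two moments."""
    m1 = rmw_moment(1, alpha, gamma, theta)
    m2 = rmw_moment(2, alpha, gamma, theta)
    return math.sqrt(m2 / m1 ** 2 - 1.0)
```

For instance, with $\alpha = \gamma = 1$ and $\theta = 3$ the mean is 1.5, while the second moment is infinite as soon as $\theta \le 2/\gamma$.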
Figure 1: Density and hazard function (left and right panels, respectively) of some RME models ($\alpha = 1$), plotted against $t$. The solid line is the exponential(1) density (hazard).

The expression in (3) simplifies to $\sqrt{2\,\mathrm{Var}_{\Lambda_i}(\Lambda_i^{-1}|\theta)/E^2_{\Lambda_i}(\Lambda_i^{-1}|\theta) + 1}$ when $\gamma = 1$. Corollary 1 indicates that $cv(\gamma, \theta)$ is an increasing function of $cv^*(\gamma, \theta)$, which is the coefficient of variation of $\Lambda_i^{-1/\gamma}$ given $\theta$. In addition, for the same value of $\gamma$, the coefficient of variation of the Weibull distribution $cv^W(\gamma)$ is a lower bound for $cv(\gamma, \theta)$ and they are equal if and only if $\Lambda_i = \lambda_0$ with probability 1. Therefore, evidence of unobserved heterogeneity can be quantified in terms of the ratio

$$R_{cv}(\gamma, \theta) = \frac{cv(\gamma, \theta)}{cv^W(\gamma)}, \tag{4}$$

defined as the inflation that the mixture induces in the coefficient of variation (w.r.t. a Weibull model with the same $\gamma$). If $\theta$ is such that $cv^*(\gamma, \theta)$ goes to zero, then $R_{cv}(\gamma, \theta)$ tends to one and the mixture reduces to the underlying Weibull model itself. If $\gamma \to 0$, $cv^W(\gamma)$ and, consequently, $cv(\gamma, \theta)$ become unbounded. In that case, $R_{cv}(\gamma, \theta)$ behaves as $\sqrt{[cv^*(\gamma, \theta)]^2 + 1}$. If $\gamma = 1$, then $R_{cv}(\gamma, \theta) = cv(1, \theta)$. Throughout the rest of this paper, we restrict the range of $(\gamma, \theta)$ such that $cv$ is finite (this restriction is not required when $\theta$ does not appear). This facilitates the implementation of Bayesian inference (see Section 3).

Figure 2: Some RMW models ($\alpha = 1$). Panels show the density function and the hazard function for $\gamma = 0.5$ and $\gamma = 2$, plotted against $t$ for $\theta = 1$ and $\theta = 5$. The mixing distribution is gamma($\theta$, $\theta$) (exponential(1) for $\theta = 1$). The solid line is the Weibull(1, $\gamma$) density and hazard function.

Heckman and Singer (1984b) remark that inference is sensitive to the mixing distribution and thus use non-parametric mixing. Non-parametric mixtures of Weibull distributions (mixing on both parameters) are studied using a Bayesian approach in Kottas (2006). However, non-parametric mixing might not be appropriate for moderate sample sizes. We opt for a fully parametric approach and the adequacy of a particular mixing distribution is evaluated using Bayesian model comparison tools. This is a compromise between the standard model ($\Lambda_i = \lambda_0$ with probability one) and fully flexible non-parametric mixing.

The survival function of random variables in the RMW family is the Laplace transform of the mixing density evaluated at $\alpha t_i^{\gamma}$ (Wienke, 2010). Hence, mixing densities with known Laplace transform, e.g. the Power Variance Function (PVF) family (Wasinrat et al., 2013), are an attractive choice. The positive stable distribution is a limiting case of the PVF family (Wienke, 2010) and the resultant model is the Weibull distribution itself. Other examples in this family are the gamma and the inverse Gaussian distributions. The gamma distribution is perhaps the most popular choice because of the simple analytical expressions. Abbring and van den Berg (2007) also give an asymptotic argument for this choice. If $\gamma = 1$, gamma($\theta$, 1) mixing generates the Lomax model (Lomax, 1954), which is widely used as a heavy-tailed distribution. Some mixing distributions (e.g. log-normal) do not lead to analytical expressions for the resulting density.
In those cases, Bayesian inference can be conducted using data augmentation and the hierarchical representation (2). Alternatively, a maximum likelihood analysis can be implemented by means of an EM algorithm.

Table 1 displays some examples in the RME family and this list can be extended by selecting other mixing distributions. All these examples generalize to the RMW case via the power transformation mentioned earlier. Figure 1 shows the RME densities produced by these examples for different values of $\theta$. The density is decreasing (as in the exponential case) but the tail behaviour is very flexible. Figure 1 also illustrates that the hazard function decreases over time but that its gradient varies among the different mixing distributions (see also Marshall and Olkin, 2007). Figure 2 illustrates the effect of a gamma($\theta$, $\theta$) mixing (which generates a reparametrized version of the Lomax distribution) for RMW models. Whereas the shape of the density function is not greatly affected in this example, the effect on the hazard rate is more pronounced. For instance, while the hazard rate of the Weibull is an increasing function of $t_i$ when $\gamma = 2$, the hazard of the mixture exhibits non-monotonic behaviour.
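The non-monotonic hazard mentioned above can be reproduced with a few lines of code. This is a sketch under the assumption of gamma(θ, θ) mixing, where the Laplace transform gives the marginal survival function $(1 + \alpha t^{\gamma}/\theta)^{-\theta}$ and hence the hazard $\alpha\gamma t^{\gamma-1}/(1 + \alpha t^{\gamma}/\theta)$; the function names and the evaluation grid are illustrative choices.

```python
def rmw_hazard_gamma_mixing(t, alpha, gamma, theta):
    """Marginal hazard -d/dt log S(t) with S(t) = (1 + alpha * t**gamma / theta) ** (-theta)."""
    return alpha * gamma * t ** (gamma - 1.0) / (1.0 + alpha * t ** gamma / theta)

def weibull_hazard(t, alpha, gamma):
    """Hazard of the underlying Weibull(alpha, gamma) model (no mixing)."""
    return alpha * gamma * t ** (gamma - 1.0)

# gamma = 2: the Weibull hazard increases, the mixture hazard rises and then falls.
grid = [0.25 * i for i in range(1, 17)]  # 0.25, 0.50, ..., 4.00
weib = [weibull_hazard(t, 1.0, 2.0) for t in grid]
mix = [rmw_hazard_gamma_mixing(t, 1.0, 2.0, 1.0) for t in grid]
peak_time = grid[mix.index(max(mix))]
```

With $\alpha = 1$, $\gamma = 2$ and $\theta = 1$ the mixture hazard is $2t/(1 + t^2)$, which peaks at $t = 1$.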

2.2 A regression model for the RMW family

AFT and PH models are widely used in applied survival analysis. A Weibull regression can be equivalently written in terms of both specifications. Let $x_i$ be a vector containing the values of $k$ covariates associated with survival time $i$ and let $\beta \in \mathbb{R}^k$ be a vector of parameters. In the RMW-AFT model, the covariates affect the time scale through the parameter $\alpha$. This model is defined as

$$T_i \sim \mathrm{RMW}_P(\alpha_i, \gamma, \theta), \quad \alpha_i = e^{-\gamma x_i'\beta}, \quad i = 1, \ldots, n, \tag{5}$$

or equivalently,

$$\log(T_i) = x_i'\beta + \log(\Lambda_i^{-1/\gamma}\, T_0), \text{ with } \Lambda_i|\theta \sim P_{\Lambda_i}(\theta) \text{ and } T_0|\gamma \sim \mathrm{Weibull}(1, \gamma), \quad i = 1, \ldots, n. \tag{6}$$

The RMW-AFT is itself an AFT model with baseline survival function given by the distribution of $T_0' = \Lambda_i^{-1/\gamma}\, T_0$, $T_0'|\theta \sim \mathrm{RMW}_P(1, \gamma, \theta)$. Under this model, $e^{\beta_j}$ can be interpreted as the proportional marginal change of the lifetime distribution percentiles (e.g. median) after a unit change in covariate $j$. For $\beta^* = -\gamma\beta$, (5) is equivalent to the PH-RMW model with hazard function

$$h(t_i|\beta^*, \gamma, \Lambda_i = \lambda_i; x_i) = \gamma \lambda_i t_i^{\gamma-1} e^{x_i'\beta^*}, \quad \Lambda_i \sim P(\Lambda_i|\theta), \quad i = 1, \ldots, n.$$

Such models are also known as mixed PH models and are popular in econometrics (e.g. Heckman and Singer, 1984a). Even though the PH-RMW model is a mixture of PH models, the PH assumption is generally not preserved. Only the positive stable mixing distribution retains this property (Wienke, 2010). In the PH-RMW model, $e^{\beta^*_j}$ is interpreted as the proportional marginal change of the hazard rate after a unit change in covariate $j$ at an individual level (conditional on $\lambda_i$). Unlike for the RMW-AFT model, this interpretation cannot be extended to the population level. Most of the earlier literature on unobserved heterogeneity is in terms of the PH model. Nevertheless, here we will present results in terms of the RMW-AFT representation, since the interpretation of the regression coefficients is clearer and the mixture model is still an AFT model.
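The population-level AFT interpretation can be illustrated with a small calculation. The sketch below assumes gamma(θ, θ) mixing, for which the marginal survival function $(1 + \alpha_i t^{\gamma}/\theta)^{-\theta}$ can be inverted explicitly; the covariate values, coefficients and function name are hypothetical.

```python
import math

def rmw_aft_median(x, beta, gamma, theta):
    """Median of T_i under (5), assuming gamma(theta, theta) mixing."""
    linpred = sum(xj * bj for xj, bj in zip(x, beta))
    alpha = math.exp(-gamma * linpred)
    # Solve (1 + alpha * t**gamma / theta) ** (-theta) = 1/2 for t.
    return (theta * (2.0 ** (1.0 / theta) - 1.0) / alpha) ** (1.0 / gamma)

beta = [1.0, 0.4]                       # intercept and one covariate (illustrative values)
base = rmw_aft_median([1.0, 0.0], beta, gamma=1.5, theta=3.0)
shifted = rmw_aft_median([1.0, 1.0], beta, gamma=1.5, theta=3.0)
ratio = shifted / base                  # unit change in the covariate
```

The ratio of medians equals $e^{\beta_j}$ whatever the values of $\gamma$ and $\theta$, which is the sense in which the coefficients keep their meaning under mixing.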

3 Bayesian Inference for the RMW-AFT model

3.1 The prior

First, we define a prior for the RME-AFT model (i.e. fixing $\gamma = 1$). In the absence of prior information, a popular choice is to use priors based on the Jeffreys rule, which require the Fisher information matrix (FIM).

Theorem 3. Let $T_1, \ldots, T_n$ be independent random variables distributed according to (5) with $\gamma = 1$ and $X = (x_1 \cdots x_n)'$. The FIM corresponds to

$$I(\beta, \theta) = \begin{pmatrix} k_1(\theta) X'X & k_2(\theta) X'1_n \\ k_2(\theta) 1_n'X & n k_3(\theta) \end{pmatrix},$$

where $k_1(\theta)$, $k_2(\theta)$ and $k_3(\theta)$ are functions of only $\theta$ (see Appendix) and $1_n$ is a column vector of $n$ ones.

Corollary 2. Under the assumptions of Theorem 3, assume also that $X$ has rank $k$ ($n > k$) and $\theta$ is a scalar parameter. The Jeffreys prior and the independence Jeffreys prior (which deals separately with the blocks for $\beta$ and $\theta$) for the RME-AFT model are, respectively

$$\pi^J(\beta, \theta) \propto k_1^{k/2}(\theta)\, k_3^{1/2}(\theta) \left[1 - \frac{k_2^2(\theta)}{n k_1(\theta) k_3(\theta)}\, 1_n'X(X'X)^{-1}X'1_n\right]^{1/2},$$

$$\pi^I(\beta, \theta) \propto k_3^{1/2}(\theta).$$

These two Jeffreys-style priors can be expressed as

$$\pi(\beta, \theta) \propto \pi(\theta), \tag{7}$$

where $\pi(\theta)$ is the part of the prior that depends on $\theta$. Although Corollary 2 provides a general structure for these priors, the actual expressions are not easily derived (even for a particular mixing distribution). One alternative is to compute the FIM directly from the resultant density. For example, in the simple case of a gamma($\theta$, 1) mixing distribution the Jeffreys and independence Jeffreys priors are, respectively

$$\pi^J(\beta, \theta) \equiv \pi^J(\theta) \propto \left[\frac{\theta}{\theta + 2}\right]^{k/2} \frac{1}{\theta} \left[1 - \frac{\theta(\theta + 2)}{n(\theta + 1)^2}\, 1_n'X(X'X)^{-1}X'1_n\right]^{1/2}$$

and

$$\pi^I(\beta, \theta) \equiv \pi^I(\theta) \propto \frac{1}{\theta}.$$

Even though this is one of the simplest RME models, $\pi^J(\theta)$ is very involved, depending on $k$, $n$ and $X$. For other mixtures, these priors become more complicated (already if we use gamma($\theta$, $\theta$) mixing instead) and have no easy derivation. If the resultant distribution does not have a closed analytical form (e.g. with log-normal mixing), computing the FIM is very challenging. In addition, there is no guarantee of having a proper prior for $\theta$ when using an arbitrary mixing distribution. For instance, in the case above, $\pi^J(\theta)$ and $\pi^I(\theta)$ are not proper densities (both behave as $1/\theta$ for large $\theta$). As the role of $\theta$ is specific to each mixture, improper priors for $\theta$ will not allow the comparison between models in the RME family using Bayes factors.

Table 2: Relationship between $cv$ and $\theta$ for some RME models. $\Theta = (0, \infty)$, unless otherwise stated.

Mixing density | Range of $cv$ | $cv(\theta)$ | $dcv(\theta)/d\theta$
Gamma($\theta$, $\theta$), $\theta > 2$ | $(1, \infty)$ | $\sqrt{\theta/(\theta - 2)}$ | $-\theta^{-1/2}(\theta - 2)^{-3/2}$
Inverse-gamma($\theta$, 1) | $(1, \sqrt{3})$ | $\sqrt{(\theta + 2)/\theta}$ | $-\theta^{-3/2}(\theta + 2)^{-1/2}$
Inverse-Gaussian($\theta$, 1) | $(1, \sqrt{5})$ | $\sqrt{(5\theta^2 + 4\theta + 1)/(\theta^2 + 2\theta + 1)}$ | $(3\theta + 1) / [(5\theta^2 + 4\theta + 1)^{1/2}(\theta + 1)^2]$
Log-normal(0, $\theta$) | $(1, \infty)$ | $\sqrt{2e^{\theta} - 1}$ | $e^{\theta}(2e^{\theta} - 1)^{-1/2}$

To overcome these issues, we propose a simplification of the Jeffreys-style priors. We keep the structure in (7) but assign a proper $\pi(\theta)$. The comparison between models is meaningful if, regardless of the mixing distribution, $\pi(\theta)$ reflects the same prior information (i.e. the priors are "matched"). We achieve this by exploiting the relationship between $\theta$ and $cv$, the coefficient of variation of the survival times. A proper prior, which is common for all models, is proposed for $cv$ and denoted by $\pi^*(cv)$. As $cv$ does not involve $\beta$ (expression (3) does not involve $\alpha$), $\pi^*(cv)$ only provides information about $\theta$. Using (3), the functional relationship between $cv$ and $\theta$ for some distributions in the RME family is derived (see Table 2). The inverse function of $cv(\theta)$ must exist ($cv(\theta)$ must be injective), yet an explicit expression is not required. Injectivity holds for all the examples in Table 2 ($cv(\theta)$ is a monotone function of $\theta$), as illustrated by Figure 3. The induced prior for $\theta$ is then easily derived by a change of variable. When comparing a model with $\theta$ to models without $\theta$, meaningful results derive from the fact that the prior on $\theta$ is reasonable.

Figure 3: Relationship between $(\gamma, \theta)$ and $cv$ for some RMW models. Panels show $cv(\gamma, \theta)$ against $\theta$ for Gamma($\theta$, $\theta$), Inv-Gamma($\theta$, 1), Inv-Gaussian($\theta$, 1) and Log-normal(0, $\theta$) mixing. Solid, dashed and dotted lines are for $\gamma = 0.5$, 1 and 2, respectively. Dashed lines indicate the relationship between $\theta$ and $cv$ for RME models.

Two natural choices for $\pi^*(cv)$ are the truncated exponential and Pareto type I distributions (both on $(1, \infty)$) with hyper-parameters $a$ and $b$, respectively. These priors cover a wide set of tails for $cv$. Larger values of $a$ and $b$ assign larger probabilities to small values of $cv$ (we restrict $b > 1$ in order to have a finite expectation for $cv$). These hyper-parameters can be elicited e.g. using the mean of $cv$. The expected values under these priors are $1 + 1/a$ and $b/(b - 1)$ respectively, which are equal for $b = a + 1$. When the range of $cv$ differs from $(1, \infty)$ (e.g. with an inverse gamma and inverse Gaussian mixing distribution), these priors can be adjusted by truncating $\pi^*(cv)$. If the values of $a$ and $b$ are such that the prior expectation of $cv$ falls outside the range allowed by a specific model, the prior is deemed to be inconsistent with that model. For example, the RME model with inverse Gaussian mixing should be discarded a priori if $a < (\sqrt{5} - 1)^{-1}$ and $b < 1 + (\sqrt{5} - 1)^{-1}$.

Table 3: $[cv^*(\gamma, \theta)]^2$ and its derivative w.r.t. $\theta$ for some RMW models. $\Theta = (0, \infty)$, unless otherwise stated, $\Psi(\cdot)$ is the digamma function and $K_p(\cdot)$ is the modified Bessel function.

Mixing | $[cv^*(\gamma, \theta)]^2$ | $d[cv^*(\gamma, \theta)]^2/d\theta$
Gamma($\theta$, $\theta$), $\theta > 2/\gamma$ | $\frac{\Gamma(\theta)\Gamma(\theta - 2/\gamma)}{\Gamma^2(\theta - 1/\gamma)} - 1$ | $\frac{\Gamma(\theta)\Gamma(\theta - 2/\gamma)}{\Gamma^2(\theta - 1/\gamma)} [\Psi(\theta) + \Psi(\theta - 2/\gamma) - 2\Psi(\theta - 1/\gamma)]$
Inv-gamma($\theta$, 1) | $\frac{\Gamma(\theta)\Gamma(\theta + 2/\gamma)}{\Gamma^2(\theta + 1/\gamma)} - 1$ | $\frac{\Gamma(\theta)\Gamma(\theta + 2/\gamma)}{\Gamma^2(\theta + 1/\gamma)} [\Psi(\theta) + \Psi(\theta + 2/\gamma) - 2\Psi(\theta + 1/\gamma)]$
Inv-Gaussian($\theta$, 1) | $\sqrt{\frac{\theta\pi}{2}}\, e^{-1/\theta}\, \frac{K_{-(2/\gamma + 1/2)}(1/\theta)}{K^2_{-(1/\gamma + 1/2)}(1/\theta)} - 1$ | $\sqrt{\frac{\pi}{2}}\, \theta^{-3/2} e^{-1/\theta}\, \frac{K_{-(2/\gamma + 1/2)}(1/\theta) K_{-(1/\gamma + 1/2)}(1/\theta) + K_{-(1/\gamma + 1/2)}(1/\theta) K_{-(2/\gamma - 1/2)}(1/\theta) - 2K_{-(2/\gamma + 1/2)}(1/\theta) K_{-(1/\gamma - 1/2)}(1/\theta)}{K^3_{-(1/\gamma + 1/2)}(1/\theta)}$
Log-normal(0, $\theta$) | $e^{\theta/\gamma^2} - 1$ | $\frac{1}{\gamma^2}\, e^{\theta/\gamma^2}$

For a general RMW-AFT model (unknown $\gamma$), the structure of the FIM is more involved than the one in Theorem 3. Thus, Jeffreys-style priors are not easy to obtain. As an alternative, we define

$$\pi(\beta, \gamma, \theta) \propto \pi(\gamma, \theta) \equiv \pi(\theta|\gamma)\pi(\gamma), \tag{8}$$

where $\pi(\theta|\gamma)$ and $\pi(\gamma)$ are proper density functions for $\theta$ and $\gamma$, respectively. This extends the structure in (7) and implies a flat prior for $\beta$. The product structure between $\beta$ and $(\gamma, \theta)$ in (8) is reasonable in our RMW-AFT model, where the interpretation of $\beta$ does not depend on $\gamma$ or $\theta$. Conditional on $\gamma$, we define $\pi(\theta|\gamma)$ as in the RME-AFT case (via a prior for $cv$, $\pi^*(cv)$). Using $cv(\gamma, \theta)$ and $cv^*(\gamma, \theta)$ as defined in (3):

$$\pi(\theta|\gamma) = \pi^*(cv(\gamma, \theta)) \left|\frac{d\,cv(\gamma, \theta)}{d\theta}\right|, \text{ where } \frac{d\,cv(\gamma, \theta)}{d\theta} = \frac{\Gamma(1 + 2/\gamma)}{\Gamma^2(1 + 1/\gamma)}\, \frac{1}{2\,cv(\gamma, \theta)}\, \frac{d[cv^*(\gamma, \theta)]^2}{d\theta}.$$

Table 3 shows $[cv^*(\gamma, \theta)]^2$ and its partial derivative w.r.t. $\theta$ for the mixing distributions used in Table 2. Although some of these expressions are complicated, they can easily be evaluated numerically. Figure 3 shows the relationship between $(\gamma, \theta)$ and $cv$ for some RMW models. As in the RME case, truncated exponential and Pareto type I priors for $cv$ (given $\gamma$) are proposed. These are truncated to $(cv^W(\gamma), \infty)$ (see (3)) but, as with RME models, some mixing distributions impose a finite upper bound for $cv$ ($\left[\frac{\Gamma^2(1 + 2/\gamma)}{\Gamma^4(1 + 1/\gamma)} - 1\right]^{1/2}$ and $\left[\sqrt{\pi}\, \frac{\Gamma(1 + 2/\gamma)}{\Gamma^2(1 + 1/\gamma)}\, \frac{\Gamma(2/\gamma + 1/2)}{\Gamma^2(1/\gamma + 1/2)} - 1\right]^{1/2}$ for inverse gamma and inverse Gaussian mixing, respectively).
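The change of variable from $cv$ to $\theta$ is straightforward to evaluate numerically. The following sketch uses log-normal(0, θ) mixing, for which $[cv^*(\gamma, \theta)]^2 + 1 = e^{\theta/\gamma^2}$, and assumes that the truncated exponential prior on $(cv^W(\gamma), \infty)$ has density $a\,e^{-a(cv - cv^W(\gamma))}$; this form of the truncation, the function names and the integration routine are assumptions made for illustration. A crude midpoint rule checks that the induced prior for $\theta$ integrates to one.

```python
import math

def cv_ratio_constant(gamma):
    """Gamma(1 + 2/gamma) / Gamma(1 + 1/gamma)^2; cv_W(gamma)^2 equals this minus 1."""
    return math.exp(math.lgamma(1.0 + 2.0 / gamma) - 2.0 * math.lgamma(1.0 + 1.0 / gamma))

def induced_prior_lognormal(theta, gamma, a):
    """pi(theta | gamma) induced by a truncated exponential(a) prior on cv."""
    g = cv_ratio_constant(gamma)
    cv_w = math.sqrt(g - 1.0)                          # Weibull lower bound for cv
    growth = math.exp(theta / gamma ** 2)              # [cv*]^2 + 1 for log-normal mixing
    cv = math.sqrt(g * growth - 1.0)
    dcv = g * growth / (gamma ** 2 * 2.0 * cv)         # chain rule for d cv / d theta
    prior_cv = a * math.exp(-a * (cv - cv_w))          # assumed truncated exponential form
    return prior_cv * abs(dcv)

def midpoint_integral(f, lower, upper, steps):
    """Plain midpoint rule, adequate for this smooth integrand."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(steps))

total_rme = midpoint_integral(lambda th: induced_prior_lognormal(th, 1.0, 1.0), 0.0, 40.0, 100000)
total_rmw = midpoint_integral(lambda th: induced_prior_lognormal(th, 2.0, 1.0), 0.0, 60.0, 100000)
```

Because the map from $\theta$ to $cv$ is monotone here, both totals should be close to one, confirming that the induced prior is proper.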

A proposal for $\pi(\gamma)$ is not trivial. A conjugate prior for $\gamma$ in $(0, \infty)$ does not exist (Soland, 1969). A discrete prior for $\gamma$ is conjugate but restrictive and inappropriate in most real situations (especially where no prior information about $\gamma$ is available). Alternatively, Berger and Sun (1993) suggested the use of continuous log-concave priors for $\gamma$. Here, a (not always log-concave) gamma prior is used for $\gamma$.

3.2 The posterior

Censoring is a common feature in survival datasets, which must be taken into account. We assume non-informative censoring. The following proposition states that adding censored observations cannot destroy posterior propriety (and this applies to any survival model).

Proposition 1. Let $t = (t_1, \ldots, t_n)'$ be the recorded survival times of $n$ independent individuals, realizations of random variables with distribution as in (5). Without loss of generality, assume only the first $n_o$ observations are uncensored ($n_o \le n$) and denote by $t_o$ the vector containing all uncensored observations. A sufficient condition for the existence of $\pi(\beta, \gamma, \theta|t)$ is the propriety of $\pi(\beta, \gamma, \theta|t_o)$.

The following theorem covers posterior propriety for the RMW-AFT model under the improper prior in (8) on the basis of the non-censored observations (using $n$ instead of $n_o$ for ease of notation).

Theorem 4. Let $T_1, \ldots, T_n$ be the survival times of $n$ independent individuals distributed as in (5). We observe survival times $t_1, \ldots, t_n$ and define $X = (x_1 \cdots x_n)'$. Assume that $n \ge k$, $X$ has rank $k$ (full rank) and that the prior for $(\beta, \gamma, \theta)$ is proportional to $\pi(\gamma, \theta)$, which is a proper density function for $(\gamma, \theta)$. If $t_i \ne 0$ for all $i = 1, \ldots, n$, the posterior distribution of $(\beta, \gamma, \theta)$ is proper.

As mentioned in Section 3.1, we use a proper prior for $(\gamma, \theta)$, so that Theorem 4 assures us of a proper posterior distribution if $X$ has full column rank and there are no zero observations of the survival time.

Posterior propriety can be precluded when conditioning on a particular sample of point observations which has zero Lebesgue measure (Fernández and Steel, 1998, 1999; Vallejos and Steel, 2014). However, point observations do not affect the posterior propriety for the RMW-AFT model. In this case, the posterior distribution is well-defined as long as there are no individuals for which $t_i = 0$. Whereas the latter is a reasonable assumption in most real applications, survival times can be recorded as zero due to rounding. In such a case, the point observation can be replaced by a set observation $(0, \epsilon)$, where $\epsilon$ stands for the minimum value that the recording mechanism detects (equivalent to a left-censored observation on $(0, \epsilon)$).

3.3 Implementation

We assume right-censoring, which is common in survival data, and conduct Bayesian inference for the RMW-AFT model under the prior presented in Section 3.1. Mixing parameters are handled through data augmentation and we implement an adaptive Metropolis-within-Gibbs sampler with Gaussian random walk proposals (see Section 3 in Roberts and Rosenthal, 2009). As the Weibull survival function has a known simple form, we do not use data augmentation for dealing with censored (and set) observations (as in Ibrahim et al., 2001; Kottas, 2006). The full conditionals for the Gibbs sampler are

$$\pi(\beta_j|\beta_{-j}, \gamma, \theta, \lambda, t, c) \propto e^{-\gamma\beta_j \sum_{i=1}^n c_i x_{ij}}\, e^{-\sum_{i=1}^n \lambda_i (t_i e^{-x_i'\beta})^{\gamma}}, \quad j = 1, \ldots, k,$$

$$\pi(\gamma|\beta, \theta, \lambda, t, c) \propto \gamma^{\sum_{i=1}^n c_i} \left[\prod_{i=1}^n t_i^{c_i}\right]^{\gamma} e^{-\gamma \sum_{i=1}^n c_i x_i'\beta}\, e^{-\sum_{i=1}^n \lambda_i (t_i e^{-x_i'\beta})^{\gamma}}\, \pi(\theta|\gamma)\pi(\gamma),$$

$$\pi(\theta|\beta, \gamma, \lambda, t, c) \propto \prod_{i=1}^n dP(\lambda_i|\theta)\, \pi(\theta|\gamma),$$

$$\pi(\lambda_i|\beta, \gamma, \theta, \lambda_{-i}, t, c) \propto \lambda_i^{c_i}\, e^{-\lambda_i (t_i e^{-x_i'\beta})^{\gamma}}\, dP(\lambda_i|\theta), \quad i = 1, \ldots, n,$$

where $\beta_{-j} = (\beta_1, \ldots, \beta_{j-1}, \beta_{j+1}, \ldots, \beta_k)$, $\lambda_{-i} = (\lambda_1, \ldots, \lambda_{i-1}, \lambda_{i+1}, \ldots, \lambda_n)$ and the $c_i$'s, $i = 1, \ldots, n$, are censoring indicators equal to 1 if the survival time for individual $i$ is observed and 0 if it is censored. For a general mixing distribution, Metropolis updates are required in all full conditionals. Nevertheless, Gibbs steps can be used for particular mixing distributions. For instance, the first four mixing distributions in Table 1, respectively, lead to gamma$(1 + c_i, 1 + (t_i e^{-x_i'\beta})^{\gamma})$, gamma$(\theta + c_i, \theta + (t_i e^{-x_i'\beta})^{\gamma})$, Generalized Inverse Gaussian$(-\theta + c_i, 2, 2(t_i e^{-x_i'\beta})^{\gamma})$ and Generalized Inverse Gaussian$(c_i - 1/2, 1, \theta^{-2} + 2(t_i e^{-x_i'\beta})^{\gamma})$ full conditionals for $\lambda_i$ (the Generalized Inverse Gaussian is parametrized as in Devroye, 1986, p. 478).

We observed poor mixing of the chain for the log-normal(0, $\theta$) mixture. This relates to a strong a priori correlation between $\gamma$ and $\theta$, which persists when not much can be learned about $\theta$ (as $\theta$ controls the tails of the distribution, this is especially problematic for small $n$ and/or a high proportion of censoring). We opt for a re-parametrization of this model from $(\theta, \gamma)$ to $(\theta^*, \gamma)$, where $\theta^* = \theta/\gamma^2$. As in the original parametrization, a prior for $\theta^*$ can be induced via a prior for $cv$ (where $[cv^*(\gamma, \theta^*)]^2$ equals $e^{\theta^*} - 1$). This new parametrization is more orthogonal and substantially improves the mixing of the chain. Further details on the implementation and the freely available R code can be found in the supplementary material.
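To make the Gibbs step for the mixing parameters concrete, the sketch below draws $\lambda_i$ from its gamma full conditional under gamma(θ, θ) mixing. It is not the authors' R implementation: the function name and argument layout are hypothetical, and the linear predictor $x_i'\beta$ is passed in directly as `linpred`.

```python
import math
import random

def draw_frailty_gamma_mixing(t, c, linpred, gamma, theta, rng):
    """Draw lambda_i from gamma(theta + c_i, theta + (t_i * exp(-x_i'beta)) ** gamma).

    t: recorded time, c: censoring indicator (1 observed, 0 right-censored),
    linpred: value of x_i'beta for this individual.
    """
    t_star = (t * math.exp(-linpred)) ** gamma
    shape = theta + c
    rate = theta + t_star
    return rng.gammavariate(shape, 1.0 / rate)  # gammavariate takes (shape, scale)

rng = random.Random(7)
draws = [draw_frailty_gamma_mixing(2.0, 1, 0.0, 1.0, 2.0, rng) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)   # full-conditional mean is (theta + c) / (theta + t_star)
```

In a full sampler this update would be applied to every individual in each sweep, alternating with Metropolis updates for the remaining parameters.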

3.4 Model comparison

The adequacy of a particular mixing distribution is evaluated using standard Bayesian model comparison criteria: Bayes factors (BF), deviance information criteria (DIC), conditional predictive ordinates (CPO) and pseudo Bayes factors (PsBF). The BF between two models is the ratio between the marginal likelihoods, which are computed using the method in Chib and Jeliazkov (2001). The DIC (Spiegelhalter et al., 2002) provides a fast tool for model comparison. It is defined as $\mathrm{DIC} \equiv E(D(\beta, \gamma, \theta, t)|t) + p_D$, where $D(\beta, \gamma, \theta, t) = -2\log(f(t|\beta, \gamma, \theta))$ (deviance function) and $p_D = E(D(\beta, \gamma, \theta, t)|t) - D(\hat\beta, \hat\gamma, \hat\theta, t)$ (effective number of parameters), with $\hat\beta$, $\hat\gamma$ and $\hat\theta$ being the posterior medians of $\beta$, $\gamma$ and $\theta$, respectively. DIC is computed using the marginal model (after integrating out the $\lambda_i$'s), and lower values suggest better models. CPO (Geisser and Eddy, 1979) is an indicator of predictive ability. For observation $i$, $\mathrm{CPO}_i$ is defined as

$$\mathrm{CPO}_i = f(t_i|t_{-i}) = \left[E\left(\frac{1}{f(t_i|\beta, \gamma, \theta)}\right)\right]^{-1}, \quad t_{-i} = (t_1, \ldots, t_{i-1}, t_{i+1}, \ldots, t_n), \tag{9}$$

where the expectation is w.r.t. $\pi(\beta, \gamma, \theta|t)$ and $f(\cdot|t_{-i})$ is the predictive density given $t_{-i}$. If $c_i = 0$, the survival function $S(\cdot|t_{-i})$ is used instead of $f(\cdot|t_{-i})$ (as in Banerjee et al., 2007). Larger CPO values are preferred. We also use the pseudo marginal likelihood $\mathrm{PsML} = \prod_{i=1}^n \mathrm{CPO}_i$ (Geisser and Eddy, 1979). Analogously to BF, PsBF is computed as the ratio between the PsML associated with two models.
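Given MCMC output, these criteria reduce to simple averages. The sketch below assumes that the likelihood contributions $f(t_i|\beta, \gamma, \theta)$ (or survival contributions for censored observations) have already been evaluated at each posterior draw; the function names and input layout are hypothetical.

```python
import math

def cpo(likelihood_draws):
    """Monte Carlo estimate of (9): harmonic mean of f(t_i | draw) over posterior draws."""
    return len(likelihood_draws) / sum(1.0 / f for f in likelihood_draws)

def log_psml(per_observation_draws):
    """log PsML = sum_i log CPO_i; one list of likelihood draws per observation."""
    return sum(math.log(cpo(draws)) for draws in per_observation_draws)

def dic(deviance_draws, deviance_at_estimate):
    """DIC = posterior mean deviance + p_D, with p_D = mean deviance - D(point estimate)."""
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_estimate
    return mean_deviance + p_d
```

The logarithm of a pseudo Bayes factor between two models is then the difference of their `log_psml` values, which avoids numerical underflow for large $n$.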

3.5 Detection of influential observations and outliers

Mixing tends to reduce the number of influential observations. Evidence of influential observations is measured using $\mathrm{KL}_i = \mathrm{KL}(\pi(\beta, \gamma, \theta|t), \pi(\beta, \gamma, \theta|t_{-i}))$, where $\mathrm{KL}(\cdot, \cdot)$ is the Kullback-Leibler divergence (Peng and Dey, 1995). As in McCulloch (1989), we use the calibration index $p_i = 0.5\left[1 + \sqrt{1 - \exp\{-2\mathrm{KL}_i\}}\right]$ ($p_i \in [0.5, 1]$). A large value of $p_i$ suggests that observation $i$ is influential.

Intuitively, outliers are linked to unusual values of the $\lambda_i$'s (West, 1984). As in Vallejos and Steel (2014), outliers are detected based on the posterior distribution of the $\lambda_i$'s. For each observation $i$, we compute the BF for the model $M_0: \Lambda_i = \lambda_{ref}$ versus $M_1: \Lambda_i \ne \lambda_{ref}$ (all other $\Lambda_j$, $j \ne i$ free), where $\lambda_{ref}$ is a reference value (specific to the mixing distribution). The BF in favour of $M_0$ versus $M_1$ is computed using a generalized Savage-Dickey density ratio (Verdinelli and Wasserman, 1995), which is defined as

$$BF_{01}^{(i)} = \pi(\lambda_i|t, c)\, E\left(\frac{1}{dP(\lambda_i|\theta)}\right)\Bigg|_{\lambda_i = \lambda_{ref}} = E\left(\frac{\pi(t_i|\beta, \gamma, \theta, \lambda_i, c_i)\, dP(\lambda_i|\theta)}{\pi(t_i|\beta, \gamma, \theta, c_i)}\right)\Bigg|_{\lambda_i = \lambda_{ref}} E\left(\frac{1}{dP(\lambda_i|\theta)}\right)\Bigg|_{\lambda_i = \lambda_{ref}}, \tag{10}$$

where the expectations are w.r.t. $\pi(\beta, \gamma, \theta|t, c)$ and $\pi(\theta|\Lambda_i = \lambda_{ref}, t, c)$, respectively. The latter is computationally intensive: for each $BF_{01}^{(i)}$, we need to run a sub-chain fixing $\lambda_i = \lambda_{ref}$. However, if $\theta$ does not appear in the model, the sub-chains are not required as (10) reduces to the usual Savage-Dickey density ratio

$$BF_{01}^{(i)} = \frac{\pi(\lambda_i|t, c)}{dP(\lambda_i)}\Bigg|_{\lambda_i = \lambda_{ref}} = E\left(\frac{\pi(t_i|\beta, \gamma, \lambda_i, c_i)}{\pi(t_i|\beta, \gamma, c_i)}\right)\Bigg|_{\lambda_i = \lambda_{ref}}, \tag{11}$$

where the expectation is with respect to $\pi(\beta, \gamma|t, c)$. Here, $\pi(t_i|\beta, \gamma, \lambda_i, c_i)$ and $\pi(t_i|\beta, \gamma, c_i)$ represent conditional density (or survival if $c_i = 0$) functions of $t_i$ when conditioning or not on $\lambda_i$, respectively.

Figure 4: $2 \times$ log-Bayes factor for outlier detection as a function of $|z|$ in RMW-AFT models. Panels correspond to Gamma(2.5, 2.5), Inv-gamma(2.5, 1), Inv-Gauss(1, 1) and Log-normal(0, 1) mixing, with separate curves for observed and censored observations. The horizontal line is the threshold above which observations will be considered outliers (Kass and Raftery, 1995).

This methodology relies on the choice of a reasonable $\lambda_{ref}$. Vallejos and Steel (2014) used $\lambda_{ref} = E(\Lambda_i|\theta)$ arguing that, in the absence of unobserved heterogeneity, the posterior density of the frailty terms should behave as a Dirac function with a spike on $E(\Lambda_i|\theta)$. In our context, this is always well-defined because $E(\Lambda_i|\theta)$ is required to be finite for the identification of $\gamma$ (see Section 2.1). Table 1 displays $E(\Lambda_i|\theta)$ for the examples in this article. When $\theta$ is unknown, we replace it by its posterior median (based on the MCMC sample). Unlike in Vallejos and Steel (2014), empirical evidence does not support the latter choice for the censored observations. Only a lower bound of the survival time is known for right-censored observations, which is highly informative for the $\lambda_i$'s (as they are linked to the scale of the underlying distribution). Hence, the posterior distributions of the $\lambda_i$'s linked to right-censored observations are driven towards lower values. We propose to keep $\lambda^o_{ref} = E(\Lambda_i|\theta)$ as the reference value for non-censored observations and to adjust it by the effect of censoring for censored observations as follows:

$$\lambda^c_{ref} = C_i(\beta, \gamma, \theta)\, \lambda^o_{ref}, \quad \text{with } C_i(\beta, \gamma, \theta) = \frac{E(\Lambda_i|t_i, c_i = 0, \beta, \gamma, \theta)}{E(\Lambda_i|t_i, c_i = 1, \beta, \gamma, \theta)}.$$

For exponential mixing Ci ( , , ✓) = 1/2 and Ci ( , , ✓) = ✓/(✓ + 1) for gamma mixing (see⇣ the condi⌘ 0 tionals in Subsection 3.3). In these cases Ci ( , , ✓) does not depend on i, or . Let t⇤i = ti e xi and Kp (·) is the modified Bessel function. If ⇤i |✓ ⇠ inv-gamma(✓, 1) or ⇤i |✓ ⇠ inv-Gaussian(✓, 1), p ⇤ p 2 K1/2 2ti + ✓ 2 K 2 ✓+1 2 t⇤i p p p ⇤ p ⇤ Ci ( , , ✓) = or Ci ( , , ✓) = , K ✓+2 2 t⇤i K ✓ 2 t⇤i K 1/2 2ti + ✓ 2 K3/2 2ti + ✓ 2 respectively. For the log-normal mixing distribution Ci ( , , ✓) has no closed form but can be estimated via numerical integration. The performance of this choice has been validated using simulated datasets. (i)
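The closed forms for exponential and gamma mixing can be checked numerically. The following Python sketch is an illustration only and is not part of the authors' R code. It assumes that, given $t_i^*$, the full conditional of $\Lambda_i$ under gamma$(\theta, \theta)$ mixing is proportional to $\lambda^{c_i} e^{-\lambda t_i^*} \lambda^{\theta - 1} e^{-\theta\lambda}$; the function names, the integration range and the grid size are hypothetical choices, and the simple midpoint rule is only adequate for $\theta \geq 1$.

```python
import math

def conditional_mean(t_star, theta, observed, upper=60.0, steps=60_000):
    """E(Lambda_i | t_i, c_i) under gamma(theta, theta) mixing, by the midpoint rule.

    Assumed kernel: lam**c * exp(-lam * t_star) * lam**(theta - 1) * exp(-theta * lam),
    with c = 1 for an observed time and c = 0 for a right-censored one.
    """
    c = 1.0 if observed else 0.0
    h = upper / steps
    num = den = 0.0
    for j in range(steps):
        lam = (j + 0.5) * h  # midpoint of the j-th sub-interval
        w = lam ** (theta - 1.0 + c) * math.exp(-lam * (theta + t_star))
        num += lam * w
        den += w
    return num / den

def correction_factor(t_star, theta):
    """C_i: censored conditional mean divided by the non-censored one."""
    return conditional_mean(t_star, theta, False) / conditional_mean(t_star, theta, True)

# theta = 1 corresponds to exponential(1) mixing; compare with 1/2 and theta/(theta + 1).
for t_star, theta in [(0.3, 1.0), (2.0, 2.5)]:
    print(t_star, theta, correction_factor(t_star, theta), theta / (theta + 1.0))
```

Under these assumptions the numerical ratio agrees with $\theta/(\theta + 1)$ whatever the value of $t_i^*$, which is the sense in which $C_i$ does not depend on $i$, $\beta$ or $\gamma$ for these two mixing distributions.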

To illustrate our outlier detection method, Figure 4 displays $BF_{01}^{(i)}$ as a function of a standardized observation $z_i$. Following the structure in (6), this is defined in terms of $\log(t_i)$ minus its mean, divided by its standard deviation (given $\beta$, $\gamma$ and $\theta$). Let $\psi(\cdot)$ be the digamma function. As $\log(T_0) \sim$ Gumbel$(0, \gamma^{-1})$, we have

$$z_i = \gamma \left[ \frac{\log(t_i) - x_i'\beta + \gamma^{-1}\left(E_{\Lambda_i}(\log(\Lambda_i) \mid \theta) + \psi(1)\right)}{\sqrt{Var_{\Lambda_i}(\log(\Lambda_i) \mid \theta) + \pi^2/6}} \right].$$

In terms of $z_i$, $BF_{01}^{(i)}$ does not depend on $\beta$ or $\gamma$ (the full conditional of $\Lambda_i$ depends on $t_i$ only through $t_i^*$). Naturally, outliers relate to large values of $|z_i|$. The threshold on $|z_i|$ at which an observation is detected as an outlier depends on $\theta$. For example, the RMW model with gamma$(\theta, \theta)$ mixing tends to the Weibull model as $\theta \to \infty$ and thus, the model with larger $\theta$ requires larger $|z_i|$ values to distinguish it from the Weibull. As shown in Figure 4, the correction factor $C_i(\beta, \gamma, \theta)$ induces a similar outlier detection threshold (in terms of $|z_i|$) for censored and non-censored observations.
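For exponential(1) mixing, where no $\theta$ is present, the ratio inside the expectation in (11) is available in closed form, so the Bayes factor can be estimated by averaging over posterior draws. The Python sketch below is a hedged illustration and not the authors' R implementation. It assumes that each draw of $(\beta, \gamma)$ has already been converted to $t_i^* = (t_i e^{-x_i'\beta})^{\gamma}$, and it uses the ratios $\lambda e^{-\lambda t^*}(1 + t^*)^2$ (observed) and $e^{-\lambda t^*}(1 + t^*)$ (censored), which follow from integrating the Weibull density or survival function against an exponential(1) mixing density. The function names are hypothetical.

```python
import math

def log_bf01_draw(t_star, lam_ref, observed):
    """Log of the ratio in (11) for one posterior draw, exponential(1) mixing.

    Observed:  lam_ref * exp(-lam_ref * t_star) * (1 + t_star) ** 2
    Censored:  exp(-lam_ref * t_star) * (1 + t_star)
    """
    power = 2.0 if observed else 1.0
    log_lam = math.log(lam_ref) if observed else 0.0
    return log_lam - lam_ref * t_star + power * math.log1p(t_star)

def bf01(t_star_draws, observed):
    """Monte Carlo estimate of BF_01 for one observation.

    Reference values: 1 for observed times and 1/2 for censored ones,
    i.e. E(Lambda_i) = 1 multiplied by the correction factor C_i = 1/2.
    """
    lam_ref = 1.0 if observed else 0.5
    ratios = [math.exp(log_bf01_draw(ts, lam_ref, observed)) for ts in t_star_draws]
    return sum(ratios) / len(ratios)

def two_log_bf10(t_star_draws, observed):
    """2 * log Bayes factor in favour of M1 (Lambda_i differs from the reference)."""
    return -2.0 * math.log(bf01(t_star_draws, observed))
```

A call such as `two_log_bf10(draws, observed=True)`, with `draws` a list of $t_i^*$ values computed from the MCMC output, gives a quantity on the scale of the vertical axis of Figure 4. Because each observation is processed separately, the computation parallelizes trivially over $i$.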

4 Applications

4.1 Autologous and Allogeneic Bone Marrow Transplant

This dataset (Klein and Moeschberger, 1997) contains post-surgery information about 101 advanced acute myelogenous leukemia patients. The endpoint of the study is the disease-free survival time, i.e. the time until relapse or death (in months). The disease-free survival time was observed for 50 patients while the others are right-censored. In the trial, 51 patients received an autologous bone marrow transplant. This replaces the patient's marrow with their own marrow after the application of high doses of chemotherapy. The median of the time to follow-up (relapse, death or censoring) is equal to 13.06 months (first and third quartiles are 6.07 and 18.42 months, respectively). The rest of the patients received an allogeneic bone marrow transplant, in which their marrow was replaced by marrow extracted from a sibling. The median time to follow-up is 11.81 months (first and third quartiles are 3.61 and 31.88 months, respectively) for this group. Only the type of treatment is available as a covariate and thus a substantial amount of unobserved heterogeneity is expected.

The standard graphical check of $\log(-\log(S(t)))$ versus $t$ (not reported) suggests that the PH assumption does not hold. The data are first analyzed using exponential and Weibull AFT models (which have equivalent PH representations). For example, if $\gamma \sim$ gamma(4,1), the BF in favour of the Weibull model with free $\gamma$ (w.r.t. the exponential one) is 4.39, suggesting $\gamma \neq 1$. In line with this, the posterior median of $\gamma$ is 0.69 (95% HPD: (0.53,0.85)). In addition, RME and RMW-AFT models with the mixing distributions in Table 1 are fitted using the priors proposed in Subsection 3.1. In contrast to the Weibull case, there is evidence in favour of $\gamma = 1$ in all the RMW-AFT regressions. For example, for the exponential(1) mixing and $\gamma \sim$ gamma(4,1), the BF in favour of the RME specification ($\gamma = 1$) vs. RMW is 22.01. In this case, the posterior median of $\gamma$ is 0.86 (95% HPD: (0.67,1.07)). These opposite conclusions are linked to the fact that the Weibull model tends to underestimate $\gamma$ in an attempt to capture the over-dispersion of the data (the cv of the Weibull is a decreasing function of $\gamma$). Based on this evidence, RME-AFT models are used for these data.

We adopt E(cv) equal to 1.25, 1.5, 2, 5 and 10 (if there is no $\theta$ in the model, all these priors coincide). Large values of E(cv) are associated with stronger prior beliefs about the existence of unobserved heterogeneity (see (3)). Nevertheless, as explained in Section 3.1, if E(cv) is larger than $\sqrt{3}$, the model generated by inverse gamma mixing is not compatible with the prior beliefs. The same occurs for inverse Gaussian mixing when E(cv) $> \sqrt{5}$. The algorithm in Section 3.3 is implemented. For all models, the total number of iterations is 600,000. In the following, results are presented on the basis of 9,000 draws (after a burn-in of 25% of the initial iterations and thinning). Various convergence criteria strongly suggest convergence for the chains. The presence of unobserved heterogeneity is supported by the data. Figure 5 compares the fitted models

Figure 5: Autologous and allogeneic bone marrow transplant dataset. Model comparison in terms of Bayes factors (with respect to the exponential model) and pseudo Bayes factors for the models presented in Table 1 using $\gamma \sim$ gamma(4,1). Unfilled and filled characters denote truncated exponential and Pareto priors for cv, respectively. Panels: E(cv) = 1.25, 1.5, 2, 5 and 10 (axes: log-Bayes factors and log-pseudo Bayes factors). Models shown: Exponential, Weibull, RME-exp, RME-gam, RME-inv. gam, RME-inv. Gauss and RME-log normal.

Figure 6: Autologous and allogeneic bone marrow transplant dataset. Comparison between the prior (continuous line) and the posterior distribution (histogram) of $R_{cv}(1, \theta)$ (which equals cv$(1, \theta)$) using a log-normal mixing distribution. Panels: truncated exponential prior with E(cv) = 5 and E(cv) = 10; Pareto prior with E(cv) = 5 and E(cv) = 10 (axes: $R_{cv}$ and frequency/density).

in terms of BF and PsBF (w.r.t. the exponential model). For all priors considered, both criteria support all the mixture models in Table 1 over the exponential model. The Weibull model (which itself can be viewed as a mixture of exponentials provided $\gamma < 1$, see Jewell, 1982) is also beaten in terms of BF, which are, of course, dependent on the prior on $\gamma$: for example, a gamma(1,1) prior leads to more support for the Weibull model while a gamma(0.001,0.001) prior leads to slightly less support than for the exponential. The PsBF is a predictive criterion and is virtually unaffected by these changes in prior. The similarity of both criteria for the mixture models indicates that the priors are well-matched. DIC (not reported) suggests a similar ordering between models.

Despite its simplicity, the exponential mixing receives most support overall. It is easy to implement (the full conditionals of the $\lambda_i$'s have a known form) and does not require prior elicitation for $\theta$. The log-normal mixing distribution has slightly more support for large E(cv), but rather less for small E(cv). Interestingly, the popular gamma mixing is the least preferred of all mixing distributions.

Despite the small sample size, there is learning about $R_{cv}$ (which equals cv here). Even though the truncated exponential and Pareto priors are concentrated around small values of $R_{cv}$, its posterior distribution is shifted to the right (see Figure 6). This suggests the need for a mixture and is consistent with strong heterogeneity in the data that leads to support for the exponential mixing model (infinite cv). Whereas the choice of a prior affects inference on $R_{cv}$, the posterior distribution of $\beta$ (usually the parameter of interest) is more robust. The effect of the mixture models on $\beta$ is illustrated in Figure 7. All models suggest that there is no substantive difference between the median survival times under both treatments. However, for all considered mixing distributions and priors, the effect of the treatment ($\beta_1$) is less pronounced than for the exponential and Weibull models without mixing. This discrepancy is among the largest when using the exponential mixing.

Figure 7: Posterior of the regression coefficients for the bone marrow transplant data using the RME-AFT model in (5) ($\gamma = 1$). The prior is (7) with (if appropriate) a Trunc-Exp and a Pareto prior for cv. The first two lines (“None”) correspond respectively to the exponential and Weibull models without mixing. For models with $\theta$, E(cv) is 1.25, 1.5, 2, 5 and 10 from left to right (only 1.25 and 1.5 for inverse gamma; 1.25, 1.5 and 2 for inverse Gaussian mixing). For the Weibull model $\gamma \sim$ gamma(4,1). Vertical lines represent 95% HPD intervals and dots the posterior medians. $\beta_0$: Intercept, $\beta_1$: Treatment (autologous). Horizontal axis (Mixing-Prior): None, Exp, Gamma-Trunc.Exp., Gam-Pareto, IGam-T.E., IGam-P, IGauss-T.E., I.Gauss-P., LN-Trunc.Exp, LN-Pareto.

No influential observations are detected for any model considered, including the exponential and Weibull models without mixture (all $p_i$'s are below 0.9). Figure 8 (a) illustrates the posterior behaviour of the $\lambda_i$'s for the RME model with exponential mixing. The outlier detection mechanism proposed in Section 3.5 does not detect outlying observations (see Figure 8 (b)). So no single observation is identified as an outlier, yet

Figure 8: Autologous and allogeneic bone marrow transplant dataset. (a) 95% HPD interval of the $\lambda_i$'s for the exponential mixing distribution. Horizontal lines at $\lambda_{ref}^{o} = 1$ and $\lambda_{ref}^{c} = 1/2$. Circles located at posterior medians (filled circles for censored observations). Observations are grouped by treatment (allogeneic, autologous) and displayed in ascending order of the $t_i$'s. (b) Bayes Factors in favour of the model $M_1: \Lambda_i \neq \lambda_{ref}$ versus $M_0: \Lambda_i = \lambda_{ref}$, plotted by patient.

there is ample evidence in favour of the exponential mixture model on the basis of the entire sample.

4.2 Cerebral palsy

This dataset is a subset of the one in Hutton et al. (1994) and Kwong and Hutton (2003) and contains records of 1,549 children affected by cerebral palsy and born during the period 1966-1984 in the administrative area of the Mersey Region Health Authority. See Hutton et al. (1994) for more information. The times to follow-up (survival or censoring) are recorded in years since birth. Following Kwong and Hutton (2003), we consider the number of severe impairments (ambulation, manual dexterity and mental ability) and the birth weight (in kg) as predictors for the time to death. The percentage of children with 0, 1, 2 and 3 severe impairments is equal to 63%, 15%, 5% and 17%, respectively. The median times to follow-up for these four categories (first and third quartiles in parentheses) are 30.88 (26.12,38.42), 32.44 (27.09,38.96), 31.22 (23.96,38.22) and 17.91 (8.97,27.99) years, respectively. Regarding birth weight, 14%, 26% and 60% of the children were born with very low weight (less than 1.5 kg), low weight (1.5-2.5 kg) and normal weight (more than 2.5 kg), respectively. The median times to follow-up for these groups are 27.37 (23.69,31.14), 29.85 (24.84,36.87) and 30.83 (24.72,38.48) years. The deaths of 242 patients were observed by the end of the observation period. The survival times of the remaining 1,307 patients are right-censored, so there is a very large proportion of censoring (84.4%) in this dataset.

The data are analyzed using the RMW-AFT model defined in (5) with the mixing distributions in Table 1. For comparison, a Weibull regression is also fitted. For models without $\theta$ (i.e. Weibull and RMW with exponential(1) mixing), the total number of iterations is 600,000. We doubled the iterations for the remaining models, for which the chains mix less rapidly. In all cases, results are presented based on 9,000 draws (after a burn-in of 25% and thinning). Figure 9 summarizes the marginal posterior inference. Throughout, results

Figure 9: Posterior results for the cerebral palsy data using different distributions in the RMW-AFT family in (5). The prior is (8) with a gamma prior for $\gamma$ and (if appropriate) a Trunc-Exp(a) and a Pareto(b) prior for cv. Vertical lines represent 95% HPD intervals and dots are the posterior medians. From left to right, we use a gamma(4,1), gamma(1,1) and gamma(0.01,0.01) prior for $\gamma$. Values of E(cv) (1.5 and 5) are displayed in the top panel. $\beta_0$: intercept, $\beta_1$: no impairments, $\beta_2$: 1 impairment, $\beta_3$: 2 impairments, $\beta_4$: birth weight. Bottom panel is for $\gamma$. Horizontal axis (Mixing-Prior): None, Exp, Gamma-T.Exp., Gam-Pareto, IGam-T.Exp., IGam-Pareto, IGauss-T.Exp., I.Gauss-Pareto, LN-T.Exp., LN-Pareto.

are fairly insensitive to the choice of prior for $\gamma$ (three different gamma priors), the form of the prior for $\theta$ (truncated exponential or Pareto) and the chosen mean of cv (1.5 or 5). The main differences relate to whether mixing is used or not. The bottom panel shows that, in all cases, $\gamma$ is estimated to be larger than 1. In line with the results in Kwong and Hutton (2003), this suggests a non-monotone hazard shape. As in the previous application, the Weibull model tends to underestimate $\gamma$ in order to accommodate the variability in the data. In the AFT specification we use, $e^{\beta_j}$ can be interpreted as a proportional change of the median survival time, regardless of the mixture. Figure 9 shows that the mixture models estimate similar covariate effects. The effect of no impairments ($\beta_1$) is less strong than in the Weibull model, where the median survival time is increased by a median factor of approximately $e^{3.3} \approx 27$ for children with no impairments (w.r.t. those with 3 impairments). Under the mixture models, the same factor is roughly $e^{3.1} \approx 22$.
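The AFT reading of the coefficients can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only: the coefficient values 3.3 and 3.1 are the approximate posterior medians quoted in the text, while the function and variable names are hypothetical.

```python
import math

def median_time_ratio(beta_j, delta=1.0):
    """Multiplicative change in median survival time when covariate j changes by `delta`."""
    return math.exp(beta_j * delta)

# No impairments versus 3 impairments (the baseline category).
weibull_factor = median_time_ratio(3.3)  # Weibull model without mixing
mixture_factor = median_time_ratio(3.1)  # typical value under the mixture models

print(round(weibull_factor), round(mixture_factor))
```

Because the marginal model keeps the AFT structure, the same function applies whichever mixing distribution is used. Only the posterior draws of $\beta_j$ change between models.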

Figure 10: Cerebral palsy dataset. Model comparison in terms of Bayes Factors and Pseudo Bayes Factors (with respect to the Weibull model) for the mixing distributions presented in Table 1. Unfilled and filled characters denote a truncated exponential and Pareto prior for cv, respectively. Upper panels use E(cv) = 1.5. Lower panels use E(cv) = 5. Panel columns: $\gamma \sim$ Gamma(4,1), $\gamma \sim$ Gamma(1,1) and $\gamma \sim$ Gamma(0.001,0.001). Legend (displayed in the last panel): Weibull, RMWEXP, RMWGAM, RMWIGAM, RMWIGAUSS, RMWLN.

Figure 10 shows that the mixture models provide a better fit for the data and lead to better predictions. In fact, for all priors considered, all the mixture models perform better in terms of BF and PsBF (and thus PsML). Again, both criteria are very close. In addition, DIC (not reported) ranks the models in the same order. This strongly suggests the existence of unobserved heterogeneity. This evidence is also supported by Table 4, where the posterior distribution of $R_{cv}$ is concentrated away from one (results with a Pareto prior on cv are very similar). However, since $R_{cv}$ measures the ratio of cv in RMW models versus the Weibull model assuming a common value of $\gamma$, whereas $\gamma$ is estimated to be smaller for the Weibull model, $R_{cv}$ somewhat overestimates the actual ratio of cv's between the models. For example,

Table 4: Cerebral palsy dataset. Posterior medians and 95% HPD intervals of $R_{cv}(\gamma, \theta)$ (as in (4)) for RMW-AFT models under a gamma($d_1$, $d_2$) prior for $\gamma$ and a truncated exponential prior for cv.

E(cv)  Mixing            d1 = 4, d2 = 1        d1 = d2 = 1           d1 = d2 = 0.01
                         Med.  95% HPD         Med.  95% HPD         Med.  95% HPD
1.5    Gam(θ, θ)         2.41  [1.13, 4.36]    2.34  [1.17, 4.16]    2.35  [1.20, 4.15]
1.5    Inv-gam(θ, 1)     1.41  [1.23, 1.55]    1.40  [1.18, 1.55]    1.41  [1.24, 1.55]
1.5    Inv-Gauss(θ, 1)   1.66  [1.43, 1.83]    1.63  [1.36, 1.84]    1.64  [1.37, 1.82]
1.5    Log-norm(0, θ)    2.30  [1.53, 2.99]    2.21  [1.41, 3.10]    2.17  [1.40, 2.85]
5      Gam(θ, θ)         6.98  [1.54,19.90]    6.76  [1.59,20.00]    6.77  [1.56,19.97]
5      Inv-gam(θ, 1)     1.43  [1.25, 1.55]    1.41  [1.20, 1.55]    1.41  [1.22, 1.55]
5      Inv-Gauss(θ, 1)   1.68  [1.49, 1.85]    1.65  [1.43, 1.83]    1.66  [1.45, 1.85]
5      Log-norm(0, θ)    2.45  [1.76, 3.21]    2.37  [1.63, 3.22]    2.42  [1.64, 3.18]

Figure 11: Cerebral palsy dataset. Bayes Factors (by patient) for the RMW-AFT model with the exponential mixture in favour of the hypothesis $H_1: \lambda_i \neq \lambda_{ref}$, with $\lambda_{ref}^{o} = 1$ and $\lambda_{ref}^{c} = 1/2$ under a gamma(4,1) prior for $\gamma$.

for gamma$(\theta, \theta)$ mixing with a truncated exponential prior for cv, $d_1 = 4$ and $d_2 = 1$, the actual ratio is estimated as 2.08 (while the median of $R_{cv}$ is 2.41). Overall, the exponential mixing provides the best results in terms of BF, PsBF and DIC. This model is also the simplest to elicit (there is no $\theta$) and is computationally attractive. Figure 11 presents the results of the outlier detection procedure of Subsection 3.5 for the exponential mixture model; no outlying observations are detected. Again, we have strong evidence of unobserved heterogeneity in the sample, which provides strong support for mixture models, but there are no particular single observations that could be considered clear outliers.
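The reason why a smaller estimate of $\gamma$ under the Weibull model matters for this comparison can be seen from the Weibull coefficient of variation itself. The Python sketch below is an illustration, not the authors' code. It uses the standard Weibull moments $E(T^r) \propto \Gamma(1 + r/\gamma)$, so that the cv depends on the shape only; the function name is hypothetical.

```python
import math

def weibull_cv(shape):
    """Coefficient of variation of a Weibull distribution; it does not depend on the rate."""
    m1 = math.gamma(1.0 + 1.0 / shape)  # first moment, up to the scale factor
    m2 = math.gamma(1.0 + 2.0 / shape)  # second moment, up to the squared scale factor
    return math.sqrt(m2 / m1 ** 2 - 1.0)

# The cv decreases as the shape grows and equals 1 in the exponential case.
for shape in (0.69, 0.86, 1.0, 1.5):
    print(shape, weibull_cv(shape))
```

Since the cv is decreasing in the shape parameter, a Weibull fit with a lower $\gamma$ already carries a larger cv than a Weibull evaluated at the $\gamma$ estimated under the mixture model, which is consistent with $R_{cv}$ overstating the actual ratio.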

5 Concluding remarks

Mixtures of life distributions are proposed in order to account for unobserved heterogeneity in survival models. In particular, the family generated by mixtures of Weibull distributions with random rate parameter is explored in detail (and its special case of rate mixtures of exponentials). These mixtures are shown to induce a larger coefficient of variation than the original Weibull distribution and more flexible hazard functions. Instead of the usual mixed PH scheme adopted in this context, covariates are added via an AFT specification. As an advantage, the marginal model retains the AFT structure and the interpretation of the covariate effects is invariant to the mixing distribution. This allows comparison between estimates based on different RMW-AFT models (with any mixing) and those produced by any other AFT model (in particular the Weibull AFT model).

The mixing representation facilitates the choice of a prior distribution. We opt for a prior that is inspired by the Jeffreys rule, with a product structure comprising an (improper) flat prior for the regression coefficients and a proper component for the remaining parameters. In view of the clear interpretation of the covariate effects, this product structure seems a reasonable assumption. The proper part of the prior is elicited via the coefficient of variation of the survival times. Priors for different mixing distributions are matched by a common prior on this coefficient of variation, so that models can be meaningfully compared through Bayes factors. We derive simple (and easily satisfied) conditions for posterior propriety. In addition, we show that adding censored observations cannot destroy the existence of the posterior distribution.

Mixture models diminish the effect that anomalous observations have on posterior inference. Nonetheless, it might be of interest to identify any outlying observations driving the unobserved heterogeneity. An outlier detection method is designed, which exploits the mixing structure and compares individual frailties with a reference level. The comparison is formalized by means of Bayes factors. Choosing a reference value is crucial. A general recommendation is presented, including a correction factor for censored observations.

Both analyzed datasets provide strong evidence for unobserved heterogeneity, shown not to be a consequence of a small number of specific outliers. Mixture models are supported by the data in terms of Bayes factors and predictive performance. In particular, the use of an exponential mixture distribution leads to the overall best results in both applications. As this is also a model which is simple to elicit (there is no parameter in the mixing distribution) and quite easy to deal with numerically, we would recommend that model (for which the coefficient of variation for the survival times does not exist) as a convenient starting point for practitioners. Inference on the regression coefficients is relatively robust to the prior assumptions and even to the choice of mixing distribution. The main differences are between mixture models and Weibull models.

Appendix: Proofs

Theorem 1. For (i): This follows the proof in Honoré (1990), which assumed $\alpha = 1$. Using l'Hôpital's rule twice, it can be shown that $\lim_{t \to 0} \frac{\log(-\log(S(t \mid \alpha, \gamma, \theta)))}{\log(t)} = \gamma$ if and only if $E(\Lambda \mid \theta) < \infty$. For (ii): $S(t \mid \alpha, \gamma, \theta)$ is the Laplace transform of the density of $\alpha\Lambda$ evaluated at $t^{\gamma}$, which is unique.

Theorem 2. $E(T_i^r)$ exists if and only if $\int_0^{\infty} t_i^r f(t_i \mid \alpha, \gamma, \theta)\, dt_i < \infty$. Using Fubini's theorem, the result is direct after using the formula for the $r$-th moment of the Weibull distribution.

Corollary 1. Direct application of the expression for $E(T_i^r)$ provided in Theorem 2.

Theorem 3. This proof consists in taking the expectation of minus the second derivatives of the log-likelihood for each observation and computing the FIM on the basis of the whole sample as the sum of the FIM for single observations. Define $Y_i = e^{-x_i'\beta} T_i$. The functions $k_1(\theta)$, $k_2(\theta)$ and $k_3(\theta)$ are

$k_1(\theta)$

=

nETi

" ⇥R

L

e ⇥R

i Yi

i

L

i

e

(1

dP ( i |✓) ⇤2 i Yi dP ( i |✓) i Yi )

⇤2 #

20

nETi

"R

L

i

e

i Yi

R

L

3 i Yi + Yi2 dP ( i |✓) i Yi dP ( ie i |✓)

1

#

k2 (✓)

=

nETi

"R

i

L

e

2R

nETi 4 k3 (✓)

=

2 hR

6 L nETi 4 hR

i

L

i Yi

(1

e

e

i

L

e

e i

x0i

e

e

dP ( i |✓)

i Yi )

R

x0 i

R

e ⇥R ⇤2 e i Yi dP ( i |✓) L i ⇣ ⌘ x0i x0i i Ti 1 T e i i

L

e

ie

x0 i

i Ti

L

d d✓

i2 3 dP ( i |✓) 7 i2 5 i Ti dP ( i |✓)

i Ti

i

i Yi

d d✓

d d✓

dP ( i |✓)

dP ( i |✓)

d d✓

nETi

"R

#

dP ( i |✓)

i

L

R

L

e i

3 5 e

e

x0i

e

i Ti x0 i

d2 d✓ 2

i Ti

dP ( i |✓)

dP ( i |✓)

#

.

These functions do not depend on $\beta$ because all the terms inside the expectations depend on $T_i$ and $\beta$ only through $Y_i$, and the distribution of $Y_i$ does not depend on $\beta$ nor $i$.

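Corollary 2. The proof is direct from Theorem 3 using the determinant of the FIM and its sub-matrices.

Proposition 1. The likelihood contribution of censored observations is $1 - F_T(t_i)$, $F_T(t_i)$ or $F_T(t_{i2}) - F_T(t_{i1})$ for right, left or interval censoring, respectively. These quantities are upper bounded by 1. So, under independence, the likelihood function is bounded by the likelihood of $t_o$. Hence, the marginal likelihood of the complete data is bounded by the marginal likelihood of these $n_o$ observations and the result follows.

Theorem 4. The posterior distribution of $(\beta, \gamma, \theta)$ given the data is proper if and only if

$$\int_{\mathbb{R}^k} \int_0^{\infty} \int_{\Theta} \int_{\mathbb{R}_+^n} \gamma^n \prod_{i=1}^n t_i^{\gamma - 1}\, e^{-\gamma \sum_{i=1}^n x_i'\beta} \prod_{i=1}^n \lambda_i\, e^{-\sum_{i=1}^n e^{-\gamma x_i'\beta} \lambda_i t_i^{\gamma}}\, \pi(\gamma, \theta) \prod_{i=1}^n dP(\lambda_i \mid \theta)\, d\theta\, d\gamma\, d\beta < \infty. \qquad (12)$$

We use Fubini's theorem to change the order of the integrals. We integrate with respect to $\beta$, using a similar argument as in Kim and Ibrahim (2000). For $t_i > 0$ and any value of $(\beta, \gamma, \lambda_i) \in \mathbb{R}^k \times \mathbb{R}_+$, $f_{T_i}(t_i \mid \beta, \gamma, \lambda_i)$ is bounded by a finite constant. Therefore, the integral in (12) has an upper bound proportional to

$$\int_{\mathbb{R}^k} \int_0^{\infty} \int_{\Theta} \int_{\mathbb{R}_+^k} \gamma^k \prod_{i \in I} t_i^{\gamma - 1}\, e^{-\gamma \sum_{i \in I} x_i'\beta} \prod_{i \in I} \lambda_i\, e^{-\sum_{i \in I} e^{-\gamma x_i'\beta} \lambda_i t_i^{\gamma}}\, \pi(\gamma, \theta) \prod_{i \in I} dP(\lambda_i \mid \theta)\, d\theta\, d\gamma\, d\beta, \qquad (13)$$

for any $I = \{(i_1, \ldots, i_k) : 1 \leq i_1 < \cdots < i_k \leq n\}$. Define the transformation $U = g(\beta) = X^* \beta$, where $X^*$ is a $k \times k$ matrix containing the $i_1, \ldots, i_k$ rows of $X$. The rank of $X^*$ is $k$ since $X$ is of full rank. Therefore, $g(\cdot)$ is bijective with Jacobian $\det((X^*)^{-1})$ and (13) is proportional to

$$\int_0^{\infty} \int_{\Theta} \int_{\mathbb{R}_+^k} \gamma^k \prod_{i=1}^k t_i^{\gamma - 1} \prod_{i=1}^k \lambda_i \left[ \prod_{i=1}^k \int_{-\infty}^{\infty} e^{-\gamma u_i}\, e^{-e^{-\gamma u_i} \lambda_i t_i^{\gamma}}\, du_i \right] \pi(\gamma, \theta) \prod_{i=1}^k dP(\lambda_i \mid \theta)\, d\theta\, d\gamma.$$

Using $w_i = e^{-\gamma u_i}$, the latter integral simplifies to $\prod_{i=1}^k t_i^{-1} \int_0^{\infty} \int_{\Theta} \pi(\gamma, \theta)\, d\theta\, d\gamma$. Therefore, if $t_i \neq 0$ for all $i \in I$ and $\pi(\gamma, \theta)$ is a proper prior for $(\gamma, \theta)$, then the posterior distribution of $(\beta, \gamma, \theta)$ given the data exists.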
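The substitution step at the end of the proof of Theorem 4 can be written out as follows. This is a sketch that assumes the inner integrand has the form $e^{-\gamma u_i} \exp\{-e^{-\gamma u_i} \lambda_i t_i^{\gamma}\}$ with $\gamma > 0$, $\lambda_i > 0$ and $t_i > 0$; with $w_i = e^{-\gamma u_i}$ we have $du_i = -dw_i/(\gamma w_i)$, and the limits $u_i \to -\infty$ and $u_i \to \infty$ map to $w_i \to \infty$ and $w_i \to 0$.

```latex
\begin{align*}
\int_{-\infty}^{\infty} e^{-\gamma u_i} \exp\{-e^{-\gamma u_i} \lambda_i t_i^{\gamma}\}\, du_i
  &= \frac{1}{\gamma} \int_0^{\infty} \exp\{-w_i \lambda_i t_i^{\gamma}\}\, dw_i
   = \frac{1}{\gamma \lambda_i t_i^{\gamma}}, \\
\gamma^k \prod_{i=1}^k t_i^{\gamma - 1} \lambda_i \cdot \frac{1}{\gamma \lambda_i t_i^{\gamma}}
  &= \prod_{i=1}^k t_i^{-1}.
\end{align*}
```

After this cancellation the $\lambda_i$'s only enter through $\prod_{i=1}^k dP(\lambda_i \mid \theta)$, which integrates to one, so that only the integral of $\pi(\gamma, \theta)$ remains.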

References

Abbring, J. and G. van den Berg (2007). The unobserved heterogeneity distribution in duration analysis. Biometrika 94(1), 87–99.

Balakrishnan, N., V. Leiva, A. Sanhueza, and F. Labra (2009). Estimation in the Birnbaum-Saunders distribution based on scale-mixture of normals and the EM-algorithm. Statistics and Operations Research Transactions 33(2), 171–192.

Banerjee, T., M. Chen, D. Dey, and S. Kim (2007). Bayesian analysis of generalized odds-rate hazards models for survival data. Lifetime Data Analysis 13(2), 241–260.

Berger, J. and D. Sun (1993). Bayesian analysis for the poly-Weibull distribution. Journal of the American Statistical Association 88(424), 1412–1418.

Chib, S. and I. Jeliazkov (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association 96(453), 270–281.

Devroye, L. (1986). Non-Uniform Random Variate Generation. Springer.

Fernández, C. and M. Steel (1998). On the dangers of modelling through continuous distributions: A Bayesian perspective. In Bayesian Statistics 6, J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith (eds.), New York: Oxford University Press, 213–238.

Fernández, C. and M. Steel (1999). Multivariate student-t regression models: Pitfalls and inference. Biometrika 86(1), 153–167.

Geisser, S. and W. Eddy (1979). A predictive approach to model selection. Journal of the American Statistical Association 74, 153–160.

Heckman, J. and B. Singer (1984a). The identifiability of the proportional hazards model. The Review of Economic Studies 51(2), 231–241.

Heckman, J. and B. Singer (1984b). A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica 52(2), 271–320.

Honoré, B. (1990). Simple estimation of a duration model with unobserved heterogeneity. Econometrica 58(2), 453–473.

Hutton, J., T. Cooke, and P. Pharoah (1994). Life expectancy in children with cerebral palsy. British Medical Journal 309(6952), 431–435.

Ibrahim, J., M.-H. Chen, and D. Sinha (2001). Bayesian Survival Analysis. Springer.

Jewell, N. (1982). Mixtures of exponential distributions. The Annals of Statistics 10(2), 479–484.

Kass, R. and A. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90(430), 773–795.

Kim, S. and J. Ibrahim (2000). On Bayesian inference for proportional hazards models using noninformative priors. Lifetime Data Analysis 6(4), 331–341.

Klein, J. P. and M. L. Moeschberger (1997). Survival Analysis: Techniques for Censored and Truncated Data (1st ed.). Springer.

Kottas, A. (2006). Nonparametric Bayesian survival analysis using mixtures of Weibull distributions. Journal of Statistical Planning and Inference 136(3), 578–596.

Kwong, G. and J. Hutton (2003). Choice of parametric models in survival analysis: applications to monotherapy for epilepsy and cerebral palsy. Journal of the Royal Statistical Society, C 52(2), 153–168.

Lomax, K. (1954). Business failures: Another example of the analysis of failure data. Journal of the American Statistical Association 49, 847–852.

Marshall, A. W. and I. Olkin (2007). Life Distributions. Springer.

McCulloch, R. E. (1989). Local model influence. Journal of the American Statistical Association 84, 473–478.

Omori, Y. and R. Johnson (1993). The influence of random effects on the unconditional hazard rate and survival functions. Biometrika 80(4), 910–914.

Peng, F. and D. K. Dey (1995). Bayesian analysis of outlier problems using divergence measures. The Canadian Journal of Statistics 23, 199–213.

Roberts, G. and J. Rosenthal (2009). Examples of adaptive MCMC. Journal of Computational and Graphical Statistics 18(2), 349–367.

Soland, R. (1969). Bayesian analysis of the Weibull process with unknown scale and shape parameters. IEEE Transactions on Reliability 18(4), 181–184.

Spiegelhalter, D., N. Best, B. Carlin, and A. Van Der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, B 64(4), 583–639.

Vallejos, C. and M. Steel (2014). Objective Bayesian survival analysis using shape mixtures of log-normal distributions. Journal of the American Statistical Association, forthcoming. http://dx.doi.org/10.1080/01621459.2014.923316.

Verdinelli, I. and L. Wasserman (1995). Computing Bayes factors by using a generalization of the Savage-Dickey density ratio. Journal of the American Statistical Association 90, 614–618.

Wasinrat, S., W. Bodhisuwan, P. Zeephongsekul, and A. Thongtheeraparp (2013). A mixture of Weibull hazard rate with a Power Variance Function frailty. Journal of Applied Sciences 13(1), 103–110.

West, M. (1984). Outlier models and prior distributions in Bayesian linear regression. Journal of the Royal Statistical Society, B 46(3), 431–439.

Wienke, A. (2010). Frailty Models in Survival Analysis. Chapman & Hall/CRC.

Supplementary material: Bayesian inference implementation details and R code for “Incorporating unobserved heterogeneity in Weibull survival models: A Bayesian approach” by C.A. Vallejos and M.F.J. Steel

Bayesian inference for the AFT-RMW model under the weakly informative priors presented in Section 3.1 is implemented via an Adaptive Metropolis-within-Gibbs algorithm with Gaussian Random Walk proposals (see Section 3 in Roberts and Rosenthal, 2009). We assume right-censoring, which is most common for survival data. Mixing parameters are handled through data augmentation (Tanner and Wong, 1987). Although the usual approach for dealing for censored (and set) observations is also by data augmentation, we do not use it for that because the Weibull survival function has a known simple form (Ibrahim et al., 2001; Kottas, 2006). Inference was implemented in R and code is available in the ’RMWcode.zip’ file. This includes the MCMC algorithm, criteria for Bayesian model comparison and a procedure for outlier detection. To save space, some details of the implementation were omitted in the article. These are explained below. Throughout, equation numbers refer to the main article. Update of In principle, can take any value in (0, 1). However, numerical problems are observed when very small values of are proposed, in which case cv W ( ) becomes very large (in fact, cv W ( ) ! 1 when ! 0). As a solution, regardless of the mixing distribution, we truncate the range of to (0.06, 1). This has no practical consequences as such small values of are very rarely required for real datasets. Update of the mixing parameters The sampler involves the update of ⇤1 , . . . , ⇤n at each step of the chain. This may be computationally inefficient (especially in situations in which sampling from the mixing variables is cumbersome). In order to mitigate this problem, the ⇤i ’s will be sampled only every Q iterations of the chain. The value of Q is chosen by considering the Effective Sample Size (ESS) of the chain and the CPU time required. Even though longer chains are required for Q > 1 (the mixing of the chains is affected), the reduction in term of CPU time can be substantial. 
For the inverse gamma and inverse Gaussian mixing distributions, the full conditionals of the mixing parameters are Generalized Inverse Gaussian distributions. These full conditionals are sampled using the algorithm proposed in Leydold and Hörmann (2011), via the function rgig contained in the R library ghyp.

Detection of outliers and influential observations

For each observation, the models $M_0: \Lambda_i = \lambda_{ref}$ and $M_1: \Lambda_i \neq \lambda_{ref}$ are contrasted. Bayes factors between them can be computed as the generalized Savage-Dickey density ratio proposed by Verdinelli and Wasserman (1995) and stated in (10). When the parameter $\theta$ does not appear in the model, this simplifies to the original version of the Savage-Dickey ratio in (11). However, if $\theta$ is unknown, the procedure is computationally intensive, since a reduced run of the MCMC algorithm (in which $\Lambda_i$ is fixed at $\lambda_{ref}$) is required for each observation i. Nevertheless, as the n runs are independent, the process can be easily parallelized. Our R functions for outlier detection receive i as input in order to facilitate this. In a multi-core environment, each run can be sent to a different node.

Evidence of influential observations is evaluated through the effect on the posterior distribution of deleting one observation at a time. This evidence is quantified by means of the Kullback-Leibler divergence $K_i = KL(\pi(\beta, \gamma, \theta \mid t), \pi(\beta, \gamma, \theta \mid t_{-i}))$. As explained in Cho et al. (2009), it can easily be computed from MCMC output. Occasionally, numerical issues can lead to a negative estimate of $K_i$. In such a situation, a warning is printed.

R code

The ‘RMWcode.zip’ file contains code to implement the algorithms and the Bayesian model comparison and outlier detection methods discussed in the article. The code was developed in R, version 3.0.1. First, the libraries ghyp and compiler must be installed in R. The latter speeds up the “for” loops. These libraries are freely available from standard R repositories and are loaded when “Internal Codes.R” is executed. Table 1 indicates the notation used throughout the code.

Table 1: Notation used throughout the R code

| Variable name | Description |
| --- | --- |
| Time | Vector containing the survival times |
| Cens | Censoring indication (1: observed, 0: right-censored) |
| n | Total number of observations |
| k | Number of covariates (including intercept) |
| X | Design matrix with dimensions $n \times k$ |
| N | Total number of iterations (MCMC algorithms) |
| thin | Thinning period (MCMC algorithms) |
| burn | Burn-in period (MCMC algorithms) |
| Q | Update period for the $\lambda_i$'s (MCMC algorithms, except unmixed exponential/Weibull model) |
| beta0 | Starting value for $\beta$ (MCMC algorithms) |
| gam0 | Starting value for $\gamma$ (MCMC algorithms) |
| theta0 | Starting value for $\theta$ (MCMC algorithms, if required) |
| typ.theta | Type of prior assigned to $\theta$. Options: TruncExp, Pareto |
| hyp.theta | Hyper-parameter value for the prior assigned to $\theta$ |
| hyp.gam1 | Shape hyper-parameter value for the Gamma prior assigned to $\gamma$ |
| hyp.gam2 | Rate hyper-parameter value for the Gamma prior assigned to $\gamma$ |
| ar | Optimal acceptance rate for the adaptive Metropolis-Hastings updates (default value: 0.44) |
| EXP | Logical indicator. If TRUE, the value of $\gamma$ is fixed and equal to gam0 |
| obs | Indicates the number of the observation under analysis (outlier detection only) |
| ref | Reference value $\lambda_{ref}$ (outlier detection only) |
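To make the notation in Table 1 concrete, the following sketch builds objects with these names from simulated data. It is only an illustration: the data-generating mechanism, the censoring distribution and every numerical value except the default ar = 0.44 are hypothetical. No function from “User Codes.R” is called, since their exact signatures are documented in the code files rather than here.

```r
set.seed(123)

# Data-related objects
n <- 100                          # total number of observations
k <- 2                            # number of covariates (including intercept)
X <- cbind(1, rnorm(n))           # n x k design matrix

# Hypothetical Weibull event times and independent censoring times
event.time <- rweibull(n, shape = 1.2,
                       scale = exp(as.vector(X %*% c(1, 0.5))))
cens.time <- rexp(n, rate = 0.05)

Time <- pmin(event.time, cens.time)          # observed survival times
Cens <- as.numeric(event.time <= cens.time)  # 1: observed, 0: right-censored

# MCMC settings (all values are illustrative)
N <- 10000       # total number of iterations
thin <- 10       # thinning period
burn <- 2500     # burn-in period
Q <- 10          # update period for the lambda_i's
ar <- 0.44       # optimal acceptance rate (default value)

# Starting values
beta0 <- rep(0, k)
gam0 <- 1
theta0 <- 1

# Prior settings (illustrative)
typ.theta <- "TruncExp"   # alternative option: "Pareto"
hyp.theta <- 1
hyp.gam1 <- 1             # shape of the Gamma prior for gamma
hyp.gam2 <- 1             # rate of the Gamma prior for gamma
EXP <- FALSE              # if TRUE, gamma is fixed and equal to gam0

# Outlier detection only
obs <- 1                  # observation under analysis
ref <- 1                  # reference value (hypothetical)

# Basic consistency checks before running any sampler
stopifnot(nrow(X) == n, ncol(X) == k,
          length(Time) == n, length(Cens) == n,
          all(Cens %in% c(0, 1)), all(Time > 0),
          length(beta0) == k)

# Number of rows of the MCMC output matrix, as described for MCMC
n.rows <- N / thin + 1
```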

The code is split into two files. The file “Internal Codes.R” contains functions that are required for the implementation, but the user is not expected to interact with them directly. These functions must be loaded in R before doing any calculations. The remaining functions are contained in the file “User Codes.R”. The names of these functions have two components separated by a dot. The first component indicates the algorithm or method implemented in the function, while the second indicates the model for which the function should be used: Weibull (WEI), RMW with exponential(1) mixing (RMWEXP), RMW with Gamma($\theta$, $\theta$) mixing (RMWGAM), RMW with inv-gamma($\theta$, 1) mixing (RMWIGAM), RMW with invGauss($\theta$, 1) mixing (RMWIGAUSS) and RMW with log-normal(0, $\theta$) mixing (RMWLN). In the following, a short description of these functions is provided. The use of these functions is also illustrated in the file “Example.R”.

• MCMC. Adaptive Metropolis-within-Gibbs sampler with univariate Gaussian random walk proposals (see Section 3 in Roberts and Rosenthal, 2009). The output is a matrix with N/thin+1 rows whose columns are the MCMC chains for $\beta$ (k columns), $\gamma$ (1 column), $\theta$ (1 column, if appropriate), $\lambda$ (n columns, not provided for the Weibull model) and the logarithm of the adaptive variances (the number varies among models). The latter allows the user to evaluate whether the adaptive variances have stabilized. Overall acceptance rates are printed in the R console (if appropriate). These values should be close to the optimal acceptance rate ar.

For the following functions, an MCMC chain generated using the function MCMC (after removing the burn-in period) is required as an argument.

• DIC. Deviance Information Criterion (Spiegelhalter et al., 2002). It is based on the deviance function $D(\beta, \gamma, \theta, t) = -2 \log(f(t \mid \beta, \gamma, \theta))$ but also incorporates a model complexity penalty. DIC is defined as

$$DIC \equiv E(D(\beta, \gamma, \theta, t) \mid t) + p_D = E(D(\beta, \gamma, \theta, t) \mid t) + \{E(D(\beta, \gamma, \theta, t) \mid t) - D(E(\beta, \gamma, \theta \mid t), t)\},$$

where $p_D$ can be interpreted as the effective number of parameters of the model. This function returns a single number, which is a Monte Carlo estimate of the DIC. It is computed using the likelihood marginalized w.r.t. the mixing parameters (numerical integration via the R function integrate is used for DIC.RMWLN). The effective and actual number of model parameters are printed in the R console.

• LML. Log-marginal likelihood estimator. The marginal likelihood is defined as

$$m(t) = \int_{-\infty}^{\infty} \int_{0}^{\infty} \int_{\Theta} f(t \mid \beta, \gamma, \theta) \, \pi(\beta, \gamma, \theta) \, d\beta \, d\gamma \, d\theta,$$

and $\log(m(t))$ is computed using the algorithm in Chib (1995) and Chib and Jeliazkov (2001). The output is a list containing the value of the log-likelihood ordinate, log-prior ordinate, log-posterior ordinate and log-marginal likelihood estimate. Several messages are printed in order to indicate the progress of the algorithm. This algorithm was developed for a non-adaptive scheme, so we compute the marginal likelihoods from non-adaptive chains, using the stabilized proposal variances and started at the posterior median values of the adaptive chains.

• CaseDeletion. Leave-one-out cross-validation. The function returns a matrix with n rows. Its first column contains the logarithm of the CPO criterion in (9) (Geisser and Eddy, 1979; Geisser, 1993). Larger values of CPO indicate better predictive accuracy of the model. The second and third columns contain the KL divergence between $\pi(\beta, \gamma, \theta \mid t_{-i})$ and $\pi(\beta, \gamma, \theta \mid t)$ and its calibration index (Cho et al., 2009), respectively. These are used to assess whether there are influential observations: if the calibration index for observation i is much larger than 0.5, the observation is declared influential. The logarithm of PsML can be computed as the sum of log CPO across the n observations.

• BF.lambda.obs. Outlier detection for observation obs. This returns a single number corresponding to the Bayes factor in favour of $M_0: \Lambda_i = \lambda_{ref}$ versus $M_1: \Lambda_i \neq \lambda_{ref}$ (with all other $\Lambda_j$, $j \neq i$, free). The value of ref is required as input (reference values are provided in the paper). The user should expect longer running times for RMW models containing an unknown parameter $\theta$ (e.g. RMW model with gamma($\theta$, $\theta$) mixing), in which case n reduced chains (given fixed $\Lambda_i = \lambda_{ref}$) need to be generated (for which N, thin, burn and Q are required).

• OD.Correction. Correction factor applied to the reference value for right-censored observations in the outlier detection procedure implemented in BF.lambda.obs.

Convergence and mixing of the sampler

Convergence of the MCMC chains was never a problem with the burn-in used for the examples displayed in Section 4. Mixing is very good for models without an extra parameter $\theta$ in the mixing distribution (e.g. Weibull and RMW model with exponential(1) mixing). In the presence of an unknown $\theta$, reliable inference is obtained through the described algorithms, but the chains mix somewhat less well for some of the parameters, requiring MCMC run lengths of the order used here.
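The quantities returned by DIC and CaseDeletion can be illustrated with a minimal sketch that works directly on posterior draws. It is not the code in ‘RMWcode.zip’: it assumes a plain Weibull model without covariates or mixing, uses R's (shape, scale) parametrization, and replaces real MCMC output with stand-in draws, so the numbers themselves carry no meaning. The KL estimate uses the identity $K_i = E(\log f(t_i \mid \cdot) \mid t) - \log CPO_i$, and the calibration index is assumed to take the usual form $0.5\,[1 + \sqrt{1 - \exp(-2 K_i)}]$.

```r
set.seed(42)

# Toy data (assumption): Weibull times with a hypothetical censoring indicator
n <- 40
Time <- rweibull(n, shape = 1.5, scale = 2)
Cens <- rbinom(n, 1, 0.8)            # 1: observed, 0: right-censored

# Log-likelihood contribution of each observation:
# log density if observed, log survival function if right-censored
loglik.i <- function(par, Time, Cens) {
  ifelse(Cens == 1,
         dweibull(Time, shape = par[1], scale = par[2], log = TRUE),
         pweibull(Time, shape = par[1], scale = par[2],
                  lower.tail = FALSE, log.p = TRUE))
}

# Stand-in for MCMC output after burn-in (NOT a real posterior sample)
S <- 500
draws <- cbind(shape = rlnorm(S, log(1.5), 0.1),
               scale = rlnorm(S, log(2), 0.1))

# S x n matrix of pointwise log-likelihoods
ll <- t(apply(draws, 1, loglik.i, Time = Time, Cens = Cens))

# DIC: posterior mean deviance plus effective number of parameters
D.bar <- mean(-2 * rowSums(ll))
D.hat <- -2 * sum(loglik.i(colMeans(draws), Time, Cens))
pD <- D.bar - D.hat
DIC <- D.bar + pD

# Numerically stable log of a mean of exponentials
log.mean.exp <- function(x) {
  m <- max(x)
  m + log(mean(exp(x - m)))
}

# log CPO_i = -log E(1 / f(t_i | parameters) | t), estimated from the draws
log.CPO <- -apply(-ll, 2, log.mean.exp)
log.PsML <- sum(log.CPO)

# KL divergence between full and case-deleted posteriors, and calibration
KL <- colMeans(ll) - log.CPO
if (any(KL < 0)) warning("Negative KL estimate (numerical issues)")
calib <- 0.5 * (1 + sqrt(1 - exp(-2 * pmax(KL, 0))))

case.deletion <- cbind(log.CPO = log.CPO, KL = KL, calibration = calib)
```

In this sketch, `case.deletion` mirrors the three-column layout described for CaseDeletion, with one row per observation.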

References

S. Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90:1313–1321, 1995.

S. Chib and I. Jeliazkov. Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96:270–281, 2001.

H. Cho, J.G. Ibrahim, D. Sinha, and H. Zhu. Bayesian case influence diagnostics for survival models. Biometrics, 65:116–124, 2009.

S. Geisser. Predictive inference: an introduction. Monographs on Statistics and Applied Probability Series. Chapman and Hall, 1993.

S. Geisser and W.F. Eddy. A predictive approach to model selection. Journal of the American Statistical Association, 74:153–160, 1979.


J.G. Ibrahim, M-H. Chen, and D. Sinha. Bayesian Survival Analysis. Springer, 2001.

A. Kottas. Nonparametric Bayesian survival analysis using mixtures of Weibull distributions. Journal of Statistical Planning and Inference, 136:578–596, 2006.

J. Leydold and W. Hörmann. Generating generalized inverse Gaussian random variates by fast inversion. Computational Statistics & Data Analysis, 55:213–217, 2011.

G.O. Roberts and J.S. Rosenthal. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18:349–367, 2009.

D.J. Spiegelhalter, N.G. Best, B.P. Carlin, and A. Van Der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B, 64:583–639, 2002.

M.A. Tanner and W.H. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82:528–540, 1987.

I. Verdinelli and L. Wasserman. Computing Bayes factors by using a generalization of the Savage-Dickey density ratio. Journal of the American Statistical Association, 90:614–618, 1995.
