Sequential Ordinal Modeling with Applications to Survival ... - CiteSeerX

7 downloads 0 Views 287KB Size Report
event Yi = j if a failure is observed in the interval aj?1; aj). .... model. Both the Pearson and deviance statistics have asymptotic chi-squared distributions ... Extending Albert and Chib (1993), this Bayesian sequential model can be t using latent.
Sequential Ordinal Modeling with Applications to Survival Data James Albert

Bowling Green State University, Bowling Green, USA Siddhartha Chib

Washington University, St. Louis, USA May 1998. Abstract

This paper considers the class of sequential probit models in relation to other models for ordinal data. Hierarchical and other extensions of the model are proposed for applications involving discrete time (grouped) survival data. Computationally practical Markov chain Monte Carlo algorithms are developed for the tting of these models. The ideas and methods are illustrated in detail with a real data example on the length of hospital stay for patients undergoing heart surgery. A notable aspect of this analysis is the comparison, based on marginal likelihoods and training sample priors, of several non-nested models, such as the sequential model, the cumulative ordinal model and Weibull and log-logistic models.

Keywords: Bayes factor; Discrete hazard function; Gibbs sampling; Marginal likelihood; Metropolis-Hastings algorithm; Non-nested models; Sequential probit; Training sample prior; Model comparison.

1 Introduction Ordinal response data is generally analyzed using the cumulative ordinal regression model (McCullagh, 1980). Underlying this model is a continuous latent variable that is observed in categorical form, depending on the value taken by the latent variable. In some circumstances, however, a di erent model may be more suitable in the development of an ordinal response model. For example, if one is studying the factors that determine the educational  Address for correspondence: Department of Mathematics and Statistics, Bowling Green State University,

Bowling Green, OH 43403, USA. Email addresses: [email protected]; [email protected].

1

attainment of students (and the outcomes are categorized into levels corresponding to prehigh school, high school and college attainments), it may be more appropriate to consider an ordinal model in which the outcomes are obtained through a sequential mechanism. For example, one achieves the high school attainment only if one has completed pre-high school. As another illustration, if one is interested in the length of stay of patients in a hospital, one may treat the duration of stay (an ordinal variable) as discrete, or grouped, and model the data through the speci cation of a (discrete time) hazard function. One general model for such problems is the sequential ordinal model. Suppose that one observes independent observations Y1 ; :::; Yn, where each Yi is an ordinal categorical response variable with J possible values f1; :::; J g. Associated with the ith response Yi , let xi = (xi1 ; :::; xip) denote a set of p covariates. In the sequential model, the variable Yi can take the value j only after the levels 1; :::; j ? 1 are reached. In other words, to get to the outcome j , one must pass through levels 1, 2, ..., j ? 1 and stop (or fail) in level j . The probability of stopping in level j (1  j  J ? 1), conditional on the event that the j th level is reached, is given as Pr(Yi = j jYi  j; ;  ) = F ( j ? x0i  ) ;

(1)

where = ( 1; :::; J ?1) are unordered cutpoints and x0i  represents the regression structure. This probability function is referred to as the discrete time hazard function [Tutz (1990, 1991), Fahrmeir and Tutz (1994)]. It follows that the probability of stopping at level j is given by jY ?1 0 Pr(Yi = j j ;  ) = F ( j ? xi  ) f1 ? F ( k ? x0i  )g ; j  J ? 1 ; k=1

(2)

whereas the probability that the nal level J is reached is Pr(Yi = J j; ) =

JY ?1 k=1

f1 ? F ( k ? x0i)g ;

(3)

since the event Yi = J occurs only if all previous J ? 1 levels are passed. The sequential model is useful in the analysis of discrete-time survival data. Fahrmeir and Tutz (1994), Chapter 9, show that the proportional hazards model for grouped or 2

discrete-time data is equivalent to a sequential model with an extreme-minimal value distribution F . More generally, the sequential model provides a straightforward way of modeling non-proportional and non-monotone hazard functions and time dependent covariates. Suppose that one observes the time to failure for subjects in the sample, and let the time interval be subdivided (or grouped) into the J intervals [a0 ; a1); [a1; a2); :::; [aJ ?1; 1). We de ne the event Yi = j if a failure is observed in the interval [aj ?1 ; aj ). The discrete hazard function (1) now represents the probability of failure in the time interval [aj ?1 ; aj ) given survival until time aj ?1 . The vector of cutpoint parameters represents the baseline hazard of the process. In this article we present a Bayesian analysis of the sequential probit model with particular emphasis on the issue of model comparison. Section 2 describes classical and Bayesian tting methods for the sequential model. The Bayesian methods generalize the latent variable/Gibbs sampling approach of Albert and Chib (1993). In the context of survival data, Section 3 discusses hierarchical and other generalizations of the sequential model. These models di er by the speci cation of the baseline hazard function, the choice of covariates, and the e ect of the covariates on the hazard function. Section 4 describes and outlines tting algorithms for alternative models for tting discrete time series data. These models include the basic cumulative ordinal model and continuous-time survival models with Weibull and log-logistic error structures. A marginal likelihood approach based on Chib (1995) for comparing these models is described in Section 5 and an extended analysis of the length of hospital stay dataset is presented in Section 6.

2 Fitting of the basic sequential model 2.1 Likelihood function

Suppose that the hazard function has the form (1) and let y = (y1 ; :::; yn) represent the vector of observed ordinal responses. Then, the likelihood function of the vector of cutpoints

and the regression vector is given by

L( ; ) =

Y i:yi > >< 2 Yi = > ... >> J ? 1 :J

if zi1  0 if zi1 > 0; zi2  0 .. . if zi1 > 0; :::; ziJ ?2 > 0; ziJ ?1  0 if zi1 > 0; :::; ziJ ?1 > 0: 6

(6)

To estimate this model, a Markov chain Monte Carlo sampling algorithm can be designed. One simulates from the joint posterior distribution of (fzij g; ) by sequentially sampling from the two conditional posterior distributions

fzij gj ; jfzij g: The latent data fzij g, conditional on the parameter vector , are independently distributed from truncated normal distributions. From (6) one can see that more than one latent variable must be generated for each observation, unless Yi = 1. For example, if Yi = 3 and J = 4, then we draw zi = (zi1 ; zi2; zi3) from zi jYi ; . The set-valued inverse map of the function in (6) yields [zi1jYi = 3; ]  T N (0;1)(x0i1 ; 1) ; [zi2jYi = 3; ]  T N (0;1)(x0i2 ; 1) ; [zi3jYi = 3; ]  T N (?1;0)(x0i3 ; 1) ; where T N (a;b)(;  2)denotes the normal(;  2) distribution truncated to the interval (a; b). In general, if Yi = j , then the latent data corresponding to this observation are represented as zi = (zi1 ; :::; zij ) where ji = minfj; J ? 1g and the simulation of zi is from a sequence of truncated normal distributions. Similarly, the posterior distribution of the parameter vector , conditional on values of the latent data z = (z1 ; :::; zn), has a simple form. Let Xi denote the covariate matrix corresponding to the ith subject consisting of the ji rows x0i1 ; :::; x0ij . Let X denote the matrix obtained by stacking the covariate matrices X1; :::; Xn. If we de ne the column vector z consisting of stacking the latent vectors z1 ; :::; zn, then z is distributed multivariate normal with mean vector X and identity covariance matrix. If is assigned a multivariate normal prior with mean vector 0 and variance-covariance matrix B0?1 , then the posterior ^  P distribution of , conditional on the latent data, is N ; (B0 + ni=1 Xi0Xi )?1 , where P P ^ = (B0 + ni=1 Xi0Xi)?1 (B0 0 + ni=1 Xi0zi ). To summarize, the MCMC algorithm for tting the basic sequential ordinal model where = ( ; )has the prior N ( 0; B0?1 )is described as follows. i

i

Algorithm 1: MCMC sampling for the basic sequential model 7

1. Sample z jy; by sampling zi j(Yi = j; ) for each i. If yi = 1, generate zi1 from T N (?1;0)(x0i ; 1). If yi = j (2  j  J ? 1 ), generate fzik gjk?=11 independently ?1 from T N (0;1)(x0i ; 1), and zij from T N (?1;0)(x0i ; 1). If yi = J , generate fzik gkJ=1 independently from T N (0;1)(x0i ; 1).





^ (B0 + Pni=1 Xi0Xi )?1 . 2. Simulate jy; z by sampling from the distribution Nk ;

3. Go to 1 and repeat the iterations using the most recent values of the conditioning variables. The above algorithm can be adjusted to account for censoring of the type described in Section 2.1. In step 2 of the algorithm, if the observation is censored at yi = j (di = 0), then one generates ji = j ? 1 independent variates zi1 ; :::; zij from the truncated normal distributions T N (0;1)(x0ik ; 1); k = 1; :::; j . Step 2 of the algorithm is unchanged.

3 Discrete time survival analysis and hierarchical sequential models 3.1 Modeling the hazard function

In the survival analysis setting, a number of di erent models can be proposed. These models di er by the speci cation of the baseline hazard, the choice of covariates, and the e ect of the covariates on the hazard function. The hazard of the basic model is given by Pr(Yi = j jYi  j; ;  ) = F ( j ? x0i  ): This model allows for an arbitrary shape of the baseline hazard since the cutpoints f j g are unspeci ed. In addition, the covariates xi are assumed to change the baseline hazard by the same amount across time. We refer to the covariates in this model as global covariates, since the in uence of these variables on the discrete hazard will not depend on the time interval. In some cases, the covariates may not e ect the response probabilities in a global fashion. The e ect of a particular covariate on the hazard function may be di erent across time. Suppose we subdivide the vector of covariates xi into a vector xi(1) of category-speci c 8

covariates whose e ect may depend on the particular response category and a vector xi(2) which consists of global covariates. We will refer to these models as interaction models since there is interaction present between some covariates and the time variable. In this case the hazard is represented as Pr(Yi = j jYi  j; ; 1; :::; J ?1; ) = F ( j ? x0i(1)j ? x0i(2) ):

(7)

In this model, the regression parameter  represents the e ect of the global covariates and the set fj g re ects the e ects of the category-speci c covariates. The tting of these models proceeds as in Algorithm 1 with

zij j  N (x0ij ; 1) ; where now = ( 1; :::; J ?1; 10 ; :::; J0 ?1;  0)0 and

x0ij = (0; 0; :::; ?1; 0; 0; ::: ? x0i(1); 0; :::; 0; x0i(2)) ; with ?1 in the j th column and ?x0i(1) starting in the (J ? 1 + j )th column. Finally, the matrix Xi is given by the rst ji = minfj; J ? 1g rows of the matrix

0 ?1 0    BB 0 ?1    BB . . . @ .. .. .. 0

0

0 0 .. .

?x0i

0 .. .    ?1 0

0

(1)

?x0i

(1)

0 0

0 0 .. .

1

x0i(2) x0i(2) C C C

  .. .

   ?xi

(1)

.. .

x0i(2)

C A

The above two models are suitable in the situation where there are only a few possible values of the response Yi . However, in the discrete survival context, there may be a large number of time intervals, and therefore a large number of possible response categories J . In this situation, it is desirable to model the baseline categories f j g and/or the fj g by a parsimonious function f (j ) containing a small number of parameters which re ects the general features of the discrete hazard [Mantel and Hankey (1978), Efron (1988)]. If the hazard is believed to be constant across time, this motivates the model of the form

j =  0 : 9

An example of a more exible function that could be used to model a constant, an increasing, or a decreasing hazard function is the quadratic form

j = 0 + 1 j + 2 j 2; where the parameter values  = (0; 1 ; 2) are unknown and must be estimated from the data. In the case where one is tting an interaction model, then the category-speci c coecients fj g may also be hypothesized to lie on a low-order polynomial on the category index. In particular, suppose that j is m dimensional, then one can let the lth component of j (ie., jl ) lie on a polynomial

jl = l0 + l1j + l2 j 2 with unknown parameters k0 , k1 and k2 . In vector notation these constraints can be expressed as j = Aj  ; j  J ? 1 where

?

Aj = Im 1 j j 2



is a m times 3m matrix,  = (10 ; :::; m0 )0 is a 3m vector and each l is a three by one vector consisting of (l0; l1; l2). The sequential model under polynomial restrictions on both f j g and fj g can again be t using Algorithm 1 after de ning Xi to be rst ji rows of the matrix 0 ?1 ?1 1 ?1 ?x0i(1)A1 x0i(2) 0 BB ?1 ?2 ? 2 ? x x0i(2) C C i(1) A2 BB .. .. .. .. .. C @. . . . 0 . C A 0 2 ?1 ?(J ? 1) ?(J ? 1) ?xi(1)AJ ?1 xi(2) with xij equal to the j th row of this matrix. The parameters of the resulting model are given by = (0;  0;  0)0 .

3.2 Hierarchical sequential models Two distinct representations of the discrete baseline hazard function have been described. The rst representation allows for a distinct parameter j corresponding to each observation 10

category j , and the second model restricts the set of cutpoints f j g to lie on a smaller dimension surface. One can use a hierarchical model to compromise between these two extremes. For example, the f j g can be assumed to satisfy the model

j =  0 +  1 j +  2 j 2 +  j ; where the fj g are independent normal with mean 0 and variance  2. To complete the model, the normal variance  2 can be assigned a prior g ( 2). This model will allow the data to shrink the observed hazard function towards the quadratic structure. The degree of shrinkage will depend on the compatibility of the data with this polynomial surface. In general, in the model with interaction variables, the vector = ( ; 1; :::; J ?1;  ) consisting of cutpoints , the category speci c parameters 1 ; :::; J ?1 and  may be believed to follow the hierarchical model

 N (A 0; B ?1 ); where A is a known covariate matrix, 0 is an unknown vector of smaller dimension than , and B is a patterned precision matrix with a vector of unknown parameters b. We write B (b) to emphasize the dependence of the precision matrix on b. The prior model is completed by assigning the hyperparameters ( 0; b) a vague prior at the second stage. Assume 0, b independent with 0 assigned a at prior and b the prior density g (b). This model allows one to shrink both the baseline hazard and the regression model towards a more parsimonious structure. To illustrate the method, suppose one wishes to shrink the baseline hazard, as parameterized by the vector of cutpoints , towards a lower-dimensional quadratic surface. This can be modeled by letting = ( ;  )0 be distributed N (A 0; B ?1 ), where A and B are given by A 0   ?2I 0  0 J ?1 A= 0 0 ; B = 0 0 ;

01 B1 A =B B @ ... 0

1 2 .. . 1 (J ? 1) 11

1 2 .. . (J ? 1)2

1 C C C A

and

0 = (0 ; 1; 2; 00 )0

To complete the model, the parameters 0 and  2 can be assumed independent, 0 assigned a at prior and  2 assigned an Inverse-Gamma(c; d) distribution with small values of c and d. For the general hierarchical model, a Gibbs sampling algorithm [Gelfand and Smith (1990)] can be constructed for the set (z; ; 0; b) based on standard results for normal hierarchical models. The algorithm is outlined below, giving all of the full-conditional posterior distributions. Note that steps 1 and 2 resemble the corresponding steps for the MCMC algorithm of the sequential model (Algorithm 1). The di erences are primarily in the simulation of the second-stage hyperparameters in steps 3 and 4.

Algorithm 2: MCMC sampling for the hierarchical sequential model 1. Sample z jy; ; ; b by sampling zi j(Yi = j; ) for each i. If yi = 1, generate zi1 from T N (?1;0)(x0i ; 1). If yi = j (2  j  J ? 1 ), generate fzik gjk?=11 independently ?1 from T N (0;1)(x0i ; 1), and zij from T N (?1;0)(x0i ; 1). If yi = J , generate fzik gkJ=1 independently from T N (0;1)(x0i ; 1).

^  P (B (b) + ni=1 Xi0Xi )?1 , 2. Simulate jy; z; 0; b by sampling from the distribution Nk ; where n n ^ = (B (b) + X Xi0Xi )?1 (B (b)A 0 + X Xi0zi ): i=1

i=1

3. Simulate 0jy; z; ; b by sampling from the distribution N ( ^0; (A0B (b)A)?1), where

^0 = (A0B(b)A)?1(A0 B(b) ): 4. Simulate bjy; z; ;  by sampling from the density proportional to   1 0 1=2 jB(b)j exp ? 2 ( ? A 0) B(b)( ? A 0) g(b): 5. Go to 1 and repeat the iterations using the most recent values of the conditioning variables. 12

4 Fitting of related models In this section we consider the MCMC based tting of two alternative classes of models for ordinal data. A method for comparing these models with the sequential model is discussed in Section 5.

4.1 Cumulative ordinal model The cumulative model (McCullagh, 1980) is de ned by the distribution Pr(Y  j j ;  ) = F ( jc ? x0i  ); j = 1; :::; J ;

(8)

where F () is a speci ed cumulative distribution function,  is the regression parameter vector, and c = ( 1c; :::; Jc ?1) are the category speci c cutpoints or thresholds (the super-script \c" is used to denote the cumulative model.) The cutpoints satisfy the order restriction

1c  :::  Jc ?1 to ensure that the cumulative probabilities are non-decreasing. Suppose that one is interested in analyzing the cumulative model under the assumption that F is , the cdf of the standard normal random variable . The likelihood function in this case is given by

L( c; ) =

YJ Y  c 0  ( j ? xi  ) ? ( jc? ? x0i  ) 1

j =0 i:yi =j

(9)

and the posterior density, under the prior  ( c;  ) is proportional to  ( c;  )L( c;  ). Albert and Chib (1993) showed that the posterior density could be sampled by introducing latent variables z1 ; :::; zn, one for each observation i, such that zi = xi  + "i , where "1 ; :::; "n are independently distributed from the distribution . A priori, we observe Yi = k if the latent variable zi falls in the interval [ kc?1; kc ), where we de ne 0c = ?1 and Jc = 1. Cowles (1996) modi ed the Albert and Chib algorithm by proposing that the cut-points be generated, marginalized over the z 's. The tting of the model now is based on the full conditional posterior densities

 cjy;   zijyi; ; c (i  n) and 13

 jy; fzig; c If the parameters c;  are known, the latent variables fzi g are distributed from independent truncated normal distributions, where the truncation points depend on the observed values fyi g. For xed values of the latent data the conditional posterior distribution of the regression vector  has (assuming a multivariate normal prior) a multivariate normal distribution. The most challenging sampling regards the conditional distribution of the cutpoint parameters. Because the parameters c are ordered, one can construct a real-valued parameters = ( 1 ; :::; J ?1) by means of the transformation

1 = log 1 ; j = log( j ? j?1 ) ; 2  j  J ? 1: with inverse map given by

j =

j X i=1

(10)

exp( i ) 1  j  J ? 1:

In this case, one can sample c indirectly by sampling by use of a multivariate Metropolis algorithm with a multivariate-t proposal density. The method of Cowles (1996) works on the original cut-points c and, as a result, leads to a more intricate algorithm. Suppose that the regression vector  has a normal prior distribution with mean 0 and variance-covariance matrix D0?1 . Also let

?(+J ?1)=2 1 0 ? 1 fT ( j ^; V;  ) / 1 +  ( ? ^ ) V ( ? ^) 

denote a multivariate-t density with  degrees of freedom whose mean ^ and dispersion V are the mode and negative inverse Hessian of f (y j c;  ), when viewed as a function of . Then, the MCMC sampler for the cumulative ordinal model is given as follows.

Algorithm 3: MCMC sampling for the cumulative ordinal model 1. Sample jy;  by drawing an from jy;  and then using the inverse map. To draw , rst sample a proposal value t from fT ( j ^;  2V;  ) and then move to t from the current value with probability of move  f (yj t; ) ( t) f ( j ^;  2V;  )  t m( ; ) = min f (yj ; ) ( ) f T( tj ^;  2V;  ) ; 1 T 14

2. Sample z jy; ;  by sampling zi j(yi = j; ;  )  T N [ ?1 ; ] (x0i ; 1) a truncated normal distribution on the interval ( j ?1 ; j ). j

j





P 3. Sample  jy; z; by sampling the distribution Nk ^; (D0 + ni=1 x0i xi )?1 where ^ = P P (D0 + ni=1 x0i xi )?1 (D00 + ni=1 xi zi ). 4. Go to step 1 and repeat steps 1-3 using the most recent values of all conditioning variables.

4.2 Weibull and log-logistic models When one is applying the sequential model to failure time data it may be useful to compare the t of the model to familiar models for continuous survival data, such as the Weibull and log-logistic. Let the event (Yi = j; di) be denoted by (ti = j; di) where di is the noncensoring indicator and let i = exp(?x0i  ). Then, under the Weibull model, the hazard function and survival functions are given by

h(ti) = i (iti ) ?1 S (ti) = exp(?(i t) ) respectively, where and  are unknown parameters. Given a sample of data (ti = j; di), (i  n), the likelihood function of the model is

L( ; ) =

Y

i:i=1

h(ti)S (ti )

Y

i:i=0

S (ti )

and the posterior density by  ( ;  jt1; :::; tn) /  ( ;  )f (t1; :::; tnj ;  ), where  ( ;  ) is the prior density. In the case of the log-logistic model the corresponding hazard and survivor functions are i (iti ) ?1 h(ti) =  [1 + ( t ) ]

and

ii

S (ti) = 1 + (1 t ) ii

15

respectively. The likelihood function and posterior density take the same form as in the Weibull case. The posterior density in each model can be simulated using a MetropolisHastings algorithm (Chib and Greenberg (1995)).

5 Comparison of models by Bayes factors 5.1 Introduction

As discussed in Section 2.2, the use of the classical deviance statistic is limited in that it does not allow for the comparison of non-nested models such as the basic cumulative ordinal and sequential models described above. An alternative approach for comparing models which can be nested or non-nested is based on the notion of a Bayes factor. If y denotes the observation and  denotes the parameter, a Bayesian model M consists of a sampling density f (y jM; ) and a prior density  (jM). For any two models, Mr = ff (y jMr; r ), (r jMr)g and Ms = ff (yjMs; s), (sjMs)g, the Bayes factor of r vs s is de ned as R f (yjM ;  )( jM )d m ( y jM ) r Brs = m(yjM ) = R f (yjMr ; r )(rjMr)dr s s s s s s where m(y jMk ) is the marginal likelihood of Mk [Kass and Raftery (1995)]. The Bayes factor can be understood much like a likelihood ratio. For example, if Brs = 10, then model Mr is supported 10 times more by the data then model Ms. It is convenient, as in Good (1967), to express the Bayes factor on a log base 10 scale. If log Brs = 0 , the models receive equal support from the data, and if log Brs = ?2, model Ms has 100 times more support from the data. To compute the Bayes factor, the marginal likelihood m(y jMr) needs to be evaluated for each model and this quantity is not directly obtainable from the output of the MCMC simulation which estimates the posterior distribution. In this paper we adopt the procedure of Chib (1995) to estimate the marginal likelihood from the MCMC output. The Chib method is based on the identity   m(y) = f (yj()jy() ) where for convenience we have suppressed the model indicator. In this identity, which follows from Bayes theorem,  is any point in the parameter space (usually taken to be a high 16

density point, such as the posterior mean or the posterior mode). The likelihood ordinate f (yj ) is available for each model and the remaining issue is about the speci cation of the prior and the computation of the posterior ordinate.

5.2 Priors The formulation of proper prior distributions is an important element when the goal of the analysis is to compare (non-nested) models using Bayes factors. The parameters of the cumulative ordinal model are the regression vector and the vector of cut-points . It is convenient to assume that the parameters and are a priori independent and to assign a multivariate normal distribution with mean vector 0 and covariance matrix B0?1 . A normal prior distribution is not appropriate for due to the order restriction in the components. As discussed above, can be reexpressed to a real-valued vector = ( 1 ; :::; J ?1) by means of the transformation

1 = log 1 ; j = log( j ? j?1 ) ; 2  j  J ? 1:

(11)

One can then assign a multivariate prior distribution with mean 0 and covariance matrix A?0 1 . In the sequential ordinal case, the unknown parameter vector is 0 = ( 1; :::; J ?1;  ). In this case, the cutpoints 1; :::; J ?1 are real-valued. So it is convenient to assign a N( 0; B0) prior to . In the following, we focus on a choice of prior in the sequential case | similar ideas will apply to the cumulative ordinal case. The assignment of the hyperparameters values f 0; B0) is generally dicult because and are nonlinearly related to the ordinal probabilities and little information typically exists about the location of the cutpoints. One method for assigning a proper di use prior relies on the notion of a training sample (Lempers, 1971, chapter 5). Suppose that a considerable amount of sample data is available. Then, one can use a small part of the data to assess the values of the multivariate normal hyperparameters and the remainder of the data in model comparison. Subdivide the data f(xi ; Yi)g at random into two parts f(x(i t); Yi(t))g, and f(x(i e); Yi(e) )g. First t the ordinal model to the training sample f(x(i t); Yi(t) )g using a at noninformative prior on . From this t, posterior means and variances of the parameters 17

are computed; these values are used as estimates of the multivariate normal hyperparameters of the prior. Then the model is t a second time to the remainder of the dataset f(x(ie); Yi(e))g using this informative prior. The use of training sample will be illustrated in Section 6 in comparing a variety of discrete time and continuous time survival models.

5.3 Marginal likelihood of sequential models In the basic sequential model the parameter vector is  = (; ) = and the posterior ordinate  ( jy ) can be obtained from the representation

Z

( jy) = ( jy; z) (zjy) dz ; where  ( jy; z ) is the multivariate normal density that appears in Algorithm 1. Let fz (g)g P P be draws from the MCMC run and ^(g) = (B0 + ni=1 Xi0Xi )?1(B0 0 + ni=1 Xi0zi(g) ). Then a consistent estimate of the posterior ordinate is given by

^( jy) = G?1

G X g=1

k ( j ^(g); B ?1 )

where k () denotes the k-variate multivariate normal density function. The marginal likelihood of the sequential ordinal model on the log scale is now immediately available as

1 0 G X k ( j ^ g ; B ? )A ; ln m(y ) = ln f (y j  ) + ln k ( j ; B ? ) ? ln @G? 0

0

1

( )

1

1

g=1

where f (y j  ) is the density of the data in (4) evaluated at  . The numerical standard error of this estimate can be found by the procedure discussed in Chib (1995).

5.4 Marginal likelihood of the cumulative model A similar calculation is available for the cumulative ordinal model. In this model  = ( ; ) and from a marginal/conditional decomposition one can write

( ;  jy) = ( jy)( jy;  ) ;

R

where the marginal ordinate  ( jy ) =  ( jy; z )  (z jy ) dz is estimated as G X  ? 1 k (  j^(g); D?1) ^( jy) = G g=1

18

P

P

P

where ^(g) = (D0 + ni=1 xi xi )?1 (D00 + ni=1 xi zi(g) ), D = D0 + ni=1 xi xi , and fz (g)g are draws from the MCMC run in Algorithm 3. The conditional ordinate  (  jy;  ) is estimated by using Step 1 of Algorithm 3 to draw a sample of from the distribution jy;  , after xing  at  . Given this sample, the conditional ordinate is estimated by the kernel method. Under a Gaussian kernel the estimate is given by

^ (  jy;  ) = G?1

G JY ? X

1

g=1 j =1

( j j (jg); 2j )

where j is the bandwidth [Scott (1992)]. Putting all the terms together, the marginal likelihood of the cumulative ordinal model on the log scale is given by J X ? X  ln ( j ? x0i  ) ? ( j? ? x0i  ) + ln  ( ) + ln  (  ) 1

j =0 i:yi =j

G JY G ?1 X X  (g ) ? 1 ? 1 ? 1 ^ k ( j ; D ) ? ln G ? ln G ( j j (jg); 2j ) g=1 j =1 g=1 where the cut-points j are expressed in terms of the 's.

6 Example

6.1 Introduction and preliminary analysis To illustrate the use of sequential modeling for discrete-time survival data, we consider data on the length of hospital stay for 1000 patients undergoing heart surgery. The ordinal response variable Yi is the number of days of hospital stay. A relatively small number of patients stay longer than 12 days, and so for convenience we de ne Yi = 12 for patients who stay 12 days or longer. Associated with the response variable are eight covariates de ned in Table 1 which might in uence the length of stay. In this particular example, censoring occurs if the patient dies in the hospital | one does not know the length of the patient's stay if he or she had lived. As a preliminary look at this data, Figure 1 plots values of the observed hazard function

Pr(Yi = j jYi  j ): 19

Covariate AGE GENDER RACE PRIVINS INDIG COMORB CAMG PTCA

Description in years coded 1 if male, 0 if female coded 1 if white and 0 otherwise coded 1 if primary payer is private insurance coded 1 if this is a charity care by the hospital number of patient comorbidities coded 1 if the patient had a coronary artery bypass graft coded 1 if the patient had an angioplasty

Table 1: Description of eight covariates for the hospital stay example. against the day number j (1  j  12) for subgroups of the sample formed by di erent values of each of the eight covariates. For the binary covariates, values of the observed hazard for the subgroup coded 1 are plotted using the symbol 'o' and the hazard for the subgroup coded 0 are displayed using the symbol 'x'. For this graph, the AGE variable was categorized into two groups (1 = young, 0 = old) and the 'o' ('x') symbol corresponds to the young (old) patients. Likewise, the COMORB variable was divided into two levels (0-2 = low, 3 or more = high) and the 'o' in the plot corresponds to the high value of COMORB. Some general features of the data are evident from Figure 1. Here the hazard can be interpreted as the probability of going home on a particular day, given that the patient is still in the hospital. The hazard generally increases from a value of .1 to a value of .4 about day 7 and then slowly decreases. The two covariates COMORB and CAMG appear to in uence the hazard { for each variable, the hazard function is signi cantly higher for patients where the variable is absent (covariate value equal to 0) relative to patients where the variable is present (value of 1). This means that patients with these characteristics will have longer hospital stays. The patterns for the remaining covariates are much less clear and it is hard to tell from Figure 1 which variables may be statistically signi cant.

6.2 Model tting and comparison This preliminary analysis motivates the consideration of a series of ordinal response regression models. All of the models t will assume the probit link where F = . For each 20

WHITE

0.2 0

0.4

5 INDIG

0

0.4

5 CAMG

0.2 0

0

5 DAY NUMBER

5

10

5

10

5

10

PRIVI

0.2 0

0.4

COMOR

0.2

0

0.4

PTCA

0.2 0

10

0

0.4

0

10

MALE

0.2

0

10

0.2 0

0.4

0

10

HAZARD

5

HAZARD

HAZARD

0

0.4

0

HAZARD

HAZARD

0.2 0

HAZARD

AGE

HAZARD

HAZARD

0.4

0

5 DAY NUMBER

10

Figure 1: Empirical hazard function Pr(Yi = j jYi  j ) plotted for subgroups of the sample formed by di erent values of the eight covariates. In each graph, the symbol 'x' corresponds to the hazard for the covariate value of 0 and a symbol corresponds to the hazard for a positive value of the covariate. ordinal model that is t, we use an independent training sample of size 200 to estimate the hyperparameters of a multivariate normal prior on the unknown regression parameters and cutpoints. Since a proper prior is available in this example, we can compute the marginal likelihood for each model. We use these marginal likelihoods as a device for searching for parsimonious models that provide good ts to the data. We initially consider the following four models: 1. M1 : The basic cumulative ordinal response model (7). 2. M2 : The basic sequential model (2). As there are J = 12 possible values of the ordinal response, this model includes J ? 1 = 11 distinct parameters f j g to model the baseline hazard function, and a vector  of eight parameters to model the e ects of the covariates which will in uence the baseline hazard in a global fashion. 21

Covariates Complete Reduced M1 -2172.8 -2169.9 M2 -2127.2 -2125.4 M3 -2145.0 -2138.1 M4 -2121.2 -2117.4 M5 -2230.4 -2213.3 M6 -2255.6 -2244.0 Table 2: Log marginal likelihoods of alternative models: M1 is the basic ordinal model; M2 is the basic sequential model; M3 is the sequential model with COMORB as category interaction variable; M4 is the sequential model with j restricted to a polynomial; M5 is the Weibull model; and M6 is the log logistic model. 3. M3 : The sequential model (10) using COMORB as a category-speci c covariate. This model states that the e ect of the variable COMORB di ers across time. In the basic sequential model only one parameter is used to model the e ect of COMORB | here J ? 1 = 11 parameters are used to model the e ect of COMORB for each day. 4. M4 : The sequential model where the baseline hazard function, as modeled by the cutpoints f j g, are restricted to fall on a quadratic polynomial with unknown coecients. Table 2 (the \complete" column) gives values of the logarithm of the marginal likelihood, ln m(y jM), for each of the four models M1 ? M4. These marginal likelihoods can be used to compute Bayes factors to compare models. For example, the Bayes factor in support of the basic sequential model M2 compared to the basic ordinal model M1 is given by (y jM2) = expf?2127:2 + 2172:8g = 6:4  1019: BF21 = m m(yjM ) 1

Thus there is strong support for the sequential model for this dataset. Comparing all four models, the largest value of the log marginal likelihood ( ?2121:2) is attained at model M4, which implies that there is support for restricting the baseline hazard function to a quadratic form. There is little evidence for the inclusion of COMORB as a category interaction variable | the value of the log marginal likelihood for model M3 is signi cantly higher than the value for the basic sequential model M2. 22

Covariates Complete Reduced Covariate Mean (Std. Dev.) Mean (Std. Dev.) AGE .003 (.003) * GENDER -.011 (.050) * RACE -.088 (.049) -.091 (.049) PRIVINS -.119 (.047) -.133 (.046) INDIG .092 (.101) * COMORB .121 (.009) .121 (.009) CABG .772 (.074) .768 (.076) PTCA .110 (.049) .110 (.050) Table 3: Posterior means and standard deviations of covariates of model M4 (complete and reduced forms) for the hospital stay example. The posterior means and standard deviations for all eight covariates (the \complete" model) for model M4 are displayed in Table 3. By looking at the sizes of the posterior means relative to the sizes of the standard deviations, it appears that variables AGE, GENDER, and INDIG are insigni cant and can be removed from the model. All of the models were re t using the reduced set of covariates; the log marginal likelihoods for these models are listed under the \reduced" column of Table 2. Note that, for models M1 ?M4, the reduced form with the inclusion of ve covariates has a larger marginal likelihood than the complete form which includes all eight covariates. Based on this analysis, the best model appears to be M4 with the ve covariates RACE, PRIVINS, COMORB, CABG, and PTCA. An alternative, perhaps simpler approach to modeling this data is to assume that the response Yi is observed in real-time, and then apply a continuous time survival model. This approach was used using Weibull and log logistic error distributions. As in the previous ordinal analyses, a prior was estimated using the training sample of 200 observations, and the model was t using the new sample of 1000. Table 2 gives values of the log marginal likelihood for these two error distributions using the complete and reduced set of covariates. Note that the values of the marginal likelihoods are substantially smaller than that for M5 indicating that the sequential model provides a better t to this data. To interpret the model M1 , Table 3 gives the posterior means and standard deviations of the ve covariates. The coecients for RACE and PRIVINS are negative, which indicate 23

that the whites and patients with private insurance had shorter hospital stays. In contrast, the COMBORB, CABG and PTCA variables have positive coecients, indicating that these conditions cause extended hospital stays. The tted hazard functions are displayed in Figure 2. Consider a patient who is nonwhite (WHITE = 0), does not hold private insurance (PRIVINS = 0), had zero comorbidities (COMORB = 0), did not have a coronary artery bypass graft (CABG = 0), and did not have an angioplasty (PTCA = 0). The posterior mean of the hazard function E(F ( k ? x0  )) is graphed against the day number k and displayed as a solid line in each of the graphs in Figure 2. The dotted line in each graph represents the posterior mean of the hazard function when the corresponding covariate value for this patient is changed. So, for example, the dotted line in the upper left gure corresponds to a patient where PRIVINS = COMORB = CAMG = PTCA = 0 and WHITE = 1. These graphs clearly show how the baseline hazard changes in a global fashion as a function of the ve signi cant covariates.

6.3 Hierarchical modeling In the previous section, a training sample was used to construct proper priors for the regression and cuto parameters, and models were compared by means of Bayes factors. An alternative approach to model checking (that does not require any training sample) is to construct a hierarchical prior distribution that includes several of the models, say MA and MB , as special cases. The hyperparameters at the last stage of the hierarchical model are informative about the relative suitability of the two models. The posterior distribution of these hyperparameters is informative about the goodness of t. To illustrate the application of hierarchical modeling in this example, suppose that we wish to contrast the basic sequential model M2 with the sequential model M4 where the baseline hazard is restricted to lie on a quadratic curve. Consider the hierarchical form of the sequential model, where at the rst stage, the regression vector  is assigned a at prior, and the cutpoints f j g are independent normal with means f0 + 1 j + 2 j 2g and common variance  2 . At the last stage of the prior, the vector (0; 1; 2) is assigned a uniform prior, and  2 is assigned the inverse gamma (c; d) prior proportional to ( 2 )?a?1 expfb= 2g. In 24

1 WHITE

HAZARD

HAZARD

1

0.5

0

0

5

0

5

10

1 2 COM

HAZARD

HAZARD

0.5

0

10

1

0.5

0

PRIVI

0

5

CAMG 0.5

0

10

0

5 DAY NUMBER

10

HAZARD

1 PTCA 0.5

0

0

5 DAY NUMBER

10

Figure 2: Posterior means of the hazard function for di erent covariate values. The solid line in each graph represents the baseline hazard for an individual where all covariates are equal to zero and the dotted line represents the hazard for an individual where the particular covariate has been changed to 1 (2 for the COMORB covariate). this example, the inverse gamma hyperparameters are chosen to be the values c = d = 1, re ecting little knowledge about the value of this prior variance. Note that, as  2 approaches zero, this hierarchical sequential model converges to the model M4 which places the restriction on the cutpoints, and if  2 is large, the prior has minimal impact and the model reduces to the basic sequential model M2. So this hierarchical model can be viewed as a generalization of models M2 and M4, where  2 re ects the degree of belief of the two models. This model was t using the Gibbs sampling/data augmentation algorithm outlined in Section 4. Table 3 displays posterior means of the cutpoint parameters f j g under the basic sequential model M2 , the sequential model M4 , which places the restriction on the cutpoints, and the hierarchical model. From inspecting the sets of posterior means, one observes that the hierarchical model posterior means generally fall between the posterior 25

means for models M2 and M4 . The hierarchical posterior means represent a smoothed version of the cutpoints under the basic sequential model. Cutpoint M2

1 -.765

2 -.574

3 -.356

4 -.074

5 .074

6 .236

7 .442

8 .380

9 .269

10 .225

11 .451

M

4

-.815 -.538 -.297 -.092 .077 .210 .307 .368 .393 .382 .335

Hierarchical -.804 -.580 -.355 -.123 .034 .206 .400 .317 .288 .225 .441

Table 4: Posterior means of the cutpoint parameters f k g under the basic sequential model M , the restricted model model M , and the hiearchical model for the hospital stay 2

example.

4

7 Remarks and conclusions In this article, the sequential model has been analyzed in relation to other models for ordinal data. The sequential model is also applicable in the analysis of discrete time (grouped) survival data. For such problems, the model was extended to include covariates which a ect the hazard rate in a general or category-speci c manner and hierarchical versions of the model were speci ed and t. There are several attractive aspects to the Bayesian approach that is discussed in this paper. First, the latent data representation of the model leads to simple tting procedures based on Markov chain Monte Carlo methods. These methods are straightforward to code and, although this was not discussed in the paper, produced MCMC output that mixed remarkably well. Second, the latent data structure can be generalized in a straightforward manner by the use of mixtures | this approach was illustrated in the case of hierarchical modeling in Section 3.2. Third, the Bayes marginal likelihood approach provides an e ective mechanism for evaluating the signi cance of covariates, and comparing non-nested models, such as the cumulative ordinal, the sequential ordinal, and continuous survival models with Weibull and log logistic error structures. 26

For future research, there are two issues that can be addressed to facilitate the use of sequential models in practice. In the event that training samples are unavailable, a suitably vague proper prior needs to be constructed. It is dicult in practice to assign multivariate normal priors to the vectors of cutpoints and regression parameters. An attractive alternative method of specifying priors places distributions directly on the multinomial probabilities of interest, such as the conditional means priors discussed by Bedrick et al (1996) in the context of certain generalized linear models. Also, there needs to be further work on the development of appropriate diagnostics, Bayesian or classical, for these classes of models. From a frequentist perspective, it can be dicult to interpret the size of a residual, due to the discreteness of the response variable. As demonstrated by Albert and Chib (1995) in the binary regression case, the latent data can be used to de ne a continuous-valued Bayesian residual. This residual has a known prior distribution, and the inspection of the set of posterior distributions is useful in outlier detection. The use of latent data may also be useful for residual analysis in the sequential model.

References Albert, J. and Chib, S. (1993), \Bayesian analysis of binary and polychotomous response

data," Journal of the American Statistical Association, 88, 669{679. Albert, J. and Chib, S. (1995), \Bayesian Residual Analysis for Binary Response Regression Models," Biometrika, 82, 747-759. Bedrick, E. J., Christensen, R., and Johnson, W. (1996), \A New Perspective on Pri-

ors for Generalized Linear Models," Journal of the American Statistical Association, 91, 1450-1460. Chib, S. (1995), \Marginal Likelihood From the Gibbs Output," Journal of the American Statistical Association, 90, 1313-1321. Chib, S. and Greenberg, E (1995), \Understanding the Metropolis-Hastings algorithm," American Statistician, 49, 327{335. Cowles, M. K. (1996), \Accelerating Monte Carlo Markov Chain Convergence for CumulativeLink Generalized Linear Models," Statistics and Computing, 6, 101-111. Efron, B (1988), \Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve," Journal of the American Statistical Association, 83, 414-425. 27

Farhmeir, L. and Tutz, G. (1994), Multivariate Statistical Modelling Based on General-

ized Linear Models, Springer Verlag: New York.

Gelfand, A. E. and Smith, A.F.M. (1990), \Sampling-Based Approaches to Calculating

Marginal Densities," Journal of the American Statistical Association, 85, 398-409. Good, I, J. (1967), \A Bayesian Signi cance Test for Multinomial Distributions," Journal of the Royal Statistical Society, B, 29, 399-431. Kass, R. E. and Raftery, A. E. (1995), \Bayes Factors," Journal of the American Statistical Association, 90, 773-795. Lempers, F. B. (1971). Posterior Probabilities of Alternative Linear Models. Rotterdam University Press. Mantel, N and Hankey, B.F. (1978), \A Logistic Regression Analysis of Response Time Data where the Hazard Function is Time Dependent," Communications in Statistics, A, 7, 333-347. McCullagh, P. (1980), \Regression Models for Ordinal Data," Journal of the Royal Statistical Society, B 42, 109-127. McCullagh, P. and Nelder, J.A. (1989), Generalized Linear Models, Chapman and Hall, New York. Scott, D. W. (1986), Multivariate Density Estimation, New York: John Wiley and Sons. Tutz, G. (1990), \Sequential Item Response Models with an Ordered Response," British Journal of Mathematical and Statistical Psychology, 43, 39-55. Tutz, G. (1991), \Sequential Models in Categorical Regression," Computational Statistics and Data Analysis, 11, 275-295.

28

Suggest Documents