Covariate Functional Form in Cox Models

Luke Keele
2140 Derby Hall
150 North Oval Mall
Ohio State University
Columbus, OH 43210

October 19, 2005

Abstract

In most event history models, the effect of a covariate on the hazard is assumed to have a log-linear functional form. For continuous covariates, this assumption is often violated because the effect is highly nonlinear. Assuming a log-linear functional form when a nonlinear form applies is a specification error that leads to erroneous statistical conclusions. Instead of ignoring nonlinear effects, scholars can test for such nonlinearity and incorporate it into the model. I review methods to test for and model nonlinear functional forms for covariates in the Cox model. Testing for such nonlinear effects is important because nonlinearity can appear as nonproportional hazards, yet time-varying terms will not correct the misspecification. I investigate the consequences of nonlinear functional forms using data on international conflicts from 1950-1985. I demonstrate that the conclusions drawn from these data depend on fitting the correct functional form for the covariates.

1 Introduction

Political scientists who are interested in estimating event history, or time-to-event, models have a variety of options available to them. One popular option is to use a parametric functional form, such as the Weibull distribution, to model the hazard, that is, the rate of transition from one state to another. Alternatively, analysts can fit a logit model with a term for time dependence in the data. While these are reasonable options, all of them require the analyst to make arbitrary decisions about the form of the hazard or of the time dependence. Each one imposes restrictions on the probability of time to event. There is an alternative to these more restrictive models: the semiparametric Cox model. The Cox model does not assume a functional form for the hazard but, instead, allows it to be estimated from the data. The flexibility of the Cox model has led to its complete dominance of the biostatistics literature and its increasing popularity in political science. In fact, the authoritative text on event history in political science recommends that analysts generally fit Cox models (Box-Steffensmeier and Jones 2004).

The widespread usage of Cox models has also spawned numerous diagnostics. Tests for nonproportional hazards, competing risks, and misspecification are all commonly used. Despite the large amount of information available on the Cox model, political scientists are neglecting one important diagnostic. Cox models with continuous covariates are prone to misspecification when the wrong functional form is fitted for those covariates. Normally, it is assumed that covariates enter the Cox partial log-likelihood in a loglinear fashion, but continuous covariates can affect the hazard through more complicated nonlinear functional forms. Failure to address such nonlinearity can radically alter the conclusions that analysts draw from model estimates.

The plan here is to give political scientists the tools to address misspecification of the Cox model due to incorrect functional forms for covariates. First, I discuss how incorrect functional forms are a form of misspecification that biases estimates in the Cox model and reduces the power of statistical tests. Next, I outline strategies for diagnosing incorrect functional forms, as well as a means of introducing complex functional forms into the estimation of the Cox model itself. One point of emphasis will be on finding a solution that imposes the fewest assumptions on the covariate functional form. Finally, I use data on international disputes to demonstrate how the wrong covariate functional form can affect the conclusions we draw about empirical processes.

2 Covariate Functional Form

Nonlinear covariate functional forms are not limited to event history models; almost any parametric model can have such nonlinearities. For example, a linear statistical model for a continuous outcome variable, $y_i$, generally takes the following familiar form:

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i \tag{1}$$

The model in Equation 1 is linear in both its overall functional form and its parameters. We can easily accommodate nonlinearity in a covariate by including quadratic or cubic terms on the right-hand side of the model. Of course, quadratic and cubic terms are fairly restrictive functional forms for the nonlinearity. Instead, we can write the model in the following way:

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 f(z_i) + \varepsilon_i \tag{2}$$

Here, $f(z_i)$ is some function $f$ that we assume only to be smooth in $z_i$. A statistical model as in Equation 2 can be estimated with semiparametric techniques (Hastie and Tibshirani 1990; Beck and Jackman 1998). The point here is that the functional form for a covariate in a linear model need not be linear. This nonlinearity in the covariate functional form extends to event history models, where, I will argue, it is much more common.

In the event history framework, the outcome variable $y_i$ is the duration until the occurrence of some event or “failure.” The hazard rate, the instantaneous rate of failure at time $t$ conditional on survival up to $t$, is written as:

$$h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \tag{3}$$
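The limit in Equation 3 can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes a Weibull duration distribution (one of the parametric options mentioned in the introduction) with hypothetical parameter values `lam` and `p`, and compares the finite-interval conditional probability per unit of time with the closed-form Weibull hazard. The function names are introduced here for illustration.

```python
import math

def weibull_survival(t, lam, p):
    """S(t) = exp(-(lam * t) ** p), the probability of surviving past t."""
    return math.exp(-((lam * t) ** p))

def weibull_hazard(t, lam, p):
    """Closed-form Weibull hazard: lam * p * (lam * t) ** (p - 1)."""
    return lam * p * (lam * t) ** (p - 1)

def approx_hazard(t, dt, lam, p):
    """Pr(t <= T < t + dt | T >= t) / dt for a small but finite dt."""
    s_t = weibull_survival(t, lam, p)
    s_next = weibull_survival(t + dt, lam, p)
    return (s_t - s_next) / s_t / dt

# Hypothetical parameter values, chosen only for illustration.
lam, p, t = 0.1, 1.5, 5.0
for dt in (1.0, 0.1, 0.001):
    # As dt shrinks, the approximation approaches the closed-form hazard.
    print(dt, approx_hazard(t, dt, lam, p), weibull_hazard(t, lam, p))
```

As the interval shrinks, the first printed value should move toward the second, which is what the limit in Equation 3 expresses.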

Applied analysts are interested in how a matrix of $k$ covariates, $X_i$, shifts the hazard up or down. We estimate a vector of coefficients $\beta$, which can be transformed to represent the marginal effect of $x_i$ on the hazard rate. There are a variety of ways to parameterize the hazard rate in order to estimate $\beta$. The Cox (1972) proportional hazards model is one widely used statistical model for assessing how a set of covariates affects the hazard rate. Much of the popularity of the Cox model stems from the fact that we need not assume a specific probability distribution for the hazard rate. Box-Steffensmeier and Jones (2004) recommend that political scientists most often use the Cox model, since political theories are rarely specific enough to allow an a priori choice of the probability distribution for the process in question. The Cox proportional hazards model parameterizes the hazard rate, $h(t)$, in the following way:

$$h(t \mid X_i) = h_0(t)\, e^{\beta X_i} \tag{4}$$

In the Cox model, $h_0(t)$ is an unspecified nonnegative function of time called the baseline hazard. $X_i$ denotes the covariates for subject $i$, and one or more of the covariates may vary over time. In the Cox framework, we assume a loglinear functional form for the covariates.[1] However, covariates in the Cox model can have nonlinear functional forms, just as in the linear model from Equation 2. Adding a nonlinear functional form for a covariate in the Cox framework requires rewriting the model in the following way:

$$h(t \mid X_i) = h_0(t)\, e^{\beta_1 x_i + \beta_2 f(z_i)} \tag{5}$$

Now the effect of $z_i$ no longer follows the standard loglinear form but instead shifts the baseline hazard through some smooth function $f$.

Applied analysts rarely pay any attention to the functional form for covariates in the linear model, so why should it be of any more concern with event history models? The answer is that the nature of event history data makes nonlinear functional forms far more likely than in standard linear models, and the consequences are more serious. Recall that the outcome variable in an event history model is time until an event. Often, for a continuous covariate, the event is far more likely over a small range of that covariate than over some other, much larger range of the same variable. A common example of such variation is a threshold effect. Here, the effect of a covariate on the hazard is minimal over a large range, and then the risk rises dramatically once a threshold is met. Such threshold effects are common in biological data, where a drug therapy is ineffective until the dosage reaches the right level (Therneau and Grambsch 2000). We can expect to find such nonlinear effects any time we suspect that the effect of a covariate differs across the values of that covariate itself.

There are a variety of political contexts where we might expect to find such nonlinear effects. One example might occur in the study of the duration of conflicts as a function of casualties. The risk of conflict termination might start out quite high for low numbers of casualties, but once a threshold is met, the risk of conflict termination may drop, or the risk might fall and then rise again. We might also expect nonlinear functional forms in models of Congressional position taking. Box-Steffensmeier, Arnold and Zorn (1997) find that campaign contributions affect the amount of time before a Member of Congress takes a position on the North American Free Trade Agreement. The effect of campaign contributions is probably minimal for small amounts of money but then increases before reaching a threshold where additional contributions have little effect. If this is true, then the loglinear functional form for this covariate is incorrect. In many political science contexts, the a priori assumption of loglinear effects for continuous covariates is not plausible.[2]

[1] This loglinear functional form for the covariates is not unique to the Cox model; it also applies to models that have a parametric structure for the baseline hazard, such as the Weibull model.

Moreover, ignoring the functional form of covariates in the context of event history models has serious consequences. First, if we can estimate the true functional form for a covariate, it can change the interpretation of that effect. Assuming that campaign contributions have a linear effect on position taking is quite different from understanding that the risk of position taking is negligible until a threshold is crossed. But there are statistical concerns beyond those of substance. Fitting the incorrect functional form to a covariate is a form of misspecification and leads to the same statistical errors: bias and decreased power of tests for statistical significance (Lagakos and Schoenfeld 1984; Struthers and Kalbfleisch 1986; Therneau, Grambsch and Fleming 1990; Anderson and Fleming 1995). LeBlanc and Crowley (1999) demonstrate that the average model error for a Cox model with the incorrect covariate functional form is three times higher than that of a model that accounts for the nonlinearity. Also, as with any GLM, we can expect the misspecification to affect the estimates for other variables, even if those variables are not correlated with the misspecified covariate. In general, the effects of incorrect functional form are similar to those of nonproportional hazards, which is itself a form of misspecification. In fact, it is critical to understand the relationship between nonproportional hazards and incorrect functional form before proceeding to diagnostics and solutions.
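The threshold effect described in this section can be made concrete with a small numerical sketch. Everything in it is hypothetical: the logistic shape of `threshold_f`, the cutoff of 10, and both coefficient values are assumptions chosen so that the two specifications agree at the ends of the covariate range and differ in between.

```python
import math

def threshold_f(z, cutoff=10.0, steepness=2.0):
    """Hypothetical smooth threshold: near 0 below the cutoff, near 1 above it."""
    return 1.0 / (1.0 + math.exp(-steepness * (z - cutoff)))

def relative_hazard_threshold(z, beta=1.5):
    """Hazard multiplier exp(beta * f(z)) from Equation 5, holding x fixed."""
    return math.exp(beta * threshold_f(z))

def relative_hazard_loglinear(z, beta=0.075):
    """Hazard multiplier exp(beta * z) under the usual loglinear assumption."""
    return math.exp(beta * z)

# Both multipliers are about 1 at z = 0 and about exp(1.5) at z = 20,
# but the loglinear form spreads the change over the whole range.
for z in (0, 5, 9, 11, 15, 20):
    print(z,
          round(relative_hazard_threshold(z), 3),
          round(relative_hazard_loglinear(z), 3))
```

In this toy setting, the loglinear form overstates the effect below the cutoff and understates it just above, which is the kind of misfit the diagnostics in the following sections are meant to detect.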

2.1 Nonproportional Hazards and Covariate Functional Form

A key assumption for many types of survival models, including the Cox model, is that the hazards for different subjects are proportional to one another and that this proportionality is maintained over time. Box-Steffensmeier and Zorn (2001) review how the assumption of proportional hazards may be violated in political science contexts, and they outline strategies for diagnosing and correcting nonproportional hazards. As they argue, correcting for nonproportional hazards is critical, since nonproportionality can bias estimates and reduce the power of statistical significance tests, much like an incorrect functional form.

[2] Therneau and Grambsch (2000) note that nonlinear functional forms are more likely to be found in data with time-varying covariates.

It is critical to note, however, that signs of nonproportional hazards can be caused by model failures other than a time-varying effect. Incorrect functional form for a covariate is one such failure (Therneau and Grambsch 2000). Diagnostics for nonproportional hazards, such as Schoenfeld residual plots, will suggest nonproportionality when in fact the problem is the incorrect functional form for a covariate (Therneau and Grambsch 2000). But correcting for nonproportional hazards with log-time interactions when the model failure is due to incorrect functional form will not rectify the misspecification. The relationship between nonproportional hazards and the correct functional form for covariates suggests a sequence in which analysts ought to perform diagnostics for the Cox model: the correct functional forms for covariates need to be found and fitted before testing for nonproportional hazards. Proceeding in this order ensures that signs of nonproportional hazards are not due to nonlinear covariate functional forms. Next, I turn to techniques for diagnosing and estimating flexible functional forms for continuous covariates in Cox models.
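The proportionality assumption itself can be illustrated with a short sketch of Equation 4 for a single covariate. The two baseline hazard functions and the coefficient value below are hypothetical, introduced only to show that the ratio of hazards for two covariate values does not depend on time or on the shape of the baseline.

```python
import math

def cox_hazard(t, x, beta, baseline):
    """Equation 4 for a single covariate: h(t | x) = h0(t) * exp(beta * x)."""
    return baseline(t) * math.exp(beta * x)

def rising_baseline(t):
    """Hypothetical baseline hazard that increases with time."""
    return 0.02 * math.sqrt(t)

def falling_baseline(t):
    """Hypothetical baseline hazard that decreases with time."""
    return 0.5 / (1.0 + t)

beta = -0.33  # illustrative value only
for name, h0 in (("rising", rising_baseline), ("falling", falling_baseline)):
    # Ratio of hazards for x = 1 versus x = 0 at several points in time.
    ratios = [cox_hazard(t, 1.0, beta, h0) / cox_hazard(t, 0.0, beta, h0)
              for t in (1.0, 5.0, 10.0, 25.0)]
    print(name, [round(r, 4) for r in ratios])
```

Each printed ratio equals exp(beta) whatever the baseline. A time-varying coefficient would break this constancy, and the argument of this section is that a misspecified functional form can produce the same symptom.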

3 Diagnostics and Estimation

For a continuous covariate in a Cox model, we first need to diagnose the functional form. Then, should we identify the functional form as something other than loglinear, we need a method for incorporating this nonlinearity into the Cox model itself. Moreover, we want any nonlinear estimates to come with measures of statistical precision. I start with techniques for basic diagnosis.


3.1 Visual Methods of Diagnosis

The basic diagnostics for uncovering the functional form of a covariate in a Cox model are similar to those for diagnosing nonproportional hazards. The first step is to examine residual plots for the model in question to try to discern nonlinear patterns in the effect of each continuous covariate. But defining residuals in the context of event history models is less straightforward than in the context of linear models. Residuals for the Cox model rely on counting process theory and martingales. Box-Steffensmeier and Jones (2004) give a basic introduction to the concepts behind martingale residuals for Cox models, and Therneau and Grambsch (2000) provide more advanced coverage. Software packages with Cox model routines typically compute martingale residuals.

Therneau, Grambsch and Fleming (1990) suggest plotting the martingale residuals from a null Cox model, one without any predictors, against any continuous covariates suspected of nonlinearity.[3] To discern the functional form of the covariate’s effect, a scatterplot smoother can be applied to the plot. Here, the analyst can use all the standard techniques for scatterplot smoothing. Under certain assumptions, the smoothed scatterplot will display $f$, the functional form of the covariate’s effect on the hazard rate. These plots are similar to the variable plots used with more familiar linear models.

These plots have one significant fault: if the variables included in the model are correlated, the plots can be misleading. Therneau and Grambsch (2000) demonstrate that when correlations are present in the data, the plot may show a linear relationship when the true relationship is nonlinear; that is, nonlinear relationships will often appear linear. Given that we expect the variables in most political science specifications to be correlated, this is a serious flaw. A better strategy is to test for nonlinearity within the estimation of the Cox model itself.
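The residual plot described in this subsection can be sketched in a few lines. The code relies on the fact that, with no predictors, the fitted cumulative hazard of a Cox model reduces to the Nelson-Aalen estimate, so each martingale residual is the event indicator minus that estimate at the subject's observed time. The toy data, the function names, and the running-mean smoother (a crude stand-in for a proper scatterplot smoother) are all assumptions made for illustration.

```python
def null_martingale_residuals(times, events):
    """Martingale residuals from a Cox model with no covariates.

    With no predictors, the fitted cumulative hazard is the Nelson-Aalen
    estimate, so each residual is event_i minus that estimate at time_i.
    """
    distinct = sorted({t for t, d in zip(times, events) if d == 1})
    cum_hazard = {}
    running = 0.0
    for t in distinct:
        at_risk = sum(1 for s in times if s >= t)
        failures = sum(1 for s, d in zip(times, events) if s == t and d == 1)
        running += failures / at_risk
        cum_hazard[t] = running

    def H(t):
        # The cumulative hazard is a step function: use the last event time <= t.
        prior = [s for s in distinct if s <= t]
        return cum_hazard[prior[-1]] if prior else 0.0

    return [d - H(t) for t, d in zip(times, events)]


def running_mean(x, y, window=3):
    """Crude scatterplot smoother: average y over neighbours ordered by x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = window // 2
    smoothed = []
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - half), min(len(order), pos + half + 1)
        smoothed.append((x[i], sum(y[j] for j in order[lo:hi]) / (hi - lo)))
    return smoothed


# Hypothetical toy data: durations, event indicators (1 = event), covariate.
times = [2, 3, 3, 5, 7, 8, 11, 12]
events = [1, 1, 0, 1, 0, 1, 1, 0]
covariate = [0.9, 0.7, 0.2, 0.8, 0.1, 0.6, 0.5, 0.3]

residuals = null_martingale_residuals(times, events)
for z, m in running_mean(covariate, residuals):
    print(round(z, 2), round(m, 3))
```

Plotting the smoothed pairs against the covariate gives the visual diagnostic. In practice one would use the martingale residuals reported by a Cox routine and a smoother such as lowess instead of this hand-rolled version.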

3.2 Spline Solutions

Instead of looking for patterns in the residuals, we can model the covariate functional form directly and use a Wald test to decide whether the nonlinear effect should remain in the specification. The simplest smooth functional forms to fit are polynomials. Here, we would simply place $x_i$, $x_i^2$, $x_i^3$, and so on, on the right-hand side of the model. But polynomials are a crude way to approximate what might be a more complex functional form. Moreover, the fit to the data with polynomials is global rather than local, which can obscure real patterns in the data (Hastie and Tibshirani 1990). Spline fits to the data are a better alternative.

[3] For time-dependent covariates, one often needs to collapse the martingale residuals by subject.

Splines are piecewise polynomial functions that are constrained to join at control points in the data. Typically, the piecewise polynomial functions are cubic and are made smooth at the control points, or knots, by forcing the first and second derivatives of the functions to agree at the knots. Spline curves have an easy, intuitive analogy: they are similar to interpolating a smooth line with a string between a set of nails on a board. Splines have several desirable properties. First, unlike with polynomials, the fit is local. Second, the fit beyond the last control point can be constrained to be linear.

One disadvantage of standard splines is that the analyst is left to choose the number of knots between which the smooth line is drawn. Choosing the placement and number of the knots can be arbitrary, and the choice of knot location can mask or reveal features in the data. A better choice is to use smoothing splines, where knot selection is automatic, based on a mean squared error criterion. Hastie and Tibshirani (1990) show that the fit of smoothing splines is preferable to that of splines where the choice of knots is left to the analyst. With smoothing splines, the user need only select the level of smoothness, which is done by selecting the degrees of freedom for each spline fit.[4] Typically, four degrees of freedom are sufficient, with further selection done via Akaike’s Information Criterion (AIC). For more on splines, see Beck and Jackman (1998). The use of smoothing splines also has an advantage for inference.
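The smoothness constraint on cubic splines can be illustrated with the truncated power basis, one standard way to build a piecewise cubic whose first and second derivatives agree at the knots. This is a sketch of an ordinary fixed-knot spline, not of the smoothing splines recommended here, and the knot locations and coefficients are hypothetical.

```python
def truncated_power_basis(x, knots):
    """Cubic spline basis: 1, x, x^2, x^3 and one (x - k)^3_+ term per knot."""
    return [1.0, x, x ** 2, x ** 3] + [max(0.0, x - k) ** 3 for k in knots]

def spline_value(x, coefs, knots):
    """Evaluate the spline as a weighted sum of the basis functions."""
    return sum(c * b for c, b in zip(coefs, truncated_power_basis(x, knots)))

def second_derivative(x, coefs, knots, h=1e-4):
    """Central finite-difference estimate of the second derivative."""
    f = lambda v: spline_value(v, coefs, knots)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

knots = [2.0, 5.0]                          # hypothetical knot locations
coefs = [0.0, 0.1, 0.0, 0.01, -0.05, 0.08]  # hypothetical coefficients

# The cubic changes at each knot, yet the second derivative just to the
# left and just to the right of the knot is nearly identical.
for k in knots:
    left = second_derivative(k - 1e-3, coefs, knots)
    right = second_derivative(k + 1e-3, coefs, knots)
    print(k, round(left, 3), round(right, 3))
```

Each knot adds one coefficient, which is why the number and placement of knots governs how flexible the fitted curve can be.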
Therneau and Grambsch (2000) have rewritten the Cox partial log-likelihood with a penalty for smoothing splines, which allows an analyst to perform a Wald test on the smoothing spline fit between the covariate and the hazard rate. The Wald test gives the analyst a standard p-value for the null hypothesis of linearity. This Wald test is the most reliable means of assessing whether the functional form for a covariate is nonlinear or can remain in its simpler loglinear form.

[4] Please see the appendix for more details.

In summary, testing the functional form for a covariate is an important diagnostic for the Cox model. Ignoring nonlinear effects in the Cox model will bias the coefficients of included covariates and reduce the power of tests for statistical significance. Moreover, failure to diagnose the correct functional form for a covariate can produce signs of nonproportional hazards, and applying a solution for nonproportional hazards will not cure the associated bias or inefficiency. While visual inspection of the martingale residuals can be useful for revealing the correct functional form, smoothing splines are the best option for diagnosing and fitting nonlinear functional forms for covariates. Next, I use an empirical example to illustrate the importance of fitting the correct functional form for covariates.
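The mechanics of a Wald test for nonlinearity can be sketched as follows. For simplicity, the sketch assumes that the nonlinear part of the spline is carried by exactly two coefficients, so the statistic is compared with a chi-square distribution with 2 degrees of freedom. Smoothing-spline fits generally have non-integer degrees of freedom, and the estimates and covariance matrix below are invented for illustration.

```python
import math

def wald_statistic_2d(b, V):
    """W = b' V^{-1} b for two coefficients and their 2x2 covariance matrix."""
    (a, c), (_, d) = V              # V is symmetric: [[a, c], [c, d]]
    det = a * d - c * c
    inv = [[d / det, -c / det], [-c / det, a / det]]
    return sum(b[i] * inv[i][j] * b[j] for i in range(2) for j in range(2))

def chi2_sf_2df(w):
    """Upper-tail probability of a chi-square with 2 degrees of freedom."""
    return math.exp(-w / 2.0)

# Hypothetical estimates for the two nonlinear spline terms of one covariate.
b_nonlinear = [0.40, -0.25]
V_nonlinear = [[0.020, 0.004], [0.004, 0.010]]

w = wald_statistic_2d(b_nonlinear, V_nonlinear)
print(round(w, 2), round(chi2_sf_2df(w), 4))
```

A small p-value rejects the null of linearity, in which case the spline term stays in the specification. Otherwise the covariate can revert to its loglinear form.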

4 Covariate Functional Form and International Disputes

To demonstrate the importance of testing for nonlinear covariate functional forms, I use data on international disputes. The data set is composed of 827 “politically relevant” dyads for the period from 1950 to 1985. Each observation is a dyad-year, for a total of 20,900 observations with an average of 25.4 years per dyad. The outcome variable is the time until the onset of a militarized interstate dispute between the two nations that make up the dyad (Oneal and Russett 1997; Beck, Katz and Tucker 1998; Reed 2000; Box-Steffensmeier and Zorn 2001).

In past work, seven different factors have been identified as important to the risk of a dispute. The seven factors are: (1) the level of democracy in the dyad, (2) economic growth, (3) the presence of an alliance between the two nations in the dyad, (4) geographical contiguity in the dyad, (5) the ratio of military capability between the two nations, (6) the level of intradyadic trade measured as a proportion of GDP, and (7) a counter for the number of previous disputes for 1950-1985. All the variables are operationalized as in Beck, Katz and Tucker (1998) and Box-Steffensmeier and Zorn (2001).[5] Contiguity and previous disputes should increase the risk of a conflict, while the rest of the measures should lower the hazard.

[5] Just as in Box-Steffensmeier and Zorn (2001), I omit dyad-years of continuing conflicts as a means of accounting for repeated events.

As Box-Steffensmeier and Zorn (2001) demonstrate, several of the covariates have time-varying effects that cause nonproportional hazards. I use this property of the data to show how testing for nonlinear functional forms and nonproportional hazards is an interactive process. Three of the covariates in the analysis (growth, capability, and trade) are continuous, so we might suspect them of having nonlinear effects on the hazard, and they should be tested for nonlinear functional forms. Since nonlinear functional forms can appear as nonproportional hazards, the first step is to diagnose the functional forms before moving on to tests for nonproportional hazards. So as a first step, I replicate the model of Beck, Katz and Tucker (1998) in Table 1.

Table 1: Cox Model of International Disputes, 1950-1985

Variable             Coefficient   Robust SE
Democracy            −0.33*        (0.11)
Economic Growth      −2.70*        (1.39)
Alliance             0.11          (0.12)
Contiguity           0.45*         (0.13)
Capability Ratio     −0.16*        (0.05)
Trade                11.49*        (5.35)
Previous Disputes    1.06*         (0.04)
N                    20,448
lnL                  −2094.03

Note: Cell entries are coefficient estimates. Robust standard errors in parentheses. * p < .01 (one-tailed)
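Because the covariates enter Equation 4 through an exponential, each coefficient in Table 1 can be read as a hazard ratio by exponentiating it. The sketch below does this for the values transcribed from the table. The dictionary and function names are introduced here for illustration, and the percentage change refers to a one-unit increase in the covariate with the others held fixed.

```python
import math

# Coefficients and robust standard errors transcribed from Table 1.
table1 = {
    "Democracy": (-0.33, 0.11),
    "Economic Growth": (-2.70, 1.39),
    "Alliance": (0.11, 0.12),
    "Contiguity": (0.45, 0.13),
    "Capability Ratio": (-0.16, 0.05),
    "Trade": (11.49, 5.35),
    "Previous Disputes": (1.06, 0.04),
}

def hazard_ratio(coef):
    """Multiplicative change in the hazard for a one-unit covariate increase."""
    return math.exp(coef)

for name, (coef, se) in table1.items():
    hr = hazard_ratio(coef)
    pct = 100.0 * (hr - 1.0)
    print(f"{name:18s} HR = {hr:12.3f}  change in hazard = {pct:+.1f}%")
```

For covariates measured on a small scale, such as trade as a proportion of GDP, a one-unit increase lies far outside the observed range, so the hazard ratio is better interpreted for a much smaller increment.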

With the exception of the alliance variable, all the variables in the model are statistically significant. Next, I conduct a Grambsch and Therneau nonproportionality test. Normally, one would wait to test for nonproportional hazards until after the functional form diagnosis is complete. Here, however, I want to demonstrate how the results from such tests can change after the functional form solution is implemented. We see from the results in Table 2 that all the variables except trade and the capability ratio have time-varying effects on the hazard. The next step is to test the three covariates for nonlinear functional forms. The first


Table 2: Grambsch and Therneau Nonproportionality Tests, International Disputes, 1950-1985

Variable             ρ         χ2        p-value
Democracy            −0.119    7.32      .007
Economic Growth      −0.251    45.02
Alliance             0.439     209.49
Contiguity           −0.251    39.23
Capability Ratio     0.051     2.40
Trade                0.031     0.21
Previous Disputes    −0.529    713.78
Global Test          –         767.76
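The per-covariate statistics in Table 2 can be converted to p-values under the assumption that each is referred to a chi-square distribution with 1 degree of freedom, the usual reference distribution for a single-covariate test of this kind. That assumption is mine, but it reproduces the one p-value shown for democracy. The sketch uses only the standard library, and the global test is left out because it has more than one degree of freedom.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Chi-square statistics transcribed from Table 2 (global test omitted).
table2_chi2 = {
    "Democracy": 7.32,
    "Economic Growth": 45.02,
    "Alliance": 209.49,
    "Contiguity": 39.23,
    "Capability Ratio": 2.40,
    "Trade": 0.21,
    "Previous Disputes": 713.78,
}

for name, stat in table2_chi2.items():
    print(f"{name:18s} chi2 = {stat:7.2f}  p = {chi2_sf_1df(stat):.3g}")
```

Under this assumption, only the capability ratio and trade fail to reject proportionality at conventional levels, which agrees with the reading of Table 2 given in the text.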
