STATISTICS IN MEDICINE Statist. Med. 2008; 27:1508–1526 Published online 3 August 2007 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/sim.3017
Model selection in logistic joinpoint regression with applications to analyzing cohort mortality patterns
Michal Czajkowski1, Ryan Gill1,∗,† and Grzegorz Rempala1,2
1 Department of Mathematics, University of Louisville, Louisville, KY, U.S.A.
2 Center for Genetics and Molecular Medicine, University of Louisville, Louisville, KY, U.S.A.
SUMMARY
We consider a general model for anomaly detection in a longitudinal cohort mortality pattern based on logistic joinpoint regression with unknown joinpoints. We discuss backward and forward sequential procedures for selecting both the locations and the number of joinpoints. Estimation of the model parameters and the selection algorithms are illustrated with longitudinal data on cancer mortality in a cohort of chemical workers. Copyright © 2007 John Wiley & Sons, Ltd.
KEY WORDS: anomaly detection; occupational cohort; mortality trends; logistic joinpoint regression; model selection
1. INTRODUCTION
Identifying recent trends in diseases and, more generally, monitoring disease trends has historically been a crucial problem in many areas of epidemiology and public health research and, lately, also in global security considerations (see Banks [1], Stoto et al. [2]). For instance, in 2003, researchers faced this issue during the worldwide outbreak of the SARS epidemic. In many similar situations, in order to obtain a consistent characterization of population trends in factors related to prevention, early detection, or treatment of a disease, or to identify a historic ‘trigger’ event, health researchers apply statistical methodology based on joinpoint regression. This methodology characterizes a trend using joined linear segments and may be viewed as a special case of nonparametric spline regression with a variable number of knots, as described by Eubank [3] or in Chapter 5 of Hastie et al. [4]. These types of models with a single knot have also been introduced
∗ Correspondence to: Ryan Gill, Department of Mathematics, University of Louisville, Louisville, KY 40292, U.S.A.
† E-mail:
[email protected]
Contract/grant sponsor: Publishing Arts Research Council; contract/grant number: 98-1846389
Contract/grant sponsor: National Institutes of Health; contract/grant number: R15 CA106248-01
Contract/grant sponsor: Statistical and Applied Mathematical Sciences Institute
Received 25 July 2006 Accepted 19 June 2007
in epidemiological studies of occupational exposures in connection with logistic regression models searching for threshold limit values (TLVs) (see Ulm [5], or, more recently, Gössl and Küchenhoff [6] and the references therein).
When considering multiple-knot models, an obvious question of knot selection and verification arises. In order to estimate the number of knots, Eubank [3] suggested several procedures based on generalized cross-validation (GCV) and C_p-type statistics, and one based on Akaike’s information criterion. These methods share a drawback: they neither provide probabilities of misclassification nor offer a clear way of comparing competing models with different numbers of knots. This is also true for a variety of Bayesian methodologies put forth in that context, for instance the Gibbs sampling schemes suggested by Stephens [7] or Green [8]. In order to rectify some of the deficiencies in knot selection, Kim et al. [9] suggested a sequential algorithm for testing the number of joinpoints in the model. The usefulness of the methodology has been enhanced by the fact that the free software Joinpoint (see http://srab.cancer.gov/joinpoint/) has been made available to facilitate fitting and testing of their joinpoint model. The software uses the least-squares criterion under squared-error loss and a goodness-of-fit measure based on the usual F-statistic. It also estimates the p-value of this statistic by an approximate permutation test.
Although the method seems appealing for modeling aggregated disease trends in very large cohorts (e.g. U.S. yearly cancer rates), it suffers from two major potential limitations when applied to smaller cohorts and/or less prevalent diseases. Firstly, the methodology for fitting multiple joinpoints in practice deals with only two types of responses: (i) Gaussian-like responses, i.e. continuous with no restrictions on the response values, and (ii) those coming from the Poisson model.
However, in retrospective cohort studies (where typically the yearly person-year counts are available), these types of responses arise only as large-sample approximations of the actual mortality counts. Although these approximations are sufficient under many circumstances in large cohorts, owing to the central limit theorem and the Poisson limit theorem, there are also instances when, for small-to-moderate size cohorts (again, consider the SARS epidemic of 2003), one would be interested in analyzing trends by evaluating exact responses, which are modeled either as binary variables or as sums of clustered-binary variables. An example of clustered-binary responses would be the analysis of incidences of a disease in a cluster of subjects working or living in the same temporal or spatial environment, such as an enclosed production area on a factory floor, a hospital, an apartment building, etc.
A second and perhaps even more serious limitation of the existing methodology for fitting joinpoints is its inability to deal with more complex models involving multiple covariates as well as several knots. Indeed, because it relies on the computationally intensive grid search method of finding maximum likelihood estimates (MLEs) (see Lerman [10]), the joinpoint model-fitting procedure is slow, and the current version of the Joinpoint software allows only for a single covariate. However, in many circumstances, one would also be interested in fitting additional covariates to the model. For instance, when creating a model of a SARS epidemic outbreak, in addition to the time of occurrence of a SARS case, investigators might also like to consider the victim’s age, gender, ethnicity, location, etc.
The aim of the current article is to address both of the above deficiencies by (i) extending the general idea of Kim and co-authors to the case of binary and clustered-binary responses and (ii) introducing an alternative search algorithm for finding the MLEs.
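For orientation, the grid search baseline mentioned above can be sketched for binomial (clustered-binary) counts. This is a minimal illustration under stated assumptions, not the authors' software and not the alternative algorithm they propose: it assumes a single joinpoint tau on the time covariate and a linear predictor of the form b0 + b1*t + d*max(t - tau, 0); the function names, the candidate grid, and the synthetic data are all hypothetical.

```python
import numpy as np

def hinge_design(t, tau):
    """Design matrix [1, t, (t - tau)_+] for one joinpoint at tau (assumed model form)."""
    return np.column_stack([np.ones_like(t), t, np.maximum(t - tau, 0.0)])

def fit_binomial_logit(X, y, n, iters=50, tol=1e-10):
    """Newton-Raphson fit of a binomial logistic regression.

    y holds event counts and n the cluster sizes. Returns (beta, loglik),
    where loglik omits the binomial coefficient (constant in beta and tau).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - n * p)                  # score vector
        hess = X.T @ (X * (n * p * (1.0 - p))[:, None])  # Fisher information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    eta = X @ beta
    loglik = np.sum(y * eta - n * np.log1p(np.exp(eta)))
    return beta, loglik

def grid_search_joinpoint(t, y, n, grid):
    """Refit the model at every candidate tau and keep the best log-likelihood."""
    fits = [(tau, *fit_binomial_logit(hinge_design(t, tau), y, n)) for tau in grid]
    return max(fits, key=lambda f: f[2])

# Synthetic illustration: flat log-odds until t = 10, rising afterwards.
t = np.arange(20.0)
n = np.full(t.shape, 500.0)
eta_true = -3.0 + 0.3 * np.maximum(t - 10.0, 0.0)
y = n / (1.0 + np.exp(-eta_true))  # expected counts stand in for random draws
tau_hat, beta_hat, loglik_hat = grid_search_joinpoint(t, y, n, np.arange(2.0, 18.0))
print(tau_hat, np.round(beta_hat, 3))
```

Each candidate joinpoint requires a full refit, so the cost grows with the size of the grid and, with several joinpoints, with the number of joinpoint combinations; this is the computational burden that motivates looking for a faster search.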
The algorithm proposed herein is based on the idea of conditioning the model’s likelihood function on the set of joinpoints and utilizing the piecewise convexity of the conditional likelihood. This method is substantially more efficient than the grid search and thus particularly suitable for use with the parametric
bootstrap and the sequential resampling-based testing method, which we propose to employ in the final determination of the number and the numerical values of the joinpoints (knot points) in the model.
Our present consideration of the joinpoint regression model was originally motivated by the analysis of occurrence-type data coming from longitudinal, retrospective mortality studies. One example of such data is discussed in more detail in Section 5, where we consider non-lung cancer mortality from 1942 to 1995 among the cohort of chemical workers from Louisville, KY. We illustrate our proposed methodology by analyzing some of the historical data on the cohort collected by the Louisville Occupational Health Surveillance Program or OHSP (see Lewis and Rempala [11]). The OHSP data set seems typical of the occupational health databases from the early years of occupational health monitoring. Among other information, it contains the individual employee work history records, rank-order estimates for exposure to 22 chemicals for every factory job, and mortality records for 2526 chemical workers. Whereas the work histories and other employee characteristics are relatively well documented in the data set, it contains only limited information on chemical exposure, owing to the lack of monitoring in the industry’s early years. Following the common surveillance practice for such circumstances, Louisville OHSP researchers have developed an ad hoc quasi-quantitative exposure measure based on a monthly ranking scale, referred to as CERM (Cumulative Exposure Rank Month, see Section 5).
The idea of applying a joinpoint model (with a single knot) to binary responses in occupational exposure was originally introduced by Ulm [5] in the context of estimating an unknown TLV. In dose–response relationships, the TLV is the dose of a toxin or substance below which there is no influence on the response. In the context of a dose–response model, the joinpoint has a clear interpretation as the TLV.
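To make the TLV interpretation concrete, a single-knot threshold model of this kind can be written as below. This is only an illustrative sketch: the symbols $\alpha$, $\beta$, $\tau$ and $x$ are notation assumed here for the intercept, slope, joinpoint and dose, and need not match the notation of [5] or of the later sections.

```latex
\[
  \operatorname{logit} P(Y = 1 \mid x) \;=\; \alpha + \beta\,(x - \tau)_{+},
  \qquad (x - \tau)_{+} = \max(x - \tau,\, 0).
\]
```

For doses $x \le \tau$ the response probability stays at the baseline value $e^{\alpha}/(1 + e^{\alpha})$, and only for $x > \tau$ does it vary with the dose, which is why the joinpoint $\tau$ can be read as the threshold limit value.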
For the Louisville cohort mortality data, as well as for many similar data sets, we cannot consider the dose–response model because of the difficulties in assessing the proper doses of the toxin via CERM-like scales. Nonetheless, because of the way they are constructed, the CERM scores could act as surrogates for an exposure homogeneity measure; thus, a CERM-like scale may often be appropriate for clustering the binary responses. In many circumstances (e.g. the OHSP data set), factory records detailing the production activity over a particular period of time are available even when the dose estimates of toxin intakes are not. From that perspective, the joinpoints on the time covariate may be viewed as temporal boundaries of disease clusters (up to a time-bin width) or as ‘trigger’ events, i.e. times at which a particular activity occurred that subsequently affected the frequency response as compared with some underlying baseline. We note that, in analyzing occupational cohorts such as Louisville via a joinpoint model (in contrast to, e.g., P-splines), determining the exact time (joinpoint) at which the change in mortality pattern begins is essential for identifying a ‘trigger’ event. The formalism of testing for a joinpoint (temporal) location also allows us to test retrospectively the hypothesis, say, that a particular production activity was associated with a ‘trigger’ event.
In situations such as the one considered in this paper, where there are practical reasons for assuming the existence of one or more joinpoints and also interest in estimating their locations, joinpoint methods provide a useful means of doing so. On the other hand, it is worth mentioning that, for simply summarizing the trends in non-linear data, there are other knot-based alternatives, such as regression splines, which provide fast ways of obtaining a flexible non-linear fit.
In such models, the selection of the number and locations of the knots is important because enough knots are needed to provide a good fit, but using too many knots results in overfitting. There are many ways of making these choices, including penalized and adaptive splines. One penalized
method, known as P-splines, involves regression splines fit by least squares with a roughness penalty (see Eilers and Marx [12] and Ruppert [13]).
The current paper is organized as follows. In the next section, we briefly describe the logistic joinpoint regression model with clustered-binomial responses and give two sequential algorithms for the selection of joinpoints. In Section 3, we consider estimating model parameters under the assumption that the number of joinpoints is known. Section 4 provides the details of the backward and forward selection procedures for the number of joinpoints and compares their performance under various scenarios in a simulation study. As already indicated, in Section 5 we provide a numerical example of the application of our method of analysis to some occupational data on historic patterns of cancer mortality in the Louisville cohort of chemical workers. We summarize our findings and offer some conclusions in Section 6.
Herein, we are primarily concerned with the case when responses in the regression model are binomial variables. We note, however, that most of the methods proposed may be extended in an obvious manner to an arbitrary generalized linear model (GLM). A beta version of the software that implements most of the work described in this paper in the free statistical software environment R (see e.g. [14] for the description of the R project) is available for download at http://www.math.louisville.edu/~stats/.
2. LOGISTIC JOINPOINT REGRESSION MODEL
As already indicated in the previous section, we are concerned with the retrospective analysis of occurrence-type data, which is motivated by the problem of tracking the mortality from a particular disease over a specific time period. Consequently, we consider the logistic joinpoint regression model in which we observe data points $(z_1, t_1, y_1), \ldots, (z_N, t_N, y_N)$ where the $p$-dimensional vectors $z_1, \ldots, z_N$ are fixed covariates, the ‘times’ $t_1 < \cdots$