MODEL SELECTION WITH SNML IN LOGISTIC REGRESSION

Antti Liski¹, Ioan Tabus² and Reijo Sund³

¹ Department of Signal Processing, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland, [email protected]
² Department of Signal Processing, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland, [email protected]
³ National Institute for Health and Welfare, P.O. Box 30, FI-00271 Helsinki, Finland, [email protected]

ABSTRACT

In effectiveness research we are often interested in comparing patient groups from different providers (e.g. hospital districts or hospitals). For such a comparison to be directly possible, the patients should be randomized with respect to the factors that have an unwanted impact on the outcome. If this is not the case, we need to identify the relevant variables before we can draw inferences about the outcome. We introduce an MDL method for finding these variables of interest, using the sequentially normalized maximum likelihood criterion in logistic regression for register data. The variable selection is carried out for first-time hip fracture patients: we want to adjust for three comorbidities (congestive heart failure, cancer, diabetes) and find out how far back in time we should look to identify each of these conditions.

1. INTRODUCTION

For a direct comparison of treatment practices, we usually need the assumption that patients are randomly assigned to each hospital. In practice, however, patients are assigned to hospitals based on geography or the need for special treatment, so we cannot assume a randomized trial. This problem may be handled by adjusting for the unwanted variation, and in order to do that we need to find the sources of variation. Obviously we do not want certain patient characteristics (such as age or sex) to affect the outcome. We also want the patients to be as similar as possible in their state of health, and therefore we also need to adjust for their medical history. To obtain reliable results on the outcome, we need to find the variables that should be adjusted for.

2. HIP FRACTURE DATA

Our dataset consists of the population of first-time hip fracture patients from the years 1999-2005 (n = 28797). The set was restricted to patients who were 50 years or older and not institutionalized at the time of the fracture. For each patient we have information on certain characteristics (age and sex), the type of the fracture, the date of death (if the patient has died) and a medical history spanning up to 10 years before the fracture.
By using this history, we have identified three comorbidities (congestive heart failure (CHF), cancer and diabetes) by looking back over time periods of different lengths (6 months, 1 year, 3 years, 5 years and 10 years). The further back in time we look, the more occurrences of each comorbidity we are able to find. We have classified the age variable into three classes, and all our variables are coded as dichotomous.

3. THE MODEL

There are two response variables which we model. The 90-day mortality variable tells us whether a patient died within 90 days after the fracture, and the 365-day mortality variable indicates whether the patient died within one year after the fracture. We use a logistic regression model. As explanatory variables we have age (coded into three groups), sex, fracture type (three groups), a constant and the comorbidity variable. All other variables stay the same between the models; only the comorbidity variable changes. The comorbidity variable runs through three comorbidities and five time periods within each comorbidity, so we have altogether 15 models to compare.

The reason why we model mortality is that a hip fracture itself does not cause death. Therefore, if a patient has died after the fracture, it is because of high age or because the fracture triggered a process that worsened the patient's health status, leading to death. Death after a hip fracture thus serves as an indicator that the patient's health status was already lowered before the fracture.
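To make the setup concrete, the following minimal Python sketch fits the 15 candidate models. The file name and the column names (age and fracture-type dummies, and comorbidity indicators such as chf_1y) are hypothetical placeholders, not the actual register variables of the paper.

```python
import itertools
import pandas as pd
import statsmodels.api as sm

# Hypothetical register data: one row per patient, all covariates dichotomous
# (age dummies, sex, fracture-type dummies) plus comorbidity indicators such
# as "chf_6m" = CHF found within 6 months before the fracture.
df = pd.read_csv("hip_fractures.csv")  # placeholder file name

base_covariates = ["age_50_64", "age_65_79", "sex",
                   "fracture_cervical", "fracture_trochanteric"]
comorbidities = ["chf", "cancer", "diabetes"]
periods = ["6m", "1y", "3y", "5y", "10y"]

models = {}
for com, per in itertools.product(comorbidities, periods):  # 3 x 5 = 15 models
    X = sm.add_constant(df[base_covariates + [f"{com}_{per}"]])
    # 90-day mortality as the binary response; 365-day mortality is analogous.
    models[(com, per)] = sm.Logit(df["death_90d"], X).fit(disp=0)
```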
4. SEQUENTIALLY NORMALIZED MAXIMUM LIKELIHOOD

To be able to compare the models we have fitted to our data, we need a model selection criterion. In this paper we consider the sequentially normalized maximum likelihood (sNML) criterion. The reason for not using the normalized maximum likelihood (NML) criterion [1] is that with NML the computation of the normalizing constant becomes an extremely heavy task in logistic regression as the sample size increases [2]. We follow the notation of Tabus & Rissanen [2].

Let us denote the sequence $(y_1, y_2, \ldots, y_n)$ by $y^n$ and the matrix of regressor variables $(x_1, x_2, \ldots, x_n)$ by $X^n$. Roos & Rissanen [3] presented the sequentially normalized maximum likelihood (sNML) function. In the case of logistic regression, the sNML function for a sequence $y^n$ of length $n$, given data $X^n$, may be written as
$$\hat{P}(y^n \mid X^n) = \hat{P}(y^m \mid X^m) \prod_{t=m+1}^{n} \hat{P}(y_t \mid y^{t-1}, X^t),$$
where
$$\hat{P}(y_t \mid y^{t-1}, X^t) = \frac{P(y_t \mid y^{t-1}, X^t, \hat{\beta}(y^t))}{K(y^{t-1})}. \qquad (1)$$
We denote the $t$-th value in the sequence $y^t$ by $y_t$, and $\hat{\beta} = \hat{\beta}(y^t)$ is the maximum likelihood (ML) estimate of $\beta$ calculated using $y^t$. Let us denote
$$k = \sum_{i=1}^{t} x_i y_i.$$
The numerator in (1) may be written as
$$P(y_t \mid y^{t-1}, X^t, \hat{\beta}) = \frac{P(y^t \mid X^t, \hat{\beta})}{P(y^{t-1} \mid X^t, \hat{\beta})},$$
where
$$P(y^t \mid X^t, \hat{\beta}) = \frac{e^{\hat{\beta}^T k}}{\prod_{i=1}^{t} \left(1 + e^{\hat{\beta}^T x_i}\right)}.$$
Notice that the normalizing constant is
$$K(y^{t-1}) = P(y_t = 0 \mid y^{t-1}, X^t, \hat{\beta}_0(y^t)) + P(y_t = 1 \mid y^{t-1}, X^t, \hat{\beta}_1(y^t)),$$
where $\hat{\beta}_a(y^t)$ denotes the ML estimate of $\beta$ calculated from data where $y_t = a$, $a = 0, 1$. The computation of $K(y^{t-1})$, even though it has to be done close to $n$ times, is significantly simpler than the computation of the normalizing constant for the non-sequential NML. We may also calculate the code length just for observation $t$ from
$$-\log \hat{P}(y_t \mid y^{t-1}, X^t) = -\log P(y_t \mid y^{t-1}, X^t, \hat{\beta}(y^t)) + \log K(y^{t-1}).$$
This way we are able to compute the code length for a subset of observations and see how much this subset contributes to the overall code length.
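Because the observations in a logistic regression model are conditionally independent given the regressors, the ratio defining the numerator reduces to the single-observation probability $P(y_t \mid x_t, \hat{\beta})$. The following minimal Python sketch uses this to compute the per-observation sNML code lengths; the function name, the burn-in length m, and the use of statsmodels for the ML fits are our own illustrative choices, not part of the paper.

```python
import numpy as np
import statsmodels.api as sm

def snml_code_lengths(X, y, m=50):
    """Per-observation sNML code lengths -log Phat(y_t | y^(t-1), X^t) for
    logistic regression. The first m observations serve as a burn-in and get
    no code length (NaN). X must already contain the constant column and be
    ordered the same way as y."""
    n = len(y)
    lengths = np.full(n, np.nan)
    for t in range(m, n):
        Xt = X[: t + 1]
        prob = np.empty(2)
        for a in (0, 1):
            ya = np.append(y[:t], a)                    # hypothesize y_t = a
            beta = sm.Logit(ya, Xt).fit(disp=0).params  # ML estimate beta_a(y^t)
            p1 = 1.0 / (1.0 + np.exp(-Xt[t] @ beta))    # P(y_t = 1 | x_t, beta_a)
            prob[a] = p1 if a == 1 else 1.0 - p1
        K = prob[0] + prob[1]                           # normalizer K(y^(t-1))
        lengths[t] = -np.log(prob[int(y[t])]) + np.log(K)
    return lengths
```

Summing the returned lengths over all $t$ gives the total code length by which the 15 models can be ranked; summing over an index subset gives that subset's contribution, as used in the next section. In practice the refits at small $t$ may fail to converge (e.g. due to perfect separation), which is one reason for the burn-in.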
5. THE ANALYSIS

Because we compute the criterion sequentially, we need an ordering for our dataset, and there is no strict rule for how to order the observations. If we computed NML non-sequentially, we would need to assume that the treatment practices stayed unchanged over the time interval from which our data were gathered. A more relaxed assumption is that the treatment practices have developed over time, which is reasonable because our dataset spans the years 1999-2005. We have therefore ordered our dataset by the date of arrival at the hospital and computed the criterion sequentially.

Because we look for comorbidities over look-back periods of different lengths, we find more occurrences the further back we look in the medical history for each comorbidity. Therefore, as we extend the time period from, say, 1 year to 3 years, new patients are assigned to each comorbidity. We are also interested in how much these 'newly assigned' patients contribute to the total change in code length as we increase the time period for each comorbidity.
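As an illustration, the following hedged snippet reuses snml_code_lengths from the previous sketch, the hypothetical column names from Section 3, and a data frame df_sorted assumed to be ordered by date of arrival. It computes the total change in code length when the CHF look-back period is extended from 1 year to 3 years, and the part of that change contributed by the newly assigned patients.

```python
import numpy as np
import statsmodels.api as sm

# Designs for two look-back periods of the same comorbidity (CHF).
y = df_sorted["death_90d"].to_numpy()
X1 = sm.add_constant(df_sorted[base_covariates + ["chf_1y"]]).to_numpy()
X3 = sm.add_constant(df_sorted[base_covariates + ["chf_3y"]]).to_numpy()

L1 = snml_code_lengths(X1, y)   # per-observation code lengths, 1-year model
L3 = snml_code_lengths(X3, y)   # per-observation code lengths, 3-year model

total_change = np.nansum(L3) - np.nansum(L1)

# 'Newly assigned' patients: CHF found within 3 years but not within 1 year.
new = ((df_sorted["chf_3y"] == 1) & (df_sorted["chf_1y"] == 0)).to_numpy()
new_change = np.nansum(L3[new]) - np.nansum(L1[new])
```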
6. RESULTS

As expected, the differences in code lengths between the different time intervals (6 months, 1 year, 3 years, 5 years, 10 years) were not large, because we do not find many new occurrences of the comorbidities as we extend the time period. Our results indicated that for CHF we should look up to 10 years back in the patient's medical history. For cancer, 6 months of medical history already seemed to be enough. For diabetes the models did not really differ from each other, but using diabetes as an explanatory variable with any of the time intervals still improved our model. The results for the two mortalities were in line with each other. There were also differences between the comorbidities in how the 'newly assigned' patients affected the total change in code length. We also compared the results obtained with sNML to AIC, which was computed non-sequentially. Both criteria gave similar results for the different time periods and comorbidities.

7. CONCLUSION

Even though the differences between the models compared here were not huge, the choice of the time period was well justified for CHF and cancer. Clinical explanations seem to confirm our current findings. The analysis done here may be repeated for a larger number of comorbidities to improve the adjustments used in effectiveness studies.

8. REFERENCES
[1] J. Rissanen, “Fisher information and stochastic complexity,” IEEE Transactions on Information Theory, vol. 42, pp. 40–47, Dec. 1996.
[2] I. Tabus and J. Rissanen, “Normalized maximum likelihood models for logit regression,” Festschrift for Tarmo Pukkila on his 60th birthday, pp. 159–172, 2006.
[3] T. Roos and J. Rissanen, "On sequentially normalized maximum likelihood models," in Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-08), Tampere, Finland, Aug. 2008.