INFORMATION THEORY MAKES LOGISTIC REGRESSION SPECIAL

Ernest S. Shtatland, PhD
Mary B. Barton, MD, MPP

Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA
ABSTRACT

This paper is a continuation of our previous presentations at NESUG 1997 and SUGI 1998 ([1], [2]). Our first aim is to provide a theoretical justification for using logistic regression (vs. probit, gompit, or angular regression models, for example). The consensus is that logistic regression is a very powerful, convenient, and flexible statistical tool; however, it is completely empirical. Information theory can guide our interpretation of logistic regression as a whole, and of its coefficients; through this interpretation we will demonstrate how logistic regression is special, and unlike the other regression models mentioned above. A similar approach will be used to interpret Bayes' formula in terms of information. Our second goal is to propose a test of significance that, in the case of small samples, is superior to the conventional Chi-Square test. This is important because, in addition to the unreliability of the Chi-Square, small sample sizes are typical and unavoidable in many fields, including medical and health services research. The proposed test can also be interpreted in terms of information theory.

LOGISTIC REGRESSION AND INFORMATION

To model the relationship between a dichotomous outcome variable (YES vs. NO, DEAD vs. ALIVE, SELL vs. NOT SELL, etc.) and a set of explanatory variables, we have a fairly wide "menu" of alternatives, such as logistic, probit, gompit, and angular regression models. In [3] we can find a larger list of seven alternatives. Only two of them, probit and logit, have received significant attention. According to [3], p. 79, even probit and logit (not to mention the other possible nonlinear specifications) are arbitrary. See also [4], p. 388, on the arbitrariness of logit models: "The logit transformation is similar to the probit but on biological grounds is more arbitrary." In many sources the opinion has been expressed that logistic regression is a very powerful, convenient, and flexible statistical tool that is, however, completely empirical, with no theoretical justification ([5], p. 164; [6], p. 1724). To the best of our knowledge, [6] is the first and only work that provides some theoretical justification for logistic models. However, that justification is given on deterministic grounds, in terms of general systems theory. We will show that logistic regression is special and unlike the other regression models mentioned above by justifying it in statistical terms (within information theory) rather than on deterministic grounds as in [6]. This is more natural because logistic analysis is first and foremost a statistical tool. A typical logistic regression model is of the form

log(P / (1 - P)) = b0 + b1 X1 + b2 X2 + ... + bk Xk     (1)

where P is the probability of the event of interest, b0, b1, b2, ..., bk are the parameters of the model, X1, X2, ..., Xk are the explanatory variables, and log is the natural logarithm. In this form the logistic model with the logit link looks really arbitrary, with no advantages over any other model discussed above. As is well known, for any random event E we have two numbers: the event's probability P(E) and its information I(E), the information contained in the message that E occurred. These quantities are connected according to the formula

I(E) = - log P(E)     (2)

Usually the log in (2) is the binary logarithm, in which case information is measured in bits. Of course, other bases of the logarithm can be used, and as a result the information units vary. Information is as fundamental a concept as probability, and there are cases (in particular, in physics and engineering) in which information is even more convenient and natural than probability. Perhaps logistic regression is one of these cases. Taking into consideration the definition of information (2), it is easy to see that the left side of (1) is the difference in information content between the event of interest E and the nonevent NE. The appearance of the information difference (ID) between E and NE seems logical because logistic regression can be treated as a variant of discriminant analysis when the assumption of normality is not justified (see, for example, [7], p. 232; [8], pp. 19-20, 34-36; or [9], pp. 355-356). That is why the information difference ID could also be called the discriminant information difference.
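To make this information reading of (1) concrete, here is a minimal numeric sketch in Python. It is our illustration, not part of the original paper; the helper name information and the example probability are our choices. With I(E) defined as in (2) and the binary logarithm, the logit of P equals I(NE) - I(E), the amount by which the nonevent is more surprising than the event.

```python
import numpy as np

def information(p):
    """I(E) = -log2 P(E): information content of an event, in bits (formula (2))."""
    return -np.log2(p)

p = 0.8                                  # P(E), probability of the event of interest
logit = np.log2(p / (1 - p))             # left side of (1), taken with the binary log

# The logit is exactly the information difference ID between nonevent and event:
id_diff = information(1 - p) - information(p)
print(logit, id_diff)                    # both equal 2.0 bits
```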
The interpretability in terms of information is a unique property of logistic regression and constitutes the advantage of the logistic model over probit, gompit, and other similar models.

We will also give a new interpretation to the coefficients b0, b1, b2, ..., bk in (1). Usually they are interpreted as logarithms of odds ratios. According to [8], p. 41, "This fact concerning the interpretability of the coefficients is the fundamental reason why logistic regression has proven such a powerful analytic tool for epidemiologic research." And further, see [8], p. 47: "This relationship between the logistic regression coefficient and the odds ratio provides the foundation for our interpretation of all logistic regression results." But the question arises whether odds ratios themselves are a solid foundation for this interpretation. According to the same authors ([8], p. 42), "The interpretation given for the odds ratio is based on the fact that in many instances it approximates a quantity called the relative risk." Other authors are not so optimistic about this approximation and the value of odds ratios. The opinions vary from "Odds ratios are hard to comprehend directly" ([10], p. 989), to "odds ratio is hard to interpret clinically" ([11], p. 1233), to Miettinen's opinion that the odds ratio is epidemiologically "unintelligible". Also, according to Altman ([9], p. 271): "The odds ratio is approximately the same as the relative risk if the outcome of interest is rare. For common events, however, they can be quite different, so it is best to think of the odds ratio as a measure in its own right." Altman's conclusion that "it is best to think of the odds ratio as a measure in its own right" is especially to the point. We think that logistic regression coefficients also need a new interpretation in their own right, and that this interpretation should be given in terms of information. It is easy to show that the coefficients b1, b2, ..., bk have the meaning of the change in the discriminant information difference (ID) as the corresponding explanatory variable gets a one-unit increase (with statistical adjustment for the other variables). For example, if X1 is a dichotomous explanatory variable with values 0 and 1, then

b1 = ID(E vs. NE | X1 = 1) - ID(E vs. NE | X1 = 0)     (3)

Thus, equation (1) can be treated as the decomposition of the discriminant information difference between the event E and the nonevent NE into the sum of contributions of the explanatory variables. This decomposition in terms of information is linear, unlike the original logistic model, which is nonlinear in terms of probabilities.
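A quick simulation makes (3) tangible. The sketch below is our own illustration (it assumes the statsmodels package and synthetic data with parameter values we chose): it fits model (1) with a single dichotomous X1 and checks that the fitted b1 equals the change in the empirical information difference, that is, the change in the log odds, between the X1 = 1 and X1 = 0 groups.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x1 = rng.integers(0, 2, n)                       # dichotomous explanatory variable
p_true = 1 / (1 + np.exp(-(-1.0 + 0.7 * x1)))    # model (1) with b0 = -1.0, b1 = 0.7
y = (rng.random(n) < p_true).astype(float)

res = sm.Logit(y, sm.add_constant(x1.astype(float))).fit(disp=0)
b1 = res.params[1]

# Empirical ID(E vs. NE | X1 = x) = log(P / (1 - P)) within each group:
p1, p0 = y[x1 == 1].mean(), y[x1 == 0].mean()
id_change = np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))
print(b1, id_change)                             # b1 is the change in ID
```

With a single dichotomous predictor the model is saturated, so the two numbers agree exactly; with adjustment for other variables, the equality in (3) holds for the model-based rather than the raw information differences.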
INFORMATION AND BAYES' FORMULA

It is well known how important Bayes' formula is for modifying disease probabilities based on diagnostic test results (see, for example, [12] and [13]). The most popular and intuitive variant of Bayes' formula used in the medical literature is the odds-likelihood ratio form:

posterior odds in favor of disease = prior odds in favor of disease * likelihood ratio     (4)

or

P(D | R) / P(ND | R) = (P(D) / P(ND)) * (P(R | D) / P(R | ND))     (4')

where D stands for disease, ND for nondisease, and R for a test result. Even this "odds-instead-of-probabilities" form is not intuitive enough. The first problem here is similar to the problem with odds ratios: unlike risks, odds are difficult to understand; they are fairly easy to visualize when they are greater than one, but are less easily grasped when the value is less than one ([10], pp. 989-990). The second problem with odds is that, although they are related to risk, the relation is not straightforward: the two characteristics become increasingly different in the upper part of the scale. The third problem is related to the fact that formula (4') is inherently multiplicative, while human thinking grasps additive relationships more easily. This is the reason why researchers working with the more conventional forms of Bayes' formula, like (4) or (4'), sometimes use special nomograms and tables to calculate the posterior odds and probabilities ([13], pp. 124-126). Taking the logarithms of both sides of (4') (in this case the binary logarithm is more appropriate), we arrive at the following relationship between information quantities:
ID(D, ND | R) = ID(D, ND) + ID(R | D, ND)     (5)

where ID(D, ND | R) is the posterior information difference between disease D and nondisease ND given the result R of the test, ID(D, ND) is the corresponding prior information difference, and ID(R | D, ND) is the difference in information contents of the test result between the disease and nondisease cases. In other words, (5) could be reformulated as follows: the discrimination information difference between disease and nondisease after the test = the discrimination information difference between disease and nondisease before the test + the information contained in the test result about the disease/nondisease dilemma.
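For completeness, the single step from (4') to (5) can be written out as follows (our notation; each term is a difference of the information quantities defined in (2)):

```latex
% Take the binary logarithm of both sides of (4'):
\log_2\frac{P(D \mid R)}{P(ND \mid R)}
  = \log_2\frac{P(D)}{P(ND)} + \log_2\frac{P(R \mid D)}{P(R \mid ND)}
% Each term is a difference of informations, e.g.
% \log_2 [P(D)/P(ND)] = -\log_2 P(ND) - (-\log_2 P(D)) = I(ND) - I(D) = ID(D, ND),
% which gives ID(D, ND | R) = ID(D, ND) + ID(R | D, ND), i.e., formula (5).
```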
Thus the information variant of Bayes' formula (5) conveys literally what we always imply when working with the conventional Bayes' theorem: the increase and balance of information.
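As a worked illustration of (5), take a disease with 1% prevalence and a hypothetical test with 90% sensitivity and 95% specificity (these numbers are our own, chosen for the example): the posterior information difference is obtained by simple addition in bits.

```python
import numpy as np

prevalence = 0.01                # P(D); assumed 1% prior probability of disease
sens, spec = 0.90, 0.95          # assumed test sensitivity and specificity

prior_id = np.log2(prevalence / (1 - prevalence))   # ID(D, ND): about -6.63 bits
test_info = np.log2(sens / (1 - spec))              # ID(R | D, ND) for a positive result
post_id = prior_id + test_info                      # formula (5): addition, not multiplication

post_prob = 1 / (1 + 2.0 ** (-post_id))             # back-transform to a probability
print(prior_id, test_info, post_id, post_prob)      # posterior P(D | R) is about 0.15
```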
LOGISTIC REGRESSION AND SMALL SAMPLES
Testing the statistical significance of the logistic model is based on the fact that the difference

2(logL(b) - logL(0))     (6)

has approximately a Chi-Square distribution with K degrees of freedom when the number of observations N is large enough (theoretically, N is infinite). Here logL(b) and logL(0) denote the log-likelihoods of the fitted and "null" (intercept-only) models, respectively; b = (b0, b1, b2, ..., bk), and log is the natural logarithm. The questions immediately arise: what does "large enough" mean, how often is the assumption of "large enough" not satisfied, and what should be done in that case? We have to add to these problems the fact that the likelihood-ratio test based on the statistic (6) is too liberal and tends to overparameterize the model ([14], pp. 502-503), for both large and small samples. Usually statisticians avoid these questions, but the problem still remains. We will try to address the questions mentioned above. As to the question of how often a small-sample situation can be encountered, we mention the results of just two papers on meta-analysis of clinical trials ([15], [16]), which demonstrate that small numbers of participants in the trials (e.g., fewer than 50) are very common. Small samples are also common in many studies in fields such as the behavioral sciences and psychology (especially in studies of the exploratory rather than confirmatory type). Much more difficult is the practical question of what to do if we do have a small sample.
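Before turning to the remedy, here is how the conventional test based on (6) is computed. The sketch is our illustration, assuming the statsmodels and scipy packages and a deliberately small synthetic sample (N = 40, K = 3) with parameter values we chose:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(1)
N, K = 40, 3                                     # a deliberately small sample
X = rng.normal(size=(N, K))
eta = 0.5 + X @ np.array([0.8, 0.0, -0.4])       # a hypothetical true model
y = (rng.random(N) < 1 / (1 + np.exp(-eta))).astype(float)

res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
lr = 2 * (res.llf - res.llnull)                  # statistic (6); llnull is the intercept-only fit
p_chi2 = stats.chi2.sf(lr, K)                    # the conventional (liberal) Chi-Square test
print(lr, p_chi2)
```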
As shown in [1] and [2], we can think of

2(logL(b) - logL(0)) / N     (7)

as a sample estimate of the difference between the information characteristics of the "null" model (b = 0) and the model under consideration. The most natural interpretation of this difference is the information gain, IG, that we obtain in moving from the simplest "null" model to the fitted model ([17], pp. 163-173). As a result we have

IG * (N / K) = 2(logL(b) - logL(0)) / K     (8)

or

IG * (N / K) = Chi-Square(K) / K     (9)

With Chi-Square approximations being at least questionable or even misleading when the sample size is not large enough, a realistic approach, adopted by some practitioners, is to treat the right side of (8) and (9) as an F statistic with K and N - K - 1 degrees of freedom (see, for example, [3], p. 89, and [2]). In [3] this statistic is called the "asymptotic F". It is similar to the Wald F statistic (see, for example, [18]). Using the asymptotic F instead of the Chi-Square makes the test of significance more conservative, and the smaller the sample size is, the more conservative the asymptotic F is in comparison with the original Chi-Square. By using this "practitioners'" approach we arrive at the following equation:

Fasymptotic = IG * (N / K)     (10)
The term “F asymptotic” is most probably related to the fact that its critical values approximate the corresponding critical values of the Chi-Square as N increases. Thus the Fasymptotic literally becomes the Chi-Square for very large values of N. But as we mentioned above, the limiting Chi-Square is too liberal according to [14]. To make our test of significance even more conservative, we can multiply the asymptotic F statistic (which is equal to the right side in (8) and (9)) by (N-K-1) / N. The result is the “adjusted F” statistic:
Fadjusted = Fasymptotic * (N - K - 1) / N     (11)
which is similar to the adjusted Wald statistic ([18], [19]). Our new F statistic can be given in information terms as follows:
Fadjusted = IG * (N - K - 1) / K     (12)
Note that this formula is the same as the one we derived in our SUGI '98 paper [2] for linear regression.
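Continuing the small-sample sketch above (it reuses lr, N, K, p_chi2, and stats as computed there), the information gain (7) and the two F statistics (10) and (12) can be obtained as follows, with p-values taken from the F distribution with K and N - K - 1 degrees of freedom:

```python
# Continuation of the previous sketch: lr, N, K, p_chi2, and stats are defined there.
IG = lr / N                                      # information gain per observation, formula (7)
f_asym = IG * (N / K)                            # Fasymptotic, formula (10); equals lr / K
f_adj = IG * (N - K - 1) / K                     # Fadjusted, formula (12)

p_asym = stats.f.sf(f_asym, K, N - K - 1)        # treated as F with K and N - K - 1 df
p_adj = stats.f.sf(f_adj, K, N - K - 1)
print(p_chi2, p_asym, p_adj)                     # ordered from most liberal to most conservative
```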
Thus, we have a "menu" of tests of significance: from the most liberal Chi-Square, usually used in PROC LOGISTIC, to the more conservative Fasymptotic, to the even more conservative Fadjusted. We propose to use Fasymptotic and Fadjusted not as a substitute for, but rather as a supplement to, the conventional Chi-Square. The superiority of the Fasymptotic and Fadjusted statistics over the Chi-Square is apparent; to find the extent of this superiority, we need simulation studies. Finally, we would like to comment on the meaning of the Fasymptotic and Fadjusted statistics. It is usually thought that F statistics, in spite of their enormous practical value, are not an estimate of anything meaningful. Equations (10) and (12) assign a meaning in terms of information to both F statistics. Fasymptotic has the meaning of the information contained in all the data per parameter. Fadjusted is the information gain (per parameter) left after estimating the model parameters by using K + 1 degrees of freedom. Thus, we have a bridge between statistical significance in terms of critical values of the F distribution and substantive significance in terms of information.

REFERENCES

1. Shtatland, E. S. & Barton, M. B. (1997). Information as a unifying measure of fit in SAS statistical modeling procedures. NESUG '97 Proceedings, Baltimore, MD, 875-880.
2. Shtatland, E. S. & Barton, M. B. (1998). An information-gain measure of fit in PROC LOGISTIC. SUGI '98 Proceedings, Cary, NC: SAS Institute Inc., 1194-1199.
3. Aldrich, J. H. & Nelson, F. D. (1984). Linear Probability, Logit, and Probit Models. Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-045. Beverly Hills and London: Sage Publications, Inc.
4. Armitage, P. & Berry, G. (1987). Statistical Methods in Medical Research. Oxford: Blackwell Scientific Publications.
5. Anderson, S., Auquier, A., Hauck, W. W., Oakes, D., Vandaele, W., & Weisberg, H. I. (1980). Statistical Methods for Comparative Studies. New York: John Wiley & Sons, Inc.
6. Voit, E. O. & Knapp, R. G. (1997). Derivation of the linear-logistic model and Cox's proportional hazard model from a canonical system description. Statistics in Medicine, 16, 1705-1729.
7. Munro, B. H. & Page, E. B. (1993). Statistical Methods for Health Care Research. Philadelphia, PA: J. B. Lippincott Company.
8. Hosmer, D. W. & Lemeshow, S. (1989). Applied Logistic Regression. New York: John Wiley & Sons, Inc.
9. Altman, D. G. (1991). Practical Statistics for Medical Research. London: Chapman & Hall.
10. Davies, H. T. O., Crombie, I. K., & Tavakoli, M. (1998). When can odds ratios mislead? BMJ, 316, 989-991.
11. Eccles, M., Freemantle, N., & Mason, J. (1998). North of England evidence based guidelines development project: methods of developing guidelines for efficient drug use in primary care. BMJ, 316, 1232-1235.
12. Ingelfinger, J. A., Mosteller, F., Thibodeau, L. A., & Ware, J. H. (1987). Biostatistics in Clinical Medicine. New York: Macmillan Publishing Co., Inc.
13. Sackett, D. L., Haynes, R. B., Guyatt, G. H., & Tugwell, P. (1991). Clinical Epidemiology: A Basic Science for Clinical Medicine. London: Little, Brown and Co.
14. Gelfand, A. E. & Dey, D. K. (1994). Bayesian model choice: asymptotics and exact calculations. Journal of the Royal Statistical Society, Series B, 56, 501-504.
15. Lau, J., Schmid, C. H., & Chalmers, T. C. (1995). Cumulative meta-analysis of clinical trials builds evidence for exemplary medical care. Journal of Clinical Epidemiology, 48, 45-57.
16. Gotzsche, P. C. & Johansen, H. K. (1998). Meta-analysis of short term low dose prednisolone versus placebo and non-steroidal anti-inflammatory drugs in rheumatoid arthritis. BMJ, 316, 811-817.
17. Kent, J. T. (1983). Information gain and a general measure of correlation. Biometrika, 70, 163-173.
18. Korn, E. L. & Graubard, B. I. (1990). Simultaneous testing of regression coefficients with complex survey data: use of Bonferroni t statistics. The American Statistician, 270-276.
19. Fellegi, I. P. (1980). Approximate tests of independence and goodness of fit based on stratified multistage samples. Journal of the American Statistical Association, 75, 261-268.

CONTACT INFORMATION:

Ernest S. Shtatland
Department of Ambulatory Care and Prevention
Harvard Pilgrim Health Care & Harvard Medical School
126 Brookline Avenue, Suite 200
Boston, MA 02215
tel: (617) 421-2671
email: [email protected]

Mary B. Barton, MD, MPP
Department of Ambulatory Care and Prevention
Harvard Medical School & Harvard Pilgrim Health Care
126 Brookline Avenue, Suite 200
Boston, MA 02215
tel: (617) 421-6011
email: [email protected]