1997 Forum - Regression Models for Discrete and Limited Dependent Variables
1997 Research Methods Forum
No. 2 (Summer 1997)
Regression Models for Discrete and Limited Dependent Variables
Michael R. Frone
Research Institute on Addictions, Buffalo, New York

As most members of the Research Methods Division are aware, a diverse set of methodological and statistical issues is discussed on RMNET. A recurring theme over the past several years has been the discussion of regression models for dependent variables that do not conform to the requirements of the classic linear regression model. Besides asking directly about specific regression models, members have raised this issue indirectly. For example, one recent exchange on RMNET focused on the appropriate way to analyze a dependent variable that represented the number of mentors a respondent had. Responses on this variable could assume only four discrete values (0, 1, 2, or 3 = three or more). To address this growing interest in alternatives to the linear regression model, this short article aims to: (1) provide a taxonomy of common nonlinear regression models, (2) share some key references, and (3) acquaint the reader with two statistical software packages.

A Taxonomy of Nonlinear Regression Models
Most management researchers are familiar with the linear regression model estimated via the method of ordinary least squares (OLS). The linear regression model requires no assumptions regarding the measurement of the independent variables. Independent variables can be dichotomous, nominal, ordinal, or continuous. In contrast, the linear regression model requires that the dependent variable be continuous. The values of the dependent variable are assumed to: (1) range from minus infinity to plus infinity, (2) be any real number, and (3) represent constant units of measurement. Moreover, it is assumed that the dependent variable has been measured for all cases in the sample (e.g., Berry, 1993; Long, 1997). In practice, the measurement of many outcome variables does not meet the assumptions of the linear regression model. As shown in Table 1, two general categories of outcomes—discrete and limited—exist that pose problems for linear regression. Before describing these two types of dependent variables in more detail, it is important to point out that the nonlinear regression models presented in Table 1 were developed to overcome several problems encountered when linear (OLS) regression is used to analyze noncontinuous dependent variables. Although the exact set of problems may differ across the various types of outcome variables, the following four problems are common: (1) nonsensical predicted values (i.e., predicted values falling outside the possible range of the outcome), (2) biased regression coefficients, (3) nonnormally distributed error terms, and (4) heteroscedasticity (e.g., Fox, 1991; Greene, 1997; Long, 1997). The first two problems undermine one’s ability to trust predicted values and the direction and size of estimated relations. The last two problems undermine one’s ability to produce unbiased standard errors and to conduct tests of statistical significance. 
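The first of the four problems, nonsensical predicted values, can be illustrated with a minimal sketch. The data below are hypothetical (they do not come from the article), and the helper name `ols_fit` is an assumption introduced only for this illustration. The sketch fits a one-predictor OLS regression to a binary turnover outcome and lists the predicted values that fall outside the 0 to 1 range.

```python
# Hypothetical data (not from the article): years of tenure and turnover
# status (0 = has not quit job, 1 = has quit job).
tenure = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
quit_job = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]


def ols_fit(x, y):
    """Return (intercept, slope) of a one-predictor OLS regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = ols_fit(tenure, quit_job)
predicted = [intercept + slope * t for t in tenure]

# A predicted "probability" of quitting below 0 or above 1 is nonsensical.
out_of_range = [(t, round(p, 3)) for t, p in zip(tenure, predicted)
                if p < 0 or p > 1]
print(out_of_range)
```

With these made-up values, the fitted line exceeds 1 at the shortest tenure and drops below 0 at the longest, which is the kind of prediction that logistic and probit regression are designed to rule out.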
Discrete Outcomes

The first category of dependent variables presented in Table 1 is discrete outcomes. These variables are characterized by the fact that they only assume a finite number of integer values. There are four types of discrete outcomes.

Binary, or dichotomous, outcomes take on only two values. Examples include turnover status (0 = has not quit job, 1 = has quit job) and on-the-job substance use (0 = does not use drugs at work, 1 = uses drugs at work). Binary outcomes can be analyzed via probit regression or logistic regression.

Ordinal outcomes take on three or more values that can be rank ordered. In addition, an assumption of equal-sized intervals between the response options of ordinal outcomes is usually not warranted. An example is employed parents’ importance ratings for various workplace family-supportive programs (e.g., flextime, job sharing), where 1 = not at all important, 2 = somewhat important, 3 = quite important, and 4 = extremely important. Ordinal outcomes can be analyzed via ordered probit regression or ordered logistic regression.

Nominal outcomes assume three or more values that cannot be rank ordered. An example is retirement status, where 1 = currently working, 2 = voluntarily retired, 3 = retired because of company policy, and 4 = retired because of illness. A nominal outcome can be analyzed via multinomial logistic regression. In a multinomial logistic regression, K-1 dichotomous outcomes are analyzed simultaneously, where K is equal to the number of values in the nominal outcome variable. In the present example, the three dichotomies might respectively compare groups 2, 3, and 4 to group 1. This would be useful if one wanted to test the general hypothesis that different types of retirement have different causal antecedents.
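The K-1 contrasts of a multinomial logistic regression can be sketched as follows. This is only an illustration under stated assumptions: the function name `category_probabilities`, the single predictor, and the coefficient values are all hypothetical, and group 1 (currently working) is taken as the reference category. Each of the three (intercept, slope) pairs describes the log odds of one retirement group relative to the reference group, and the four category probabilities follow from those contrasts.

```python
import math


def category_probabilities(x, contrasts):
    """Return K probabilities from K-1 (intercept, slope) contrasts.

    The reference category (listed first) has its linear predictor fixed
    at 0; each contrast gives the log odds of one category relative to it.
    """
    scores = [0.0] + [a + b * x for a, b in contrasts]
    largest = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - largest) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical coefficients for groups 2, 3, and 4, each compared to group 1.
contrasts = [(-2.0, 0.05), (-3.0, 0.04), (-2.5, 0.01)]
probs = category_probabilities(60.0, contrasts)  # e.g., a 60-year-old
print([round(p, 3) for p in probs])
```

Because every contrast has its own coefficients, a predictor can raise the odds of one type of retirement while leaving another unchanged, which is what allows a test of whether different types of retirement have different antecedents.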
Table 1
Taxonomy of Nonlinear Regression Models

Type of Dependent Variable    Type of Regression Model
--------------------------    -------------------------------
Discrete
  Binary                      Logistic Regression
                              Probit Regression
  Ordinal                     Ordered Logistic Regression
                              Ordered Probit Regression
  Nominal                     Multinomial Logistic Regression
  Count                       Poisson Regression
                              Negative Binomial Regression
Limited
  Censored                    Tobit Regression
  Truncated                   Truncated Regression
  Sample Selected             Sample Selected Regression
Count outcomes typically take on three or more values that represent the number of times some event occurred during a given interval of time. An example is the number of days an individual was absent from work during the past six months. Count variables are characterized by the fact that most individuals have a score of zero and that the proportion of individuals with a specific positive value decreases as the value of the count increases. A count outcome can be analyzed with Poisson regression. However, an assumption of Poisson regression is that the mean and variance of the count variable take on the same value. In other words, E(y) = Var(y). In practice, it is quite common for Var(y) to be larger than E(y), which is called overdispersion. If a count variable is overdispersed, Poisson regression underestimates the standard errors for the predictor variables. When overdispersion is evident, count variables can be analyzed via negative binomial regression.
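A rough descriptive check of the E(y) = Var(y) assumption can be sketched as below. The absence counts are hypothetical, and the name `dispersion_ratio` is introduced only for this illustration. Comparing the raw mean and variance is a first look rather than a formal test, since the Poisson assumption strictly concerns the variance conditional on the predictors.

```python
import statistics

# Hypothetical counts (not from the article): days absent from work in the
# past six months for 12 employees. Most scores are zero and larger counts
# are progressively rarer.
days_absent = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 5, 12]

mean_absences = statistics.mean(days_absent)
var_absences = statistics.variance(days_absent)  # sample variance (n - 1)

# Under a Poisson distribution this ratio should be close to 1; values
# well above 1 suggest overdispersion.
dispersion_ratio = var_absences / mean_absences
print(mean_absences, round(var_absences, 2), round(dispersion_ratio, 2))
```

In this made-up sample the variance is several times the mean, the pattern under which negative binomial regression would usually be considered in place of Poisson regression.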
Limited Outcomes

The second category of dependent variables posing problems for the linear regression model is limited outcome variables. Limited outcomes are continuous variables characterized by the fact that their observed values do not cover their entire range. There are three types of limited outcome variables.

Censored outcomes are those where observations are clustered at a lower threshold (left censored), an upper threshold (right censored), or both. The thresholds may be naturally occurring as with the minimum wage when predicting hourly wage rates. The thresholds also may be artifacts of measurement as when annual income is measured in dollars yet the lowest allowable value is $20,000 or less and the highest allowable value is $75,000 or more. Censored dependent variables can be analyzed using tobit regression.

Truncated outcomes are those where observations are not sampled at the lower range of values (left truncation), upper range of values (right truncation), or both. In other words, unlike the censored outcome where respondents are collapsed into an upper or lower category, the truncated outcome results when respondents at the lower or upper range of values are excluded from the sample. For example, in the case of annual income, a sample of employed adults may have excluded all respondents who make less than $20,000 and who make more than $75,000. Truncated dependent
variables can be analyzed with truncated regression. One goal of the truncated regression model is to estimate the relation between a predictor, say education, and the truncated outcome, say income, in the population where all values of the dependent variable exist. Even if one is only interested in the relation between education and income in the
subsample of individuals who earn from $20,000 to $75,000, truncated regression will produce slopes and standard errors that are less biased than those obtained from OLS regression. Sample selected outcomes refer to the situation where responses to a continuous variable (Y) are conditional on a dichotomous variable (Z). Consider the example where a researcher wants to study the work-related correlates of
alcohol consumption. One could argue that an alcohol use variable that ranges from zero to some large positive value is confounding two distinct variables. The first variable (Z) is a dichotomy and represents the decision to drink alcohol (0 = nondrinker, 1 = drinker). The second variable (Y) is continuous and represents the amount of alcohol consumed among drinkers. In a sample selection regression model, one first would estimate a probit regression for the dichotomous variable Z to examine the predictors of the decision to drink and to estimate the probability of being a drinker. Then one would estimate a linear regression for the continuous variable Y among drinkers (i.e., when Z = 1). Among the substantive predictors would be a variable derived from the probit regression that reflects each person's predicted probability of being a drinker (formally, the inverse Mills ratio). Thus, the linear regression would provide a test of the predictors of alcohol consumption conditional on being a drinker. In other words, the linear regression estimates are adjusted for the fact that being a drinker is the outcome of a nonrandom selection process. However, the standard errors in the linear regression analysis for drinkers need to be adjusted for heteroscedasticity and for measurement error in the selection coefficient. In sophisticated statistical software, maximum likelihood estimation is used to estimate the probit and linear equations.

In summary, Table 1 provides a taxonomy of the basic types of regression models for discrete and limited outcome variables. However, many hybrid models are being developed. For example, sample selection has been applied to
ordinal and count outcomes. Moreover, count outcomes can be estimated while taking left and right censoring or left and right truncation into account. To return to the example provided in the first paragraph, a count of mentors that takes on the values 0, 1, 2, and 3 can be analyzed with either right-censored Poisson regression or right-censored negative
binomial regression if overdispersion is evident.

Reference Material on Nonlinear Regression Models

Until recently, information on the regression models outlined above has been scattered across many statistically oriented journals and has been presented in a way that is accessible only to the mathematically sophisticated researcher. Although still challenging, various resources have been published recently that are more accessible to the average researcher. A comprehensive treatment of all of the discrete and limited dependent variable regression models presented in Table 1 can be found in Long (1997) and Greene (1997). In addition, Liao (1994) provides a detailed discussion of the use and interpretation of regression models for discrete dependent variables. Likewise, Breen (1996) presents a detailed discussion of regression models for limited dependent variables. Although more limited in scope, DeMaris (1995) and Menard (1995) provide very useful introductions to logistic regression for binary, nominal, and
ordinal outcomes, and Gardner, Mulvey, and Shaw (1995) provide an excellent discussion of models for the analysis of count outcomes.

Software to Estimate Nonlinear Regression Models

Software to analyze dichotomous outcomes is available in all widely used statistical packages (e.g., SPSS, STATISTICA, or SAS). However, a comprehensive statistical package that can estimate all of the regression models presented in Table 1, and many hybrid models, is an econometrics program called LIMDEP Version 7 (Greene, 1995). Currently, LIMDEP is a DOS program, with a forthcoming Windows version. More information on LIMDEP and the full manual are available on the World Wide Web at http://econwpa.wustl.edu/limdep. Another software package that can estimate all of the models in Table 1, except for truncated regression, is Stata Version 5 (Stata Corporation, 1997). More information on Stata can be found at http://www.stata.com. LIMDEP can estimate a wider variety of hybrid models than Stata, whereas Stata is available for a wider set of computer platforms than LIMDEP.

Conclusion

This brief article introduced a set of regression models for discrete and limited outcome variables developed to address problems associated with using the classic linear regression model. However, this should not be interpreted to mean that these alternative regression models come with no costs attached to their use. These alternative regression models come with their own statistical assumptions and limiting conditions, which should be studied carefully. Nonetheless, they represent an important set of analytic tools that are readily available to management researchers.

References

Berry, W. D. (1993). Understanding regression assumptions. Thousand Oaks, CA: Sage.
Breen, R. (1996). Regression models: Censored, sample selected, or truncated data. Thousand Oaks, CA: Sage.
DeMaris, A. (1995). A tutorial in logistic regression. Journal of Marriage and the Family, 57, 956-968.
Fox, J. (1991). Regression diagnostics. Thousand Oaks, CA: Sage.
Gardner, W., Mulvey, E. P., & Shaw, E. C. (1995). Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin, 118, 392-404.
Greene, W. H. (1995). LIMDEP: User's manual (Version 7.0). Bellport, NY: Econometric Software.
Greene, W. H. (1997). Econometric analysis (3rd ed.). New York: Prentice Hall.
Liao, T. F. (1994). Interpreting probability models: Logit, probit, and other generalized linear models. Thousand Oaks, CA: Sage.
Long, J. S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage.
Menard, S. (1995). Applied logistic regression analysis. Thousand Oaks, CA: Sage.
Stata Corporation. (1997). Stata statistical software (Release 5.0). College Station, TX: Stata Corp.

Copyright © [Research Methods Division, Academy of Management]. All rights reserved. This page maintained by RM Webmasters, Dave Harrison, University of Texas, Arlington, and Martin Evans, Joseph L. Rotman School of Management, University of Toronto. Last updated on 05/09/99.
http://division.aomonline.org/rm/1997_forum_regression_models.html[5/6/2014 11:20:14 AM]