Back to Basics: Regression Assumptions

Keith Chamberlain
[email protected]
26 Jun 2017

The purpose of my investigation was to check whether non-constant variance had an impact on my calibration curve or predictions. The basic answer I've gotten back so far is that it depends on whether or not I'm making inferences. In this update, I'll review the basic assumptions of regression. After that, I'll share what I found in the literature about heteroskedasticity in calibrations. Cheers!

Regression assumptions turn out to be difficult to pin down online, and the information is fraught with error. Everyone has a different list, and few provide evidence. StatisticsSolutions.com (2017) says there are five key assumptions for regression analysis: "linear relationship, multivariate normality, no or little multicollinearity, no autocorrelation, homoscedasticity." Specifically, they claim that "linear regression needs the relationship between the independent and dependent variables to be linear." StatisticsSolutions.com may mean something different than what I think they mean by linearity, but I interpreted their use of the term to mean strictly linear, that is, a straight line.

Nau (2017) claims that there are four assumptions of regression, not five: linearity, independence, homoscedasticity, and normality. Regarding linearity specifically, Nau claims: (a) the expected value of the dependent variable is a straight-line function of each independent variable, holding the others fixed; and (b) the slope of that line does not depend on the values of the other variables. Nau is clearer on linearity: a straight-line function is used. However, curvilinear relationships between the Xs and Ys are not a problem for linear regression if the model is expressed in the following form:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_n X_{in} + \varepsilon_i$$

where the X variables can be non-independent combinations of other X variables. Unlike Nau's criterion (b) for linearity above, in an interaction the slope of an X variable may depend on the values of other variables in the model, and since interactions are just combinations of other X variables, they are still not a problem. The term linear model refers to the parameters being linear, not necessarily the variables (Neter, Wasserman, & Kutner, 1983, p. 234).

Prabhakaran (2016) says there are 10 assumptions of linear regression, which I'm not going to bother to list, because hey, I was okay with five, but not 10. Instead of searching online, I'm going to go with something fresh: the assumptions laid out in a book that isn't laden with statistical jargon. Judd, McClelland & Ryan (2017, pp. 37-38) consider four assumptions for regression: that the errors from a model fit (not the data or variables) are normally distributed, independent, identically distributed, and have a mean of zero (the errors are unbiased). Errors being unbiased covers the case where curvature terms are needed in the model to fix the residuals, which regression can still handle. Errors being identically distributed covers heterogeneity of variance and extreme values. Later, when introducing multiple regression, the authors raise the problem of redundancy among predictors (p. 104), but as far as I can tell it is not stated as an assumption, because the authors outline a clear method of dealing with redundancy. So according to these authors, there are four key assumptions of regression, and they all concern the errors generated from the difference between the model fit and the data points.
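To make the "linear in the parameters" point concrete, here is a minimal R sketch; the data and variable names are my own toy example, not from any of the sources above:

    # Minimal sketch (toy data): "linear" refers to the parameters, not the predictors,
    # so curvature and interaction terms are perfectly fine in a linear model.
    set.seed(1)
    x1 <- runif(50, 0, 10)
    x2 <- runif(50, 0, 10)
    y  <- 2 + 1.5 * x1 - 0.2 * x1^2 + 0.5 * x1 * x2 + rnorm(50)

    # I(x1^2) and x1:x2 are just additional X columns; the model is still linear
    # in the coefficients, so lm() handles it without complaint.
    fit <- lm(y ~ x1 + x2 + I(x1^2) + x1:x2)
    summary(fit)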
The issue, then, is whether heteroscedasticity affects calibration errors in a way that needs to be dealt with. In calibration for analytical chemistry, we typically fit a line that expresses the relationship between instrument responses (e.g., the area of a peak in a chromatography experiment) and different levels of a stimulus (e.g., concentration or amount). There are many analytical procedures for which the equal-variance assumption is not met. Methods of analytical testing that have this problem include all methods based on counting measurements, including photometric and chromatographic analyses under certain conditions (Schwartz, 1979).

Schwartz (1979, p. 727) says that if the nonuniform variance is not extreme, heteroscedasticity is likely to have little impact, but the difference between heteroscedastic and homoscedastic curve confidence intervals is pronounced, so "an analyst wishing to avoid the labor of a heteroscedastic treatment will sacrifice statistical reliability by ignoring variance nonuniformity," while getting roughly the same results from the heteroscedastic and homoscedastic treatments if samples fall near the center of the curve. Thus, there is potentially little impact of heteroscedasticity on the regression coefficients themselves except at the edges, but confidence limits and statistical tests will be impaired. Heteroscedasticity is a particular problem for estimating limits of detection (Desimoni & Brunetti, 2009), because the limits of detection fall at or beyond the edges of the curve, where confidence intervals are affected most.

Palta (2003, p. 63) indicates that the OLS estimator is unbiased even in the case of heteroscedasticity, as long as the correct model is specified. The caveat "as long as the correct model is specified" might be important, because Spiegelman, Logan & Grove (2011) show, in a reanalysis of data from a previously published study, that the uncorrected OLS estimate of β1 under y = x, y = x², and y = 1 was biased in the case of moderate heteroscedasticity. Specifically, when y = x, the uncorrected estimator underestimated β1; when y = x², β1 was overestimated (Spiegelman, Logan, & Grove, 2011, p. 5). No description of the magnitude of the over- or underestimation was given. I suggest that these models were incorrectly specified, because they were not in the form of a multiple linear regression that included all lower-order terms. Unfortunately, Spiegelman, Logan & Grove (2011) did not show model diagnostics for this analysis, so readers cannot tell whether the correct models were specified. In contrast to reporting that uncorrected OLS was biased, the authors reported, in the abstract and discussion, that there is little difference between uncorrected and corrected least-squares parameter estimates under moderate heteroscedasticity. Their conclusions thus seem to be in conflict with one of their analyses.

Long & Ervin (2000) discuss the use of heteroscedasticity-consistent standard errors and use Monte Carlo simulations to show how the HC3 covariance matrix outperforms the standard HC0 covariance matrix for N < 250. The authors state that, in the case of heteroscedasticity, "OLS estimates are unbiased, but the usual tests of significance are inappropriate" (2000, p. 217). The estimators become inefficient even though they are still regarded as unbiased. Inefficiency just means that the estimator does not estimate in the best possible manner.
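As a quick, hedged sanity check on the "unbiased but with invalid tests" claim, here is a small R simulation of my own (not from any of the cited papers): the error spread grows with x, the average fitted slope stays near its true value, but the usual OLS standard error misstates the slope's actual sampling variability.

    # Toy simulation: heteroscedastic errors around a true line with slope 2.
    set.seed(42)
    reps <- 2000
    slopes <- nominal_se <- numeric(reps)
    for (i in seq_len(reps)) {
      x <- runif(100, 1, 10)
      y <- 1 + 2 * x + rnorm(100, sd = 0.5 * x)       # spread grows with x
      fit <- lm(y ~ x)
      slopes[i]     <- coef(fit)[2]
      nominal_se[i] <- summary(fit)$coefficients[2, 2]
    }
    mean(slopes)      # close to 2: the slope estimate is essentially unbiased
    sd(slopes)        # the slope's actual sampling variability
    mean(nominal_se)  # the usual OLS SE, which misstates that variability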
Commonly, weighting of the y-axis (Danzer, 2007, p. 138; Long & Ervin, 2000, p. 217), less frequently the x-axis (Long & Ervin, 2000, p. 217), or both (Long & Ervin, 2000, p. 217) is used to compensate for heteroscedasticity when its form is known. Weighting becomes impractical when the form of the heteroscedasticity is unknown (Long & Ervin, 2000, p. 217), so other techniques are needed. Using transforms is an option (Box & Cox, 1964; Yeo & Johnson, 2000; Judd, McClelland, & Ryan, 2017, pp. 333-338). Transforms may result in scales that are difficult to interpret, and if the real phenomenon has heteroscedastic error around the correct functional form, transforms may be inappropriate (Long & Ervin, 2000, p. 217).
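When the form of the variance is known, weighting in R is a one-liner. The sketch below assumes, purely for illustration, a standard deviation proportional to concentration (hence 1/conc² weights); the calibration data are simulated stand-ins, not my GC/FID data:

    # Sketch of weighted least squares for a known variance form (assumed here:
    # SD proportional to conc, hence weights of 1/conc^2). Simulated data only.
    set.seed(3)
    cal <- data.frame(conc = rep(c(1, 2, 5, 10, 20, 50, 100), each = 3))
    cal$area <- 50 * cal$conc + rnorm(nrow(cal), sd = 2 * cal$conc)

    fit_wls <- lm(area ~ conc, data = cal, weights = 1 / conc^2)
    summary(fit_wls)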
While the literature tends to point out transforms of the log or square-root variety, it is important to understand that a whole range of powers (and other functions) can be tried to find an appropriate transform; one is not restricted to log, square-root, or Box-Cox transforms (Judd, McClelland, & Ryan, 2017, pp. 333-338). The technique of using a heteroscedasticity-consistent covariance matrix, or HCCM (Long & Ervin, 2000; White, 1980; Cribari-Neto, 2004), can adjust the affected standard errors when dealing with heteroscedasticity of unknown form, so that statistical tests and intervals are again appropriate. Of the available variants, the one called HC3 is recommended (Long & Ervin, 2000; MacKinnon & White, 1985). Finally, to round things out, there is an iteratively reweighted least squares (IRWLS) approach that can be used for heteroscedasticity of unknown form (Kutner, Nachtsheim, & Neter, 2004, pp. 424-426).

To summarize: if the calibration errors are skewed, or if the correct model is not specified, then parameter estimates under heteroscedasticity may also be biased. Under the typical assumptions (heteroscedasticity aside), the OLS estimator is still supposed to be unbiased, though no longer the most efficient. I still do not have a strong sense of what a lack of efficiency means, other than that the estimates do not reduce the error the most among all estimators; I will have to look into this further. For analytical chemists doing chromatography, it seems that inverse calibration is not supposed to be affected in the general case of heteroscedasticity, as long as the peaks of the standards and samples being compared fall near the center of the curve, but the other assumptions have to be checked, at least during validation, to ensure there is actually no bias present and that the correct models are specified. Of course, one cannot know whether a better model is available when only the concentration of an analyte, its area counts, and perhaps the concentration of an internal standard and its area counts are collected.

Kutner, Nachtsheim & Neter (2004, p. 426) state that for IRWLS it typically takes just a couple of iterations to obtain weights that stabilize the variance under heteroscedasticity. I have gas chromatography/flame ionization detector data from a partial validation with 70 calibration points (ten replicates of seven levels) that took 17 iterations of an IRWLS function to converge, and the residuals still looked conical. Among the models tested (OLS unadjusted; one robust technique, lmrob(), from the robustbase package in R; the hccm() technique in package car, which adjusts the SEs, not the fitted coefficients or residuals; quantile regression set to fit the median instead of the mean, rq() in package quantreg; and another robust regression from package MASS, rlm()), the normalized slopes had a difference range of just over half a percent (note that there is no difference between the OLS and HCCM-adjusted coefficient estimates, since HCCM only changes the standard errors), the normalized intercepts 9%, and the normalized squared term 22%. The normalized SEs had a difference range of 79% (intercept), 55% (slope), and 71% (squared term). So far, the only thing that has stabilized the residuals is a transform of the power of 0.3 on both axes. With that transform, the fitting technique did not matter and the residuals looked superb.
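For reference, the comparison of fitting techniques on the untransformed data described above looks roughly like the sketch below. The functions are the ones named in the text (lm(), MASS::rlm(), robustbase::lmrob(), quantreg::rq(), car::hccm()); the data frame cal is a simulated stand-in for the 70-point calibration set, so the numbers it produces mean nothing:

    # Simulated stand-in data: 7 levels x 10 replicates with a conical spread.
    set.seed(7)
    cal <- data.frame(conc = rep(c(1, 2, 5, 10, 25, 50, 100), each = 10))
    cal$area <- 30 * cal$conc + 0.05 * cal$conc^2 + rnorm(nrow(cal), sd = 0.8 * cal$conc)

    library(MASS)        # rlm()
    library(robustbase)  # lmrob()
    library(quantreg)    # rq()
    library(car)         # hccm()

    form <- area ~ conc + I(conc^2)
    fit_ols   <- lm(form, data = cal)
    fit_rlm   <- rlm(form, data = cal)            # robust regression (MASS)
    fit_lmrob <- lmrob(form, data = cal)          # robust regression (robustbase)
    fit_rq    <- rq(form, tau = 0.5, data = cal)  # median (quantile) regression

    # HC3 adjusts the standard errors of the OLS fit; the coefficients are untouched.
    se_ols <- sqrt(diag(vcov(fit_ols)))
    se_hc3 <- sqrt(diag(hccm(fit_ols, type = "hc3")))

    rbind(ols = coef(fit_ols), rlm = coef(fit_rlm),
          lmrob = coef(fit_lmrob), rq = coef(fit_rq))   # compare coefficients
    cbind(se_ols, se_hc3)                               # compare standard errors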
On the transformed scale, the models all gave almost identical results for the normalized parameter estimates as measured by their differences (< 1.5% for the intercept, within 2 ppm for the slope and 10 ppm for the squared term), and the SEs were also very close (< 1% for the intercept, within 250 ppm for the slope and squared terms). The transformed fits can, of course, be inverse-transformed during inverse calibration to yield correct concentrations. I cannot tell which parameter estimates are biased and which are not when fitting the non-transformed data, but all of the robust and quantile regression fits produced lines that were above the untransformed OLS fit. If the OLS fit is unbiased for these data, then the robust routines and the quantile regression routine show evident bias, especially lmrob() in the fitted second-order term. These routines may be more sensitive to the spacing of the x-levels, which were not equally spaced; in the transformed data, the levels were approximately equally spaced. If the robust techniques and quantile regression are considered nonparametric, then perhaps the heterogeneous variance was causing a problem for their fits, too. Perhaps there were other model validation issues that I was not seeing and that the robust techniques depend on more strongly.
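To show concretely what I mean by fitting on transformed axes and then inverse-transforming during inverse calibration, here is a hedged sketch. The 0.3 power is the one I found for my data; the simulated points, the quadratic model form, and the use of uniroot() to invert the fitted curve are illustrative choices of mine:

    # Same simulated stand-in calibration data as in the earlier sketch.
    set.seed(7)
    cal <- data.frame(conc = rep(c(1, 2, 5, 10, 25, 50, 100), each = 10))
    cal$area <- 30 * cal$conc + 0.05 * cal$conc^2 + rnorm(nrow(cal), sd = 0.8 * cal$conc)

    # Fit with both axes raised to the 0.3 power (quadratic in the transformed x).
    p <- 0.3
    fit_t <- lm(I(area^p) ~ I(conc^p) + I(conc^(2 * p)), data = cal)

    # Inverse calibration for a new peak area: find the concentration whose
    # predicted transformed response matches the transformed observed area.
    area_new <- 1500
    f <- function(conc) {
      predict(fit_t, newdata = data.frame(conc = conc)) - area_new^p
    }
    conc_est <- uniroot(f, lower = min(cal$conc), upper = max(cal$conc))$root
    conc_est   # estimated concentration, already back on the original scale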
Figure 1: Overlay of several different regressions for the data, with the non-robust (OLS) and inverse-transformed prediction intervals. The purple fit (a robust regression from robustbase::lmrob()) clearly had a biased squared term. The inverse-transformed prediction interval appears to adjust for the heterogeneity of variance, while the prediction interval from the regular OLS regression does not.
Figure 2: Transformed calibration data fitted with the various techniques. Both axes were raised to the power of 0.3. Parameter estimates for each model, normalized by the mean of all the parameter estimates of the same type, here have a range of less than 10 ppm for the slope and squared terms, whereas in the untransformed data the range was as high as 22%.
While much of the literature claims that heterogeneity of variance is not likely to be a problem for inverse calibration in the typical analytical setting, that claim assumes that only the center of the curve is being used. In real analytical chemistry settings, the whole range of the calibration curve is used, and the prediction intervals from an unweighted fit near the low end of the curve, where the calibration points are tightest, will prove too wide to be useful for inverse calibration and limits of detection. If validation procedures can be fortified to include enough injections to assess heteroscedasticity and non-normality, then a suitable transform can be derived for the assay and used in routine calibrations going forward. Note that instrument calibration software does not supply routines suitable for handling the extent of heteroscedasticity I found, and the routines it does supply do not operate on both axes. Transforming the axis values with a power transform before entering them into the quantitation software, and reverse-transforming the reported result, are trivial steps for an analyst to perform routinely.
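For completeness, the "trivial steps" amount to nothing more than the following two helpers (the 0.3 power is just my current value; whatever power a given assay's validation supports would go in its place):

    # Trivial bookkeeping around external quantitation software (power assumed 0.3).
    to_fit_scale   <- function(v, p = 0.3) v^p        # applied to conc and area before building the curve
    from_fit_scale <- function(v, p = 0.3) v^(1 / p)  # applied to the reported result afterwards

    to_fit_scale(c(1, 10, 100))       # values entered into the software
    from_fit_scale(to_fit_scale(25))  # round-trips back to 25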
Bibliography

Box, G. E., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), 26(2), 211-252.
Chamberlain, K. A. (2017, June 17). 70 point Acetic acid curve by GC/FID. Mendeley Data (v1). doi:10.17632/dwf4ddww3w.1
Cribari-Neto, F. (2004). Asymptotic inference under heteroskedasticity of unknown form. Computational Statistics and Data Analysis, 54, 215-233.
Danzer, K. (2007). Analytical Chemistry: Theoretical and Metrological Fundamentals. Berlin: Springer-Verlag.
Desimoni, E., & Brunetti, B. (2009). About estimating the limit of detection of heteroscedastic analytical systems. Analytica Chimica Acta, 655(1-2), 30-37.
Judd, C. M., McClelland, G. H., & Ryan, C. S. (2017). Data Analysis: A Model Comparison Approach to Regression, ANOVA, and Beyond (3rd ed.). New York, NY: Routledge.
Kutner, M. H., Nachtsheim, C. J., & Neter, J. (2004). Applied Linear Regression Models (4th ed.). New York, NY: McGraw-Hill/Irwin.
Long, J., & Ervin, L. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician, 54(3), 217-224.
MacKinnon, J. G., & White, H. (1985). Some heteroscedasticity consistent matrix estimators with improved finite sample properties. Journal of Econometrics, 29, 53-57.
Nau, R. F. (2017). Regression diagnostics: testing the assumptions of linear regression. Retrieved June 12, 2017, from Statistical forecasting: notes on regression and time series analysis: http://people.duke.edu/~rnau/testing.htm
Neter, J., Wasserman, W., & Kutner, M. H. (1983). Applied Linear Regression Models. Homewood, IL: Richard D. Irwin, Inc.
Palta, M. (2003). Dealing with unequal variance around the regression line. In M. Palta, Quantitative Methods in Population Health (pp. 62-96). John Wiley & Sons, Inc.
Prabhakaran, S. (2016). Assumptions of linear regression. Retrieved from http://rstatistics.co/Assumptions-of-Linear-Regression.html
Schwartz, L. (1979). Calibration curves with nonuniform variance. Analytical Chemistry, 51(6), 723-727.
Spiegelman, D., Logan, R., & Grove, D. (2011). Regression calibration with heteroscedastic error variance. The International Journal of Biostatistics, 7(1), 1-34.
StatisticsSolutions.com. (2017). Assumptions of linear regression. Retrieved from http://www.statisticssolutions.com/assumptions-of-linear-regression/
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817-838.
Yeo, I.-K., & Johnson, R. A. (2000). A new family of power transformations to improve normality or symmetry. Biometrika, 87(4), 954-959.